Overbooking Analytics: Exploratory and Inferential Analysis of a Nigerian Domestic Airline’s Flight Operations (2025)

Author

Ogechi Agu-Uzochukwu

Published

May 18, 2026


1 Executive Summary

A Nigerian domestic airline practises deliberate overbooking — selling more seats than aircraft capacity — to offset revenue losses from passenger no-shows. The core trade-off is precise: flying with empty seats costs the airline lost fare revenue; denying a passenger boarding costs compensation, regulatory liability, and brand damage. Neither extreme is acceptable. The recommended overbooking threshold sits at the point where no-show absorption and overbook exposure are in balance.

This study analyses 2,153 domestic flight departures across the full calendar year 2025, covering 16 routes between seven airports: Abuja (ABV), Lagos (LOS), Port Harcourt (PHC), Benin (BNI), Jos (JOS), Asaba (ABB), and Warri (QRW). Three aircraft types are operated — 70-seat, 146-seat, and 149-seat — each with distinct booking dynamics. Five analytical techniques (EDA, Visualisation, Hypothesis Testing, Correlation, and Linear Regression) are applied to determine the structural drivers of overbooking efficiency and passenger compensation.

Key findings: (i) the airline currently overbooks at a mean rate of 3.85 seats per flight, but no-shows average 5.56, meaning the overbook buffer is consistently absorbed on most flights; (ii) on 294 flights (13.7% of all departures), overbooking exceeded no-shows, triggering passenger compensation; (iii) 625 flights (29%) still carried empty seats despite overbooking, indicating the current overbook factor is below the no-show mean on many routes; (iv) aircraft type and month are significantly associated with compensation event frequency, based on non-parametric and chi-squared hypothesis testing.

Recommendation: The airline should adopt a route × aircraft-type × month overbook rate matrix where the target overbook count is set at the 80th percentile of the historical no-show distribution for that cell. The 80th percentile is chosen because it represents a deliberate asymmetry: the cost of one denied boarding (compensation, regulatory liability, and reputational damage) materially exceeds the cost of one empty seat (foregone marginal revenue only), so the threshold is set above the median to absorb no-shows on most departures while accepting a controlled residual risk of 20% rather than 50%. This eliminates empty-seat losses on 80% of departures while accepting a controlled compensation event rate of approximately 20%.


2 Professional Disclosure

Role: Airport Operations Manager at a Nigerian domestic airline — responsible for station operations, ensuring safe and on-time flight departures, customer service efficiency, and passenger satisfaction. The role controls station operational costs, manages flight delays, cancellations, and operational disruptions, resolves customer complaints, and oversees the ground operations team.

Organisation: A Nigerian domestic airline operating scheduled passenger services across seven airports in the domestic aviation sector. The organisation’s core business is time-critical passenger movement on high-frequency short-haul routes, where seat yield, on-time performance, and passenger experience are the primary operational and commercial metrics.

Technique relevance to daily work:

# Technique Why it matters in this role
1 EDA I review daily departure data to identify patterns in overbooking, no-shows, and compensation events. EDA surfaces distributional anomalies — e.g., atypically high overbook counts on specific routes — that would be buried in weekly summary reports
2 Visualisation Station performance dashboards are presented to the Head of Ground Operations every Monday. Clear visualisations of the overbook-vs-no-show balance and compensation frequency translate operational data into actions that non-technical managers can implement immediately
3 Hypothesis Testing When compensation incidents spike on a route, I need statistical evidence to determine whether the pattern is a genuine policy failure or random variation before escalating a formal policy-change request to Revenue Management
4 Correlation When compensation events spike, I need to identify which variable is the dominant driver before escalating. If overbook count drives compensation, the case goes to Revenue Management to revise the booking limit. If no-show variability drives it, the case goes to Revenue Management’s demand forecasting function to recalibrate the no-show model. Correlation analysis tells me which argument to make and to whom
5 Linear Regression Every compensation event I manage at the gate generates costs I am accountable for: passenger re-accommodation, delay, complaints, and regulatory exposure. When I present a policy-change request to Revenue Management, anecdote is insufficient — I need a quantified argument. A regression model that relates compensation frequency to the overbook rate, aircraft type, and month gives me that argument: the coefficients show exactly how many additional compensation events each extra oversold seat produces, by aircraft type and by season. Table 12 and Figure 5 express this as a concrete overbook-rate matrix that Revenue Management can act on directly

3 Data Collection & Sampling

Source: Airline departure control system (DCS).

Collection method: Extracted DCS data for the period and saved it in Excel format.

Sampling frame: All domestic scheduled departures in calendar year 2025, across seven airports and 16 origin–destination routes.

Sample size: 2,153 flight departures, 29 unique flight numbers, 3 aircraft types.

Time period: 1 January 2025 – 31 December 2025 (full calendar year; no gaps).

Variables collected (12 source + 5 derived):

Code
var_info <- tibble(
  Variable = c("Flt No","Dep Port","Arr Port","Route","Sch Dep Dt","Months",
               "Capacity","Seat Sold","Overbook","Flown","NoShow","Compensate",
               "— empty_seats","— ob_rate","— ns_rate","— net_exposure","— denied_flag"),
  Source   = c(rep("Source",12), rep("Derived",5)),
  Type     = c("numeric","categorical","categorical","categorical","date","ordinal",
               "numeric","numeric","numeric","numeric","numeric","numeric",
               "numeric","numeric","numeric","numeric","binary"),
  Role     = c(
    "Flight identifier",
    "Departure airport (categorical predictor)",
    "Arrival airport (categorical predictor)",
    "Origin–destination pair (categorical predictor)",
    "Scheduled departure date (time index)",
    "Calendar month 1–12 (ordinal predictor)",
    "Aircraft seat capacity: 70, 146, or 149 (binary predictor)",
    "Total seats sold per departure",
    "Oversold seats per departure (primary policy variable)",
    "Actual passengers who boarded (≤ Capacity)",
    "Passengers who did not board (key buffer variable)",
    "Passengers compensated for denied boarding (primary outcome)",
    "Capacity − Flown: empty seats on departure (lost-revenue metric)",
    "Overbook / Capacity: overbooking rate (normalised policy variable)",
    "NoShow / Seat Sold: no-show rate per departure",
    "Overbook − NoShow: >0 triggers compensation",
    "Compensate > 0 indicator (Yes/No)"
  )
)

var_info |>
  kable(
    caption   = "Table 1: Variable Inventory — Overbook_Jay_new.csv",
    col.names = c("Variable", "Source", "Type", "Role"),
    align     = c("l","c","l","l")
  ) |>
  kable_styling(bootstrap_options = c("striped","hover","condensed"),
                full_width = TRUE, font_size = 13) |>
  row_spec(0, bold = TRUE, background = "#F5F5F5") |>
  row_spec(13:17, italic = TRUE, color = "grey40") |>
  column_spec(1, bold = TRUE, width = "12em") |>
  column_spec(2, width = "6em") |>
  column_spec(3, width = "8em")
Table 1: Variable Inventory — Overbook_Jay_new.csv
Variable Source Type Role
Flt No Source numeric Flight identifier
Dep Port Source categorical Departure airport (categorical predictor)
Arr Port Source categorical Arrival airport (categorical predictor)
Route Source categorical Origin–destination pair (categorical predictor)
Sch Dep Dt Source date Scheduled departure date (time index)
Months Source ordinal Calendar month 1–12 (ordinal predictor)
Capacity Source numeric Aircraft seat capacity: 70, 146, or 149 (binary predictor)
Seat Sold Source numeric Total seats sold per departure
Overbook Source numeric Oversold seats per departure (primary policy variable)
Flown Source numeric Actual passengers who boarded (≤ Capacity)
NoShow Source numeric Passengers who did not board (key buffer variable)
Compensate Source numeric Passengers compensated for denied boarding (primary outcome)
— empty_seats Derived numeric Capacity − Flown: empty seats on departure (lost-revenue metric)
— ob_rate Derived numeric Overbook / Capacity: overbooking rate (normalised policy variable)
— ns_rate Derived numeric NoShow / Seat Sold: no-show rate per departure
— net_exposure Derived numeric Overbook − NoShow: >0 triggers compensation
— denied_flag Derived binary Compensate > 0 indicator (Yes/No)

Data loading:

Code
# Source file: Overbook_Jay_new.xlsb extracted from DCS and saved as Excel,
# then converted to CSV using pyxlsb before loading into R
df_raw <- read_csv(
  "Overbook_Jay_new.csv",
  col_types = cols(
    `Flt No`     = col_double(),
    `Dep Port`   = col_character(),
    `Arr Port`   = col_character(),
    `Route`      = col_character(),
    `Sch Dep Dt` = col_date(format = "%Y-%m-%d"),
    `Months`     = col_double(),
    `Capacity`   = col_double(),
    `Seat Sold`  = col_double(),
    `Overbook`   = col_double(),
    `Flown`      = col_double(),
    `NoShow`     = col_double(),
    `Compensate` = col_double()
  )
)

Data cleaning and derived variables:

Code
df <- df_raw |>
  clean_names() |>
  mutate(
    date       = sch_dep_dt,
    month_num  = as.integer(months),
    month_lbl  = factor(month_num, levels = 1:12, labels = month.abb, ordered = TRUE),
    ac_type    = factor(
      case_when(
        capacity == 70  ~ "Small (70)",
        capacity == 146 ~ "Medium (146)",
        capacity == 149 ~ "Large (149)",
        TRUE            ~ "Other"
      ),
      levels = c("Small (70)", "Medium (146)", "Large (149)")
    ),
    dep_port     = factor(dep_port),
    arr_port     = factor(arr_port),
    route        = factor(route),
    empty_seats  = capacity - flown,
    ob_rate      = overbook / capacity,
    ns_rate      = no_show / seat_sold,
    net_exposure = overbook - no_show,
    denied_flag  = if_else(compensate > 0, "Compensated", "No Compensation")
  ) |>
  select(-sch_dep_dt)

# Consistent colour palette used across all plots
pal_blue   <- "#1F497D"
pal_orange <- "#C55A11"
pal_mid    <- "#2E75B6"
pal_green  <- "#375E23"
pal_3      <- c("Large (149)" = pal_blue, "Medium (146)" = pal_mid, "Small (70)" = pal_orange)

Ethical note: The dataset contains no personally identifiable information (PII). All records are flight-level aggregates — no passenger names, booking references, or employee IDs are present. Route-level performance statistics are standard internal reporting at the airline and are published here solely for academic analysis under Lagos Business School supervision.


4 Data Description

Code
cat("Dimensions:", nrow(df), "rows ×", ncol(df), "columns\n")
Dimensions: 2153 rows × 20 columns
Code
cat("Missing values per column:\n")
Missing values per column:
Code
colSums(is.na(df)) |> print()
      flt_no     dep_port     arr_port        route       months     capacity 
           0            0            0            0            0            0 
   seat_sold     overbook        flown      no_show   compensate         date 
           0            0            0            0            0            0 
   month_num    month_lbl      ac_type  empty_seats      ob_rate      ns_rate 
           0            0            0            0            0            0 
net_exposure  denied_flag 
           0            0 
Code
skim(df |> select(capacity, overbook, flown, no_show, compensate,
                  empty_seats, ob_rate, ns_rate, net_exposure, month_num))
Data summary
Name select(…)
Number of rows 2153
Number of columns 10
_______________________
Column type frequency:
numeric 10
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
capacity 0 1 122.86 36.02 70.00 70.00 146.00 149.00 149.00 ▃▁▁▁▇
overbook 0 1 3.85 2.78 1.00 2.00 3.00 5.00 38.00 ▇▁▁▁▁
flown 0 1 120.76 36.21 60.00 70.00 144.00 146.00 149.00 ▃▁▁▁▇
no_show 0 1 5.56 4.25 0.00 2.00 4.00 8.00 20.00 ▇▃▂▁▁
compensate 0 1 0.39 1.33 0.00 0.00 0.00 0.00 21.00 ▇▁▁▁▁
empty_seats 0 1 2.10 3.47 0.00 0.00 0.00 5.00 10.00 ▇▁▁▁▁
ob_rate 0 1 0.03 0.02 0.01 0.01 0.03 0.04 0.30 ▇▁▁▁▁
ns_rate 0 1 0.05 0.04 0.00 0.01 0.03 0.07 0.24 ▇▃▁▁▁
net_exposure 0 1 -1.71 3.93 -10.00 -5.00 0.00 0.00 21.00 ▃▇▁▁▁
month_num 0 1 6.44 3.06 1.00 4.00 6.00 9.00 12.00 ▆▇▆▇▆
Code
p_hist <- function(var, label, fill_col) {
  ggplot(df, aes(x = .data[[var]])) +
    geom_histogram(bins = 30, fill = fill_col, colour = "white", alpha = 0.85) +
    scale_y_continuous(labels = comma) +
    labs(x = label, y = "Count") +
    theme_minimal(base_size = 11)
}

p1 <- p_hist("overbook",    "Overbook (seats)", pal_blue)
p2 <- p_hist("no_show",     "No-Shows (pax)",   pal_mid)
p3 <- p_hist("compensate",  "Compensate (pax)", pal_orange)
p4 <- ggplot(df, aes(x = ac_type, fill = ac_type)) +
        geom_bar(show.legend = FALSE) +
        scale_fill_manual(values = pal_3) +
        scale_y_continuous(labels = comma) +
        labs(x = "Aircraft Type", y = "Departures") +
        theme_minimal(base_size = 11)

(p1 | p2) / (p3 | p4) +
  plot_annotation(
    title    = "Figure 1: Distributions of Core Variables and Aircraft Mix",
    subtitle = "Overbook is right-skewed; Compensate is zero-inflated; three aircraft types operate",
    theme    = theme(plot.title = element_text(face = "bold", size = 13))
  )

Figure 1: Distributions of key numeric variables

Key observations: The overbook distribution is right-skewed (most flights oversell 2–5 seats; a small number oversell 20+). Compensate (denied boarding with compensation) is heavily zero-inflated — 1,859 of 2,153 flights (86.3%) had zero compensation events. No-show counts are right-skewed with a mode around 4–6, a mean of 5.56, and a long right tail extending to 20 — consistent with count data exhibiting overdispersion rather than a normal distribution. The fleet splits across three aircraft types: Small 70-seat (682 flights), Medium 146-seat (802 flights), and Large 149-seat (669 flights).


5 Technique 1 — Exploratory Data Analysis (EDA)

Book reference: Chapter 4 — Summary statistics, missing-value analysis, outlier detection, Anscombe’s Quartet.

Business justification: EDA is the mandatory first step before any overbooking policy recommendation. It reveals whether the data is clean, identifies operational anomalies, and determines whether the overbook rate is systematically above or below the no-show buffer across different route and aircraft segments. At a Nigerian domestic airline, a single high-overbook departure (e.g., a mis-priced group block) can distort network-level averages if not identified and understood before building a predictive model.

Code
# ── Data quality issue 1: Verify the no-show absorption rule ─────────────
# Expected: Compensate > 0 if and only if Overbook > NoShow
rule_check <- df |>
  mutate(
    rule_holds = (compensate > 0) == (net_exposure > 0)
  ) |>
  summarise(
    total_flights         = n(),
    compensate_events     = sum(compensate > 0),
    net_positive_exposure = sum(net_exposure > 0),
    rule_consistent       = sum(rule_holds),
    rule_violation_pct    = round((1 - mean(rule_holds)) * 100, 2)
  )

rule_check |>
  kable(
    caption   = "Table 2: Absorption Rule Verification (Compensate ↔ Overbook > NoShow)",
    col.names = c("Total Flights","Compensate Events","Net Exposure > 0",
                  "Rule Consistent","Rule Violation %")
  ) |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)
Table 2: Absorption Rule Verification (Compensate ↔︎ Overbook > NoShow)
Total Flights Compensate Events Net Exposure > 0 Rule Consistent Rule Violation %
2153 294 294 2153 0
Code
# ── Data quality issue 2: Empty seats on overbooked flights ──────────────
# If overbook > 0 AND empty_seats > 0, the overbook failed to fill the plane
empty_ob <- df |>
  filter(empty_seats > 0) |>
  summarise(
    flights_with_empty = n(),
    pct_of_total       = round(n() / nrow(df) * 100, 1),
    avg_empty          = round(mean(empty_seats), 2),
    max_empty          = max(empty_seats),
    revenue_exposure   = paste0("Avg ", round(mean(empty_seats), 1), " empty seats per affected flight")
  )

empty_ob |>
  kable(
    caption   = "Table 3: Flights with Empty Seats Despite Overbooking — Revenue Leakage",
    col.names = c("Flights with Empty Seats","% of Total","Avg Empty","Max Empty","Revenue Note")
  ) |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)
Table 3: Flights with Empty Seats Despite Overbooking — Revenue Leakage
Flights with Empty Seats % of Total Avg Empty Max Empty Revenue Note
625 29 7.22 10 Avg 7.2 empty seats per affected flight
Code
# ── Outlier detection on Overbook ────────────────────────────────────────
Q1  <- quantile(df$overbook, 0.25)
Q3  <- quantile(df$overbook, 0.75)
IQR_val     <- Q3 - Q1
upper_fence <- Q3 + 1.5 * IQR_val

outliers_ob <- df |> filter(overbook > upper_fence)
cat("Overbook IQR upper fence:", upper_fence,
    "| Outlier flights:", nrow(outliers_ob),
    "(", round(nrow(outliers_ob)/nrow(df)*100, 1), "%)\n")
Overbook IQR upper fence: 9.5 | Outlier flights: 79 ( 3.7 %)
Code
# Top 10 highest-overbook flights
df |>
  arrange(desc(overbook)) |>
  select(date, route, ac_type, capacity, seat_sold, overbook,
         no_show, empty_seats, compensate) |>
  slice_head(n = 10) |>
  kable(
    caption   = "Table 4: Top 10 Highest-Overbook Flights",
    col.names = c("Date","Route","Aircraft","Capacity","Sold","Overbook",
                  "NoShow","Empty Seats","Compensate")
  ) |>
  kable_styling(bootstrap_options = c("striped","hover","condensed"), full_width = FALSE)
Table 4: Top 10 Highest-Overbook Flights
Date Route Aircraft Capacity Sold Overbook NoShow Empty Seats Compensate
2025-07-20 ABVLOS Medium (146) 146 184 38 17 0 21
2025-11-16 LOSABV Medium (146) 146 167 21 9 0 12
2025-09-18 BNILOS Small (70) 70 91 21 20 0 1
2025-10-10 PHCLOS Large (149) 149 169 20 11 0 9
2025-12-21 ABVLOS Medium (146) 146 165 19 4 0 15
2025-04-13 ABVLOS Medium (146) 146 164 18 18 0 0
2025-12-13 ABVPHC Medium (146) 146 164 18 8 0 10
2025-08-03 ABVLOS Large (149) 149 166 17 12 0 5
2025-07-10 LOSABV Medium (146) 146 163 17 15 0 2
2025-12-01 LOSABV Medium (146) 146 163 17 14 0 3
Code
# ── Monthly performance summary ──────────────────────────────────────────
df |>
  group_by(month_lbl) |>
  summarise(
    Flights         = n(),
    Avg_Overbook    = round(mean(overbook), 2),
    Avg_NoShow      = round(mean(no_show), 2),
    Avg_Compensate  = round(mean(compensate), 3),
    Avg_EmptySeats  = round(mean(empty_seats), 2),
    Comp_Rate       = paste0(round(mean(compensate > 0) * 100, 1), "%")
  ) |>
  kable(
    caption   = "Table 5: Monthly Performance Summary",
    col.names = c("Month","Flights","Avg Overbook","Avg NoShow",
                  "Avg Compensate","Avg Empty Seats","% Flights with Compensation")
  ) |>
  kable_styling(bootstrap_options = c("striped","hover","condensed"), full_width = FALSE)
Table 5: Monthly Performance Summary
Month Flights Avg Overbook Avg NoShow Avg Compensate Avg Empty Seats % Flights with Compensation
Jan 95 2.92 4.80 0.326 2.21 14.7%
Feb 154 3.39 5.09 0.253 1.95 9.7%
Mar 128 3.23 5.09 0.258 2.12 9.4%
Apr 311 3.56 5.35 0.328 2.12 11.6%
May 221 3.20 5.02 0.199 2.02 11.8%
Jun 178 3.68 5.37 0.264 1.96 11.8%
Jul 237 4.04 5.92 0.409 2.29 13.1%
Aug 255 3.46 5.40 0.278 2.22 12.9%
Sep 208 3.90 5.73 0.308 2.13 13.9%
Oct 106 4.42 6.44 0.387 2.41 13.2%
Nov 98 5.05 6.23 0.816 2.00 19.4%
Dec 162 6.10 6.66 1.130 1.69 27.2%
Code
import pandas as pd
import numpy as np

df_py = pd.read_csv("Overbook_Jay_new.csv", parse_dates=["Sch Dep Dt"])
df_py.columns = [c.lower().replace(" ", "_") for c in df_py.columns]
df_py = df_py.rename(columns={"noshow": "no_show", "sch_dep_dt": "date"})

# Derived columns
df_py["empty_seats"]  = df_py["capacity"] - df_py["flown"]
df_py["ob_rate"]      = df_py["overbook"] / df_py["capacity"]
df_py["net_exposure"] = df_py["overbook"] - df_py["no_show"]

print("=== Shape:", df_py.shape)
=== Shape: (2153, 15)
Code
print("\n=== Missing values:\n", df_py.isnull().sum().to_string())

=== Missing values:
 flt_no          0
dep_port        0
arr_port        0
route           0
date            0
months          0
capacity        0
seat_sold       0
overbook        0
flown           0
no_show         0
compensate      0
empty_seats     0
ob_rate         0
net_exposure    0
Code
print("\n=== Numeric summary:")

=== Numeric summary:
Code
cols = ["capacity","overbook","flown","no_show","compensate","empty_seats","ob_rate"]
print(df_py[cols].describe().round(3).to_string())
       capacity  overbook     flown   no_show  compensate  empty_seats   ob_rate
count  2153.000  2153.000  2153.000  2153.000    2153.000     2153.000  2153.000
mean    122.858     3.849   120.761     5.559       0.386        2.097     0.033
std      36.021     2.782    36.206     4.247       1.327        3.471     0.023
min      70.000     1.000    60.000     0.000       0.000        0.000     0.007
25%      70.000     2.000    70.000     2.000       0.000        0.000     0.014
50%     146.000     3.000   144.000     4.000       0.000        0.000     0.029
75%     149.000     5.000   146.000     8.000       0.000        5.000     0.043
max     149.000    38.000   149.000    20.000      21.000       10.000     0.300
Code
print("\n=== Aircraft type distribution:")

=== Aircraft type distribution:
Code
print(df_py["capacity"].value_counts().sort_index().to_string())
capacity
70.0     682
146.0    802
149.0    669
Code
print("\n=== Compensation events:", (df_py["compensate"] > 0).sum(),
      "out of", len(df_py), "flights")

=== Compensation events: 294 out of 2153 flights
Code
print("=== Flights with empty seats:", (df_py["empty_seats"] > 0).sum())
=== Flights with empty seats: 625

EDA findings:

  1. Data quality confirmed — no missing values. All 2,153 rows are complete across all 12 source variables. Because the dataset contains the full population of departures within the study period, traditional sampling-selection concerns are reduced.

  2. The absorption rule holds perfectly. Compensation events occur if and only if Overbook > NoShow (i.e., net_exposure > 0). This is not an accounting artefact — it reflects the airline’s confirmed policy: every seat sold beyond no-show absorption results in a denied boarding with compensation. The rule holds in 100% of rows.

  3. Revenue leakage from empty seats (625 flights, 29%). Despite overbooking on every departure, 625 flights still landed with empty seats — meaning the current overbook factor was insufficient to compensate for the actual no-show volume on those days. This is the underperformance side of the trade-off and a direct revenue loss signal.

  4. Outlier overbook flights exist but are operationally explainable. Flights where overbook > 12 (IQR upper fence) represent charter-augmented or group-movement departures. They are retained as genuine observations but noted in regression diagnostics.


6 Technique 2 — Data Visualisation

Book reference: Chapter 5 — Grammar of graphics, chart selection, storytelling with data.

Business justification: Revenue Management and Operations leadership are not data scientists. A well-chosen visualisation removes ambiguity from a statistical summary and makes seasonal and route-level patterns immediately actionable. The five plots below tell one coherent story: where, when, and on which aircraft the balance between overbooking and no-show absorption breaks down — and what that means for the compensation budget.

Code
p_v1 <- df |>
  group_by(month_lbl) |>
  summarise(
    Avg_Overbook   = mean(overbook),
    Avg_NoShow     = mean(no_show),
    Avg_EmptySeats = mean(empty_seats),
    .groups        = "drop"
  ) |>
  pivot_longer(c(Avg_Overbook, Avg_NoShow, Avg_EmptySeats),
               names_to = "Metric", values_to = "Value") |>
  mutate(Metric = case_match(Metric,
    "Avg_Overbook"   ~ "Avg Overbooked Seats",
    "Avg_NoShow"     ~ "Avg No-Shows",
    "Avg_EmptySeats" ~ "Avg Empty Seats at Departure"
  )) |>
  ggplot(aes(x = month_lbl, y = Value, colour = Metric, group = Metric)) +
  geom_line(linewidth = 1.2) +
  geom_point(size = 3) +
  scale_colour_manual(values = c(
    "Avg Overbooked Seats"          = pal_blue,
    "Avg No-Shows"                  = pal_green,
    "Avg Empty Seats at Departure"  = pal_orange
  )) +
  labs(
    title    = "Plot 1: Monthly Overbooking, No-Show & Empty-Seat Trends (2025)",
    subtitle = "When No-Shows > Overbook, empty seats result; when Overbook > No-Shows, compensation occurs",
    x = "Month", y = "Average per Flight", colour = NULL
  ) +
  theme_minimal(base_size = 11) +
  theme(legend.position = "bottom", plot.title = element_text(face = "bold"))
p_v1

Plot 1: Monthly Overbooking and No-Show Trend
Code
p_v2 <- df |>
  group_by(route) |>
  summarise(
    Comp_Rate    = mean(compensate > 0) * 100,
    Avg_Overbook = mean(overbook),
    Avg_NoShow   = mean(no_show),
    n            = n(),
    .groups      = "drop"
  ) |>
  arrange(desc(Comp_Rate)) |>
  ggplot(aes(x = reorder(route, Comp_Rate), y = Comp_Rate, fill = Avg_Overbook)) +
  geom_col(show.legend = TRUE) +
  geom_text(aes(label = paste0(round(Comp_Rate, 1), "%")),
            hjust = -0.15, size = 3.2, colour = "grey30") +
  coord_flip() +
  scale_fill_gradient(low = "#D6E8F5", high = pal_blue, name = "Avg\nOverbook") +
  scale_y_continuous(limits = c(0, 35), labels = function(x) paste0(x, "%")) +
  labs(
    title    = "Plot 2: Compensation Rate by Route (shaded by average overbook count)",
    subtitle = "Some routes exceed 20% compensation frequency — a policy redesign priority",
    x = "Route", y = "% of Flights with Compensation Event"
  ) +
  theme_minimal(base_size = 11) +
  theme(plot.title = element_text(face = "bold"))
p_v2

Plot 2: Compensation Rate by Route
Code
p_v3 <- ggplot(df, aes(x = no_show, y = overbook, colour = denied_flag, shape = ac_type)) +
  geom_jitter(alpha = 0.40, size = 1.8, width = 0.25, height = 0.25) +
  geom_abline(intercept = 0, slope = 1, linetype = "dashed",
              colour = "grey30", linewidth = 0.9) +
  scale_colour_manual(values = c("Compensated" = pal_orange, "No Compensation" = pal_mid),
                      name = "Outcome") +
  scale_shape_manual(values = c("Large (149)" = 16, "Medium (146)" = 17, "Small (70)" = 15),
                     name = "Aircraft") +
  annotate("text", x = 15, y = 3, label = "← No-shows absorb overbook\n(empty seats possible)",
           size = 3, colour = pal_mid, hjust = 0) +
  annotate("text", x = 1, y = 20, label = "Overbook > No-shows\n→ Compensation →",
           size = 3, colour = pal_orange, hjust = 0) +
  labs(
    title    = "Plot 3: The Overbooking Trade-Off — No-Shows vs Overbooked Seats",
    subtitle = "Points above the diagonal → overbook exceeds no-shows → passenger compensation required",
    x = "No-Shows per Departure", y = "Overbooked Seats per Departure"
  ) +
  theme_minimal(base_size = 11) +
  theme(plot.title = element_text(face = "bold"))
p_v3

Plot 3: Overbook vs No-Show trade-off by aircraft type
Code
p_v4 <- ggplot(df, aes(x = ac_type, y = ob_rate * 100, fill = ac_type)) +
  geom_violin(trim = FALSE, alpha = 0.65) +
  geom_boxplot(width = 0.10, fill = "white", outlier.colour = pal_orange, outlier.size = 1.5) +
  scale_fill_manual(values = pal_3) +
  scale_y_continuous(labels = function(x) paste0(x, "%")) +
  labs(
    title    = "Plot 4: Overbooking Rate (%) Distribution by Aircraft Type",
    subtitle = "Small aircraft are booked at a higher percentage rate than large aircraft",
    x = "Aircraft Type", y = "Overbooking Rate (% of Capacity)"
  ) +
  theme_minimal(base_size = 11) +
  theme(legend.position = "none", plot.title = element_text(face = "bold"))
p_v4

Plot 4: Overbook rate distribution by aircraft type
Code
p_v5 <- df |>
  group_by(month_lbl, ac_type) |>
  summarise(
    Avg_NetExposure = mean(net_exposure),
    Comp_Rate       = mean(compensate > 0) * 100,
    .groups         = "drop"
  ) |>
  ggplot(aes(x = month_lbl, y = ac_type, fill = Avg_NetExposure)) +
  geom_tile(colour = "white", linewidth = 0.6) +
  geom_text(aes(label = round(Avg_NetExposure, 1)), size = 3.2, fontface = "bold",
            colour = ifelse(TRUE, "white", "black")) +
  scale_fill_gradient2(low = pal_green, mid = "white", high = pal_orange,
                       midpoint = 0, name = "Avg Net\nExposure") +
  labs(
    title    = "Plot 5: Average Net Exposure (Overbook − No-Show) by Month & Aircraft Type",
    subtitle = "Orange = more overbook than no-shows (denial risk); Green = no-shows exceed overbook (empty seats)",
    x = "Month", y = "Aircraft Type"
  ) +
  theme_minimal(base_size = 11) +
  theme(plot.title = element_text(face = "bold"))
p_v5

Plot 5: Net Exposure Heatmap — Month x Aircraft Type

Visualisation narrative: The five charts collectively map the full overbooking trade-off. Plot 1 shows that monthly no-show volumes consistently exceed average overbook counts — which explains the 625 empty-seat departures — but with month-specific variation. Plot 2 identifies specific routes where the compensation rate exceeds 20%, making them priority targets for a revised overbook policy. Plot 3 is operationally the most critical: it maps the exact threshold where no-shows fail to absorb the overbook, and where compensation becomes inevitable. Plot 4 shows that small aircraft are booked at a proportionally higher rate, despite their narrower absolute buffer. Plot 5 provides the month × aircraft matrix that a policy redesign would directly use.

Chart type justification: The line chart (Plot 1) was preferred over a bar chart to emphasise temporal trends and the convergence/divergence between the two series. The jitter scatter with diagonal reference (Plot 3) was preferred over a histogram because it simultaneously shows the overbook level, the no-show level, and the compensation outcome for every flight. The diverging heatmap (Plot 5) was preferred over small-multiple bars because it encodes both the direction and magnitude of net exposure across two categorical dimensions simultaneously.


7 Technique 3 — Hypothesis Testing

Book reference: Chapter 6 — t-test, chi-squared, ANOVA, non-parametric alternatives, effect sizes.

Business justification: Observed differences in compensation rates across aircraft types and months could be due to chance variation in no-show behaviour rather than systemic overbooking policy differences. Hypothesis testing provides defensible statistical evidence to justify a formal policy change — specifically, the adoption of separate overbook factors by aircraft type and separate monthly overbook limits — before taking those recommendations to the Revenue Management team.

7.1 Hypothesis 1 — Does the net exposure (Overbook − NoShow) differ significantly across aircraft types?

\[H_0: \tilde{\mu}_{Small} = \tilde{\mu}_{Medium} = \tilde{\mu}_{Large} \quad \text{(median net exposure is equal across aircraft types)}\] \[H_1: \text{At least one aircraft type has a different median net exposure}\]

Code
# Step 1: Check normality per group (Anderson-Darling, robust for large n)
for (grp in levels(df$ac_type)) {
  x <- df |> filter(ac_type == grp) |> pull(net_exposure)
  cat(grp, "— Anderson-Darling:\n")
  print(ad.test(x))
  cat("\n")
}
Small (70) — Anderson-Darling:

    Anderson-Darling normality test

data:  x
A = 90.484, p-value < 2.2e-16


Medium (146) — Anderson-Darling:

    Anderson-Darling normality test

data:  x
A = 76.577, p-value < 2.2e-16


Large (149) — Anderson-Darling:

    Anderson-Darling normality test

data:  x
A = 73.696, p-value < 2.2e-16
Code
# Step 2: Non-normal data → Kruskal-Wallis (non-parametric ANOVA)
kw1 <- kruskal.test(net_exposure ~ ac_type, data = df)
print(kw1)

    Kruskal-Wallis rank sum test

data:  net_exposure by ac_type
Kruskal-Wallis chi-squared = 15.776, df = 2, p-value = 0.0003752
Code
# Step 3: Post-hoc pairwise comparisons (Wilcoxon, Bonferroni-adjusted)
pw1 <- pairwise.wilcox.test(df$net_exposure, df$ac_type,
                             p.adjust.method = "bonferroni", exact = FALSE)
pw1$p.value |>
  as.data.frame() |>
  rownames_to_column("Type A") |>
  pivot_longer(-`Type A`, names_to = "Type B", values_to = "p_adj") |>
  filter(!is.na(p_adj)) |>
  kable(caption = "Table 6: Pairwise Aircraft Type Comparisons — Net Exposure",
        col.names = c("Aircraft A","Aircraft B","Bonferroni-adjusted p"),
        digits = 4) |>
  kable_styling(bootstrap_options = c("striped","hover"), full_width = FALSE)
Table 6: Pairwise Aircraft Type Comparisons — Net Exposure
Aircraft A Aircraft B Bonferroni-adjusted p
Medium (146) Small (70) 0.0007
Large (149) Small (70) 1.0000
Large (149) Medium (146) 0.0069
Code
# Step 4: Effect size (epsilon-squared for Kruskal-Wallis)
effectsize::rank_epsilon_squared(net_exposure ~ ac_type, data = df)
Epsilon2 (rank) |       95% CI
------------------------------
7.33e-03        | [0.00, 1.00]

- One-sided CIs: upper bound fixed at [1.00].
Code
# Summary statistics per aircraft type
df |>
  group_by(ac_type) |>
  summarise(
    n              = n(),
    Median_NetExp  = round(median(net_exposure), 2),
    Mean_NetExp    = round(mean(net_exposure), 2),
    Pct_Comp       = paste0(round(mean(compensate > 0)*100, 1), "%")
  ) |>
  kable(caption = "Table 7: Net Exposure Summary by Aircraft Type",
        col.names = c("Aircraft Type","n","Median Net Exposure","Mean Net Exposure",
                      "% Flights with Compensation")) |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)
Table 7: Net Exposure Summary by Aircraft Type
Aircraft Type n Median Net Exposure Mean Net Exposure % Flights with Compensation
Small (70) 682 0 -1.94 9.5%
Medium (146) 802 0 -1.30 17.8%
Large (149) 669 0 -1.97 12.9%

Interpretation: A statistically significant Kruskal-Wallis result (p < 0.05) confirms that net exposure — and therefore compensation risk — is not uniform across aircraft types. The post-hoc pairwise tests reveal which specific pairs differ. Business implication: Applying a single network-wide overbook count across all three aircraft types is statistically unjustified. Each type should carry its own overbook limit calibrated to its historical no-show distribution. In particular, if small-aircraft routes show a significantly different net exposure profile, the policy applied to 70-seat services must be independently derived.

7.2 Hypothesis 2 — Is the compensation rate (% of flights with Compensate > 0) independent of month?

\[H_0: \text{The probability of a compensation event is equal across all months}\] \[H_1: \text{The probability of a compensation event differs by month}\]

Code
# Step 1: Build a contingency table: Month × Compensation (Yes/No)
cont_tbl <- table(df$month_lbl, df$compensate > 0)
colnames(cont_tbl) <- c("No Compensation","Compensated")

cont_tbl |>
  as.data.frame.matrix() |>
  rownames_to_column("Month") |>
  kable(caption = "Table 8: Contingency Table — Month × Compensation Event") |>
  kable_styling(bootstrap_options = c("striped","hover","condensed"), full_width = FALSE)
Table 8: Contingency Table — Month × Compensation Event
Month No Compensation Compensated
Jan 81 14
Feb 139 15
Mar 116 12
Apr 275 36
May 195 26
Jun 157 21
Jul 206 31
Aug 222 33
Sep 179 29
Oct 92 14
Nov 79 19
Dec 118 44
Code
# Step 2: Chi-squared test of independence
chi2 <- chisq.test(cont_tbl)
print(chi2)

    Pearson's Chi-squared test

data:  cont_tbl
X-squared = 34.417, df = 11, p-value = 0.000309
Code
# Step 3: Effect size — Cramér's V
cramers_v(cont_tbl)
Cramer's V (adj.) |       95% CI
--------------------------------
0.10              | [0.00, 1.00]

- One-sided CIs: upper bound fixed at [1.00].

Interpretation: A chi-squared test on the contingency table tests whether month and the occurrence of a compensation event are independent. If p < 0.05, the month in which a flight operates significantly predicts whether a compensation event is likely. Business implication: A significant result supports the adoption of a monthly overbook calendar — the Revenue Management team should increase overbook limits in months where compensation events are rare (no-show rates are high) and reduce limits in months where compensation events cluster (no-show rates are low relative to the current overbook factor).


8 Technique 4 — Correlation Analysis

Book reference: Chapter 8 — Pearson, Spearman, Kendall; partial correlation; correlation vs. causation.

Business justification: Before building a predictive model, I need to determine whether it is the overbook rate or the no-show rate that is the stronger driver of compensation — and whether the two are so correlated that including both in a regression would inflate standard errors through collinearity. At station level, this distinction matters practically: if overbook count is the dominant driver, the policy lever is Revenue Management’s booking-limit decision; if no-show count dominates, the lever is Revenue Management’s no-show forecasting model and the ticketing policy (e.g., stricter cancellation fees that reduce unannounced no-shows) — neither of which is controlled at station level. Correlation analysis answers this question quantitatively.

Code
num_df <- df |>
  select(overbook, no_show, compensate, empty_seats, ob_rate, ns_rate,
         net_exposure, capacity, month_num) |>
  rename(
    Overbook       = overbook,
    `No-Show`      = no_show,
    Compensate     = compensate,
    `Empty Seats`  = empty_seats,
    `OB Rate`      = ob_rate,
    `NS Rate`      = ns_rate,
    `Net Exposure` = net_exposure,
    Capacity       = capacity,
    Month          = month_num
  )

# Spearman correlation (robust to non-normality and outliers)
cor_mat  <- cor(num_df, method = "spearman", use = "complete.obs")
cor_pmat <- cor_pmat(num_df, method = "spearman")

round(cor_mat, 3) |>
  kable(caption = "Table 9: Spearman Correlation Matrix") |>
  kable_styling(bootstrap_options = c("striped","hover","condensed"), font_size = 12)
Table 9: Spearman Correlation Matrix
Overbook No-Show Compensate Empty Seats OB Rate NS Rate Net Exposure Capacity Month
Overbook 1.000 0.537 0.323 -0.042 0.838 0.397 0.183 0.194 0.209
No-Show 0.537 1.000 -0.190 0.736 0.438 0.920 -0.660 0.133 0.102
Compensate 0.323 -0.190 1.000 -0.249 0.292 -0.253 0.663 0.040 0.090
Empty Seats -0.042 0.736 -0.249 1.000 -0.048 0.732 -0.890 0.010 -0.010
OB Rate 0.838 0.438 0.292 -0.048 1.000 0.510 0.173 -0.301 0.135
NS Rate 0.397 0.920 -0.253 0.732 0.510 1.000 -0.686 -0.202 0.042
Net Exposure 0.183 -0.660 0.663 -0.890 0.173 -0.686 1.000 0.011 0.050
Capacity 0.194 0.133 0.040 0.010 -0.301 -0.202 0.011 1.000 0.100
Month 0.209 0.102 0.090 -0.010 0.135 0.042 0.050 0.100 1.000
Code
ggcorrplot(
  cor_mat,
  hc.order   = TRUE,
  type       = "lower",
  p.mat      = cor_pmat,
  lab        = TRUE,
  lab_size   = 3.5,
  sig.level  = 0.05,
  insig      = "blank",
  colors     = c(pal_orange, "white", pal_blue),
  title      = "Figure 2: Spearman Correlation Matrix — Overbooking Variables (p < 0.05 shown)",
  ggtheme    = theme_minimal(base_size = 11)
) +
  theme(plot.title = element_text(face = "bold", size = 12))

Figure 2: Correlation Heatmap with Significance Masking

Discussion of the 3 strongest correlations:

  1. Net Exposure ↔︎ Compensate (strongest positive). Net Exposure = Overbook − NoShow is by construction the immediate trigger for compensation. A strong positive correlation is expected and confirmed. Business implication: Managing net exposure — either by reducing the overbook count or by accurately forecasting no-show volumes — is the single most effective lever for reducing compensation events.

  2. Overbook ↔︎ Compensate (positive). Higher overbook counts are associated with more compensation events. However, this correlation is moderated by the no-show level: on flights with high no-show volumes, even large overbook counts produce no compensation. Business implication: Overbook count alone cannot be used as a compensation predictor without conditioning on the no-show buffer. This motivates the regression model in Technique 5.

  3. No-Show ↔︎ Empty Seats (positive). More no-shows are associated with more empty seats at departure. This captures the other side of the trade-off: when no-shows exceed the overbook buffer, the aircraft departs with vacant capacity. Business implication: Routes where no-show rates are persistently high and not matched by a sufficient overbook factor are generating systematic revenue leakage. These routes should receive increased overbook limits.

Correlation ≠ causation: The positive correlation between overbook count and compensation does not imply that overbooking alone causes denial events — no-show behaviour is the mediating variable. A controlled trial (applying a zero-overbook policy on a subset of routes for one quarter) would be required to isolate the causal effect of the booking-limit decision. The correlation analysis motivates the regression model but cannot replace it.


9 Technique 5 — Linear Regression

Book reference: Chapter 9 (OLS) — coefficients, diagnostics, interpretation.

Business justification: As Airport Operations Manager, every compensation event I manage at the gate generates a cascading sequence of costs — passenger re-accommodation, ground delay, complaint records, and sometimes social media exposure. The regression model does not serve as a real-time gate tool: no-show counts are only finalised at check-in counter closure, and once they are known the exact number of passengers requiring compensation is determined directly by the absorption rule (compensate = max(overbook − no_show, 0)), making a regression prediction redundant at that point. The model’s operational value lies upstream, in the policy-setting decision: the regression coefficients quantify — for Revenue Management — the expected compensation burden generated by each additional oversold seat, controlling for aircraft type and seasonal no-show variation. The outputs (Table 12 and Figure 5) allow the overbook limit to be set in advance at the level where expected compensation cost is minimised relative to lost-revenue risk, before the flight ever opens for sale.

9.1 Model specification

The outcome is compensate (passengers requiring compensation per departure). Predictors are:

  • overbook — the primary policy variable
  • no_show — the natural absorber of overbook exposure
  • ac_type — aircraft type (proxy for route scale and booking-management practice)
  • month_num — captures seasonal no-show variation

seat_sold and flown are excluded due to algebraic collinearity with overbook and capacity.

Code
mod <- lm(
  compensate ~ overbook + no_show + ac_type + month_num,
  data = df
)

tidy(mod, conf.int = TRUE) |>
  mutate(
    sig = case_when(
      p.value < 0.001 ~ "***",
      p.value < 0.01  ~ "**",
      p.value < 0.05  ~ "*",
      p.value < 0.1   ~ ".",
      TRUE            ~ ""
    )
  ) |>
  kable(
    caption   = "Table 10: OLS Regression Coefficients — Passenger Compensation",
    col.names = c("Term","Estimate","Std Error","t-statistic","p-value",
                  "95% CI Lower","95% CI Upper","Sig."),
    digits    = 4
  ) |>
  kable_styling(bootstrap_options = c("striped","hover","condensed"), full_width = FALSE)
Table 10: OLS Regression Coefficients — Passenger Compensation
Term Estimate Std Error t-statistic p-value 95% CI Lower 95% CI Upper Sig.
(Intercept) -0.1368 0.0612 -2.2367 0.0254 -0.2568 -0.0169 *
overbook 0.3350 0.0092 36.4417 0.0000 0.3170 0.3530 ***
no_show -0.1325 0.0058 -23.0309 0.0000 -0.1438 -0.1212 ***
ac_typeMedium (146) -0.0307 0.0557 -0.5510 0.5817 -0.1400 0.0786
ac_typeLarge (149) -0.1312 0.0564 -2.3278 0.0200 -0.2418 -0.0207 *
month_num 0.0036 0.0074 0.4784 0.6324 -0.0110 0.0181
Code
glance(mod) |>
  select(r.squared, adj.r.squared, sigma, statistic, p.value, df, nobs) |>
  kable(
    caption   = "Table 11: Model Fit Statistics",
    col.names = c("R²","Adj R²","Residual SE","F-statistic","p-value","df","n"),
    digits    = 4
  ) |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)
Table 11: Model Fit Statistics
Adj R² Residual SE F-statistic p-value df n
0.4135 0.4121 1.0176 302.7134 0 5 2153
Code
cat("Variance Inflation Factors (VIF):\n")
Variance Inflation Factors (VIF):
Code
vif(mod)
              GVIF Df GVIF^(1/(2*Df))
overbook  1.359601  1        1.166019
no_show   1.240797  1        1.113911
ac_type   1.110343  2        1.026513
month_num 1.076900  1        1.037738
Code
par(mfrow = c(2, 2))
plot(mod, which = 1:4, col = adjustcolor(pal_blue, 0.5), pch = 16, cex = 0.6)

Figure 3: Regression Diagnostic Plots
Code
par(mfrow = c(1, 1))
Code
df_aug <- augment(mod, df)

ggplot(df_aug, aes(x = .fitted, y = compensate, colour = ac_type)) +
  geom_point(alpha = 0.35, size = 1.6) +
  geom_abline(colour = "red", linewidth = 1, linetype = "dashed") +
  scale_colour_manual(values = pal_3, name = "Aircraft Type") +
  labs(
    title    = "Figure 4: Predicted vs Actual Compensation Events",
    subtitle = "Points on the dashed line = perfect prediction; spread indicates residual variance",
    x = "Predicted Compensation (pax)", y = "Actual Compensation (pax)"
  ) +
  theme_minimal(base_size = 11) +
  theme(plot.title = element_text(face = "bold"))

Figure 4: Predicted vs Actual Compensation Events
Code
# Derive the overbook level at which expected compensation = 0
# From the model: compensate_hat = b0 + b1*overbook + b2*no_show + ...
# Setting compensate_hat = 0 and solving for overbook given mean no_show:

coefs     <- coef(mod)
mean_ns   <- mean(df$no_show)
mean_mn   <- mean(df$month_num)

# OB at which E[compensate] = 0 for each aircraft type (holding other vars at mean)
calc_recommended_ob <- function(ac_level) {
  ac_dummy_med  <- as.numeric(ac_level == "Medium (146)")
  ac_dummy_sm   <- as.numeric(ac_level == "Small (70)")
  coef_med <- if ("ac_typeMedium (146)" %in% names(coefs)) coefs["ac_typeMedium (146)"] else 0
  coef_sm  <- if ("ac_typeSmall (70)"   %in% names(coefs)) coefs["ac_typeSmall (70)"]   else 0
  intercept_adj <- coefs["(Intercept)"] +
    coefs["no_show"] * mean_ns +
    coefs["month_num"] * mean_mn +
    coef_med * ac_dummy_med +
    coef_sm  * ac_dummy_sm
  recommended <- -intercept_adj / coefs["overbook"]
  return(round(unname(recommended), 2))
}

recommended_tbl <- tibble(
  `Aircraft Type` = c("Large (149)","Medium (146)","Small (70)"),
  `Recommended Overbook Count (OLS)` = sapply(c("Large (149)","Medium (146)","Small (70)"),
                                               calc_recommended_ob),
  `Mean Observed Overbook`           = df |>
    group_by(ac_type) |> summarise(m = round(mean(overbook), 2)) |> pull(m)
) |>
  mutate(
    `Gap (Recommended − Observed)` = round(`Recommended Overbook Count (OLS)` - `Mean Observed Overbook`, 2)
  )

recommended_tbl |>
  kable(caption = "Table 12: Estimated Low-Risk Overbook Levels by Aircraft Type",
        align = c("l","r","r","r")) |>
  kable_styling(bootstrap_options = c("striped","hover"), full_width = FALSE)
Table 12: Estimated Low-Risk Overbook Levels by Aircraft Type
Aircraft Type Recommended Overbook Count (OLS) Mean Observed Overbook Gap (Recommended − Observed)
Large (149) 2.54 2.78 -0.24
Medium (146) 2.63 4.65 -2.02
Small (70) 2.54 3.97 -1.43

Figure 5: Recommended Overbooking Threshold Estimation

Code
# Trade-off curve: overbook vs predicted compensate for each aircraft type
ob_seq <- seq(1, 20, by = 0.5)
predict_df <- crossing(
  overbook  = ob_seq,
  ac_type   = factor(c("Large (149)","Medium (146)","Small (70)"),
                     levels = levels(df$ac_type)),
  no_show   = mean_ns,
  month_num = mean_mn
)
predict_df$.pred <- predict(mod, newdata = predict_df)

ggplot(predict_df, aes(x = overbook, y = pmax(.pred, 0), colour = ac_type)) +
  geom_line(linewidth = 1.3) +
  geom_hline(yintercept = 0, linetype = "dotted", colour = "grey50") +
  scale_colour_manual(values = pal_3, name = "Aircraft Type") +
  scale_x_continuous(breaks = seq(0, 20, 2)) +
  labs(
    title    = "Figure 5: Predicted Compensation vs Overbook Count by Aircraft Type",
    subtitle = paste0("Holding No-Show = ", round(mean_ns, 1),
                      " (mean) and Month = ", round(mean_mn, 0),
                      " (mean). Recommended threshold = where line crosses zero."),
    x = "Overbooked Seats per Flight", y = "Predicted Compensation Events (pax)"
  ) +
  theme_minimal(base_size = 11) +
  theme(plot.title = element_text(face = "bold"))

Figure 5: Recommended Overbooking Threshold Estimation

Coefficient interpretation (business language):

Term Direction Operational Translation
overbook + Each additional seat oversold increases predicted compensation by β₁ passengers — set the overbook limit below the point where this prediction crosses your compensation-cost threshold
no_show Each additional no-show reduces predicted compensation by
ac_type [Medium] ± If medium aircraft shows a significantly different intercept, it requires a separately calibrated overbook limit, not a scaled version of the large-aircraft limit
ac_type [Small] ± Small-aircraft services operate on thinner no-show buffers; a statistically significant coefficient justifies a conservative, independently-set limit
month_num ± A significant positive coefficient means compensation risk rises as the year progresses; a negative coefficient means risk is front-loaded — either pattern implies that a flat annual overbook rate fails to minimise the combined impact of empty seats and denied boardings

Model limitations: OLS assumes normally distributed, continuous residuals with homoscedastic variance. Because compensate is a non-negative integer count (zero-inflated), these assumptions are violated. The OLS model is used here as specified by the assignment (Ch. 9) and provides interpretable directional coefficients; however, Section 11 identifies a Zero-Inflated Negative Binomial GLM as the preferred modelling approach for operational deployment.


10 Integrated Findings

The five techniques collectively answer the central question — what is the recommended overbooking threshold? — from five complementary angles:

Technique Key Finding Recommended OB Rate Implication
EDA No-show mean (5.56) exceeds overbook mean (3.85); 625 flights still departed with empty seats The current overbook factor is below the no-show mean on most routes — there is room to increase it without raising compensation frequency
Visualisation Compensation risk is concentrated on specific routes (>20% frequency) and in specific month × aircraft combinations A uniform network-wide factor fails these segments; a differentiated matrix is required
Hypothesis Testing Net exposure and compensation probability differ significantly across aircraft types and months Statistically justifies separate overbook factors by aircraft type and a monthly limit calendar
Correlation Net exposure (Overbook − NoShow) is the dominant driver of compensation; no-show rate is the dominant driver of empty seats The recommended threshold balances these two forces; it is the point where expected net exposure ≈ 0 for the median no-show on that route
Regression Compensation is a predictable linear function of overbook count, no-show count, aircraft type, and month The regression coefficients provide a closed-form formula for the break-even overbook count; Table 12 (Estimated Low-Risk Overbook Levels by Aircraft Type) quantifies this per aircraft type

Single recommendation: Implement a route × aircraft-type × month overbook matrix where the target overbook count per cell is set at the 80th percentile of the historical no-show distribution for that cell. The 80th percentile is selected for three reasons: (i) asymmetric costs — denied boarding incurs compensation payments, NCAA regulatory exposure, and measurable brand damage, whereas an empty seat costs only marginal lost revenue, making a higher threshold preferable to the median; (ii) operational tolerance — internal stakeholder consultation indicates that a compensation event rate below 20% per departure is operationally manageable for gate teams and acceptable to the revenue management function; (iii) empirical fit — the dataset shows that on 80% of observed departures, no-shows equalled or exceeded 4 seats, meaning an 80th-percentile threshold translates to a concrete, immediately actionable overbook count rather than a theoretical ideal. Setting the target at the 80th percentile therefore ensures the overbook buffer is absorbed on 80% of all departures (eliminating empty seats on 80% of flights), while compensation events occur only on the 20% of flights where no-shows fall below that threshold. The regression model informs this pre-flight policy decision — it is not a gate-close tool. Once check-in closes and the final no-show count is known, the required number of re-accommodations is determined directly by max(overbook − no_show, 0); at that point, gate teams initiate the voluntary off-load announcement based on that direct count, not on a model prediction.


11 Limitations & Further Work

  1. OLS for zero-inflated count data. Both overbook and compensate are non-negative integers. OLS does not respect the integer lower bound, can predict negative values, and its residuals violate normality assumptions under zero-inflation. A Zero-Inflated Negative Binomial GLM — which separately models the probability of any compensation event and the count conditional on an event occurring — would materially improve out-of-sample prediction accuracy for the rare but high-impact flights where compensate > 5.

  2. No revenue or cost data. The dataset does not contain the fare paid per passenger, the cost of compensation under Nigerian Civil Aviation Authority regulations, or the marginal cost of carrying an empty seat. The analysis minimises compensation event count, not net revenue impact. Integrating booking-system fare data and compensation cost rates would enable a full expected-profit model where the recommended overbook rate maximises E[Revenue from oversold seats] − E[Compensation cost].

  3. Single year of data. The analysis covers January–December 2025 only. Seasonal patterns (e.g., the peak-month no-show surge) cannot be distinguished from year-specific events (fuel crises, public-holiday calendar shifts, competitor entry). Three or more years of data would enable reliable seasonal decomposition and trend identification.

  4. No causal identification. The correlation between overbook count and compensation is observational. Routes with high overbook counts may also operate in high-demand corridors where passengers are less likely to no-show (business travellers vs. leisure), confounding the comparison. A difference-in-differences design — comparing compensation rates before and after a deliberate policy change on a subset of routes — would isolate the causal effect of the overbook limit.

  5. Route-level heterogeneity not fully modelled. The regression uses aircraft type and month as group-level predictors, but individual routes have distinct demand profiles, no-show cultures, and booking lead times. A multilevel (mixed-effects) model with route as a random effect would better partition within-route and between-route variation, producing route-specific overbook recommendations with appropriate uncertainty intervals.


12 References

Adi, B. (2026). AI-powered business analytics: A practical textbook for data-driven decision making — from data fundamentals to machine learning in Python and R. Lagos Business School / markanalytics.online. https://markanalytics.online

Agu-Uzochukwu, O. (2025). Nigerian domestic airline flight overbooking and demand data — Overbook Report 2025 [Dataset]. Revenue Management Department, Nigerian Domestic Airline, Lagos, Nigeria. Data available on request from the author.

Allaire, J. J., Teague, C., Scheidegger, C., Xie, Y., & Dervieux, C. (2022). Quarto (Version 1.x) [Computer software]. https://doi.org/10.5281/zenodo.5960048

R Core Team. (2024). R: A language and environment for statistical computing (Version 4.x). R Foundation for Statistical Computing. https://www.R-project.org/

Van Rossum, G., & Drake, F. L. (2009). Python 3 reference manual. CreateSpace.

Wickham, H., Averick, M., Bryan, J., Chang, W., McGowan, L., François, R., Grolemund, G., Hayes, A., Henry, L., Hester, J., Kuhn, M., Pedersen, T. L., Miller, E., Bache, S. M., Müller, K., Ooms, J., Robinson, D., Seidel, D. P., Spinu, V., … Yutani, H. (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686. https://doi.org/10.21105/joss.01686

Wickham, H. (2016). ggplot2: Elegant graphics for data analysis. Springer. https://doi.org/10.1007/978-3-319-24277-4

Code
pkgs_cite <- c("janitor","skimr","ggcorrplot","car",
               "broom","kableExtra","patchwork","nortest","effectsize","lmtest")
for (p in pkgs_cite) {
  cat(paste0("*", p, ":* "))
  tryCatch({
    cat(format(citation(p), style = "text")[1])
    cat("\n\n")
  }, error = function(e) cat("See package documentation.\n\n"))
}

janitor: Firke S (2024). janitor: Simple Tools for Examining and Cleaning Dirty Data. doi:10.32614/CRAN.package.janitor https://doi.org/10.32614/CRAN.package.janitor, R package version 2.2.1, https://CRAN.R-project.org/package=janitor.

skimr: Waring E, Quinn M, McNamara A, Arino de la Rubia E, Zhu H, Ellis S (2026). skimr: Compact and Flexible Summaries of Data. doi:10.32614/CRAN.package.skimr https://doi.org/10.32614/CRAN.package.skimr, R package version 2.2.2, https://CRAN.R-project.org/package=skimr.

ggcorrplot: Kassambara A (2023). ggcorrplot: Visualization of a Correlation Matrix using ‘ggplot2’. doi:10.32614/CRAN.package.ggcorrplot https://doi.org/10.32614/CRAN.package.ggcorrplot, R package version 0.1.4.1, https://CRAN.R-project.org/package=ggcorrplot.

car: Fox J, Weisberg S (2019). An R Companion to Applied Regression, Third edition. Sage, Thousand Oaks CA. https://www.john-fox.ca/Companion/.

broom: Robinson D, Hayes A, Couch S, Hvitfeldt E (2026). broom: Convert Statistical Objects into Tidy Tibbles. doi:10.32614/CRAN.package.broom https://doi.org/10.32614/CRAN.package.broom, R package version 1.0.12, https://CRAN.R-project.org/package=broom.

kableExtra: Zhu H (2024). kableExtra: Construct Complex Table with ‘kable’ and Pipe Syntax. doi:10.32614/CRAN.package.kableExtra https://doi.org/10.32614/CRAN.package.kableExtra, R package version 1.4.0, https://CRAN.R-project.org/package=kableExtra.

patchwork: Pedersen T (2025). patchwork: The Composer of Plots. doi:10.32614/CRAN.package.patchwork https://doi.org/10.32614/CRAN.package.patchwork, R package version 1.3.2, https://CRAN.R-project.org/package=patchwork.

nortest: Gross J, Ligges U (2015). nortest: Tests for Normality. doi:10.32614/CRAN.package.nortest https://doi.org/10.32614/CRAN.package.nortest, R package version 1.0-4, https://CRAN.R-project.org/package=nortest.

effectsize: Ben-Shachar MS, Lüdecke D, Makowski D (2020). “effectsize: Estimation of Effect Size Indices and Standardized Parameters.” Journal of Open Source Software, 5(56), 2815. doi:10.21105/joss.02815 https://doi.org/10.21105/joss.02815, https://doi.org/10.21105/joss.02815.

lmtest: Zeileis A, Hothorn T (2002). “Diagnostic Checking in Regression Relationships.” R News, 2(3), 7-10. https://CRAN.R-project.org/doc/Rnews/.


13 Appendix: AI Usage Statement

Claude (Anthropic), an AI coding assistant, was used in this project to: (i) generate the initial structure of this Quarto document and scaffold R code chunks for each of the five required techniques; (ii) suggest appropriate statistical tests given the data structure (e.g., recommending Kruskal-Wallis over ANOVA after EDA confirmed non-normal count distributions); (iii) draft narrative text templates for the business interpretation sections; and (iv) construct the Python conversion script that transformed the binary .xlsb source file into the .csv format loaded by R.

All analytical decisions — which variables to treat as the outcome and predictors, which aircraft grouping to apply, how to interpret the net exposure concept, what the recommended overbooking threshold means in operational terms, and how to frame the trade-off between empty seats and passenger compensation — were made independently by the author based on direct operational experience as an Airport Operations Manager at a Nigerian domestic airline and on the course material from Prof Bongo Adi’s Data Analytics 1 programme at Lagos Business School. The author verified every result against the actual rendered output and takes full responsibility for all interpretations and conclusions presented in this document.

The underlying dataset was extracted directly from the airline’s Departure Control System. No data was fabricated, simulated, or otherwise altered.