Overbooking Analytics: Exploratory and Inferential Analysis of a Nigerian Domestic Airline’s Flight Operations (2025)

Author

Ogechi Agu-Uzochukwu

Published

May 19, 2026


1 Executive Summary

A Nigerian domestic airline practises deliberate overbooking — selling more seats than aircraft capacity — to offset revenue losses from passenger no-shows. The core trade-off is precise: flying with empty seats costs the airline lost fare revenue; denying a passenger boarding costs compensation, regulatory liability, and brand damage. Neither extreme is acceptable. The recommended overbooking threshold sits at the point where no-show absorption and overbook exposure are in balance.

This study analyses 2,153 domestic flight departures across the full calendar year 2025, covering 16 routes between seven airports: Abuja (ABV), Lagos (LOS), Port Harcourt (PHC), Benin (BNI), Jos (JOS), Asaba (ABB), and Warri (QRW). Three aircraft types are operated — 70-seat, 146-seat, and 149-seat — each with distinct booking dynamics. Five analytical techniques (EDA, Visualisation, Hypothesis Testing, Correlation, and Linear Regression) are applied to determine the structural drivers of overbooking efficiency and passenger compensation.

Key findings: (i) the airline currently overbooks at a mean rate of 3.85 seats per flight, but no-shows average 5.56, meaning the overbook buffer is consistently absorbed on most flights; (ii) on 294 flights (13.7% of all departures), overbooking exceeded no-shows, triggering passenger compensation; (iii) 625 flights (29%) still carried empty seats despite overbooking, indicating the current overbook factor is below the no-show mean on many routes; (iv) aircraft type and month are significantly associated with compensation event frequency, based on non-parametric and chi-squared hypothesis testing.

Recommendation: The airline should adopt a route × aircraft-type × month overbook rate matrix where the target overbook count is set at the 80th percentile of the historical no-show distribution for that cell. The 80th percentile is chosen because it represents a deliberate asymmetry: the cost of one denied boarding (compensation, regulatory liability, and reputational damage) materially exceeds the cost of one empty seat (foregone marginal revenue only), so the threshold is set above the median to absorb no-shows on most departures while accepting a controlled residual risk of 20% rather than 50%. This eliminates empty-seat losses on 80% of departures while accepting a controlled compensation event rate of approximately 20%.


2 Professional Disclosure

Role: Airport Operations Manager at a Nigerian domestic airline — responsible for station operations, ensuring safe and on-time flight departures, customer service efficiency, and passenger satisfaction. The role controls station operational costs, manages flight delays, cancellations, and operational disruptions, resolves customer complaints, and oversees the ground operations team.

Organisation: A Nigerian domestic airline operating scheduled passenger services across seven airports in the domestic aviation sector. The organisation’s core business is time-critical passenger movement on high-frequency short-haul routes, where seat yield, on-time performance, and passenger experience are the primary operational and commercial metrics.

Technique relevance to daily work:

# Technique Why it matters in this role
1 EDA I review daily departure data to identify patterns in overbooking, no-shows, and compensation events. EDA surfaces distributional anomalies — e.g., atypically high overbook counts on specific routes — that would be buried in weekly summary reports
2 Visualisation Station performance dashboards are presented to the Head of Ground Operations every Monday. Clear visualisations of the overbook-vs-no-show balance and compensation frequency translate operational data into actions that non-technical managers can implement immediately
3 Hypothesis Testing When compensation incidents spike on a route, I need statistical evidence to determine whether the pattern is a genuine policy failure or random variation before escalating a formal policy-change request to Revenue Management
4 Correlation When compensation events spike, I need to identify which variable is the dominant driver before escalating. If overbook count drives compensation, the case goes to Revenue Management to revise the booking limit. If no-show variability drives it, the case goes to Revenue Management’s demand forecasting function to recalibrate the no-show model. Correlation analysis tells me which argument to make and to whom
5 Linear Regression Every compensation event I manage at the gate generates costs I am accountable for: passenger re-accommodation, delay, complaints, and regulatory exposure. When I present a policy-change request to Revenue Management, anecdote is insufficient — I need a quantified argument. A regression model that relates compensation frequency to the overbook rate, aircraft type, and month gives me that argument: the coefficients show exactly how many additional compensation events each extra oversold seat produces, by aircraft type and by season. Table 12 and Figure 5 express this as a concrete overbook-rate matrix that Revenue Management can act on directly

3 Data Collection & Sampling

Source: Airline departure control system (DCS).

Collection method: Extracted DCS data for the period and saved it in Excel format.

Sampling frame: All domestic scheduled departures in calendar year 2025, across seven airports and 16 origin–destination routes.

Sample size: 2,153 flight departures, 29 unique flight numbers, 3 aircraft types.

Time period: 1 January 2025 – 31 December 2025 (full calendar year; no gaps).

Variables collected (12 source + 5 derived):

Code
var_info <- tibble(
  Variable = c("Flt No","Dep Port","Arr Port","Route","Sch Dep Dt","Months",
               "Capacity","Seat Sold","Overbook","Flown","NoShow","Compensate",
               "— empty_seats","— ob_rate","— ns_rate","— net_exposure","— denied_flag"),
  Source   = c(rep("Source",12), rep("Derived",5)),
  Type     = c("numeric","categorical","categorical","categorical","date","ordinal",
               "numeric","numeric","numeric","numeric","numeric","numeric",
               "numeric","numeric","numeric","numeric","binary"),
  Role     = c(
    "Flight identifier",
    "Departure airport (categorical predictor)",
    "Arrival airport (categorical predictor)",
    "Origin–destination pair (categorical predictor)",
    "Scheduled departure date (time index)",
    "Calendar month 1–12 (ordinal predictor)",
    "Aircraft seat capacity: 70, 146, or 149 (binary predictor)",
    "Total seats sold per departure",
    "Oversold seats per departure (primary policy variable)",
    "Actual passengers who boarded (≤ Capacity)",
    "Passengers who did not board (key buffer variable)",
    "Passengers compensated for denied boarding (primary outcome)",
    "Capacity − Flown: empty seats on departure (lost-revenue metric)",
    "Overbook / Capacity: overbooking rate (normalised policy variable)",
    "NoShow / Seat Sold: no-show rate per departure",
    "Overbook − NoShow: >0 triggers compensation",
    "Compensate > 0 indicator (Yes/No)"
  )
)

var_info |>
  kable(
    caption   = "Table 1: Variable Inventory — Overbook_Jay_new.csv",
    col.names = c("Variable", "Source", "Type", "Role"),
    align     = c("l","c","l","l")
  ) |>
  kable_styling(bootstrap_options = c("striped","hover","condensed"),
                full_width = TRUE, font_size = 13) |>
  row_spec(0, bold = TRUE, background = "#F5F5F5") |>
  row_spec(13:17, italic = TRUE, color = "grey40") |>
  column_spec(1, bold = TRUE, width = "12em") |>
  column_spec(2, width = "6em") |>
  column_spec(3, width = "8em")
Table 1: Variable Inventory — Overbook_Jay_new.csv
Variable Source Type Role
Flt No Source numeric Flight identifier
Dep Port Source categorical Departure airport (categorical predictor)
Arr Port Source categorical Arrival airport (categorical predictor)
Route Source categorical Origin–destination pair (categorical predictor)
Sch Dep Dt Source date Scheduled departure date (time index)
Months Source ordinal Calendar month 1–12 (ordinal predictor)
Capacity Source numeric Aircraft seat capacity: 70, 146, or 149 (binary predictor)
Seat Sold Source numeric Total seats sold per departure
Overbook Source numeric Oversold seats per departure (primary policy variable)
Flown Source numeric Actual passengers who boarded (≤ Capacity)
NoShow Source numeric Passengers who did not board (key buffer variable)
Compensate Source numeric Passengers compensated for denied boarding (primary outcome)
— empty_seats Derived numeric Capacity − Flown: empty seats on departure (lost-revenue metric)
— ob_rate Derived numeric Overbook / Capacity: overbooking rate (normalised policy variable)
— ns_rate Derived numeric NoShow / Seat Sold: no-show rate per departure
— net_exposure Derived numeric Overbook − NoShow: >0 triggers compensation
— denied_flag Derived binary Compensate > 0 indicator (Yes/No)

Data loading:

Code
# Source file: Overbook_Jay_new.xlsb extracted from DCS and saved as Excel,
# then converted to CSV using pyxlsb before loading into R
df_raw <- read_csv(
  "Overbook_Jay_new.csv",
  col_types = cols(
    `Flt No`     = col_double(),
    `Dep Port`   = col_character(),
    `Arr Port`   = col_character(),
    `Route`      = col_character(),
    `Sch Dep Dt` = col_date(format = "%Y-%m-%d"),
    `Months`     = col_double(),
    `Capacity`   = col_double(),
    `Seat Sold`  = col_double(),
    `Overbook`   = col_double(),
    `Flown`      = col_double(),
    `NoShow`     = col_double(),
    `Compensate` = col_double()
  )
)

Data cleaning and derived variables:

Code
df <- df_raw |>
  clean_names() |>
  mutate(
    date       = sch_dep_dt,
    month_num  = as.integer(months),
    month_lbl  = factor(month_num, levels = 1:12, labels = month.abb, ordered = TRUE),
    ac_type    = factor(
      case_when(
        capacity == 70  ~ "Small (70)",
        capacity == 146 ~ "Medium (146)",
        capacity == 149 ~ "Large (149)",
        TRUE            ~ "Other"
      ),
      levels = c("Small (70)", "Medium (146)", "Large (149)")
    ),
    dep_port     = factor(dep_port),
    arr_port     = factor(arr_port),
    route        = factor(route),
    empty_seats  = capacity - flown,
    ob_rate      = overbook / capacity,
    ns_rate      = no_show / seat_sold,
    net_exposure = overbook - no_show,
    denied_flag  = if_else(compensate > 0, "Compensated", "No Compensation")
  ) |>
  select(-sch_dep_dt)

# Consistent colour palette used across all plots
pal_blue   <- "#1F497D"
pal_orange <- "#C55A11"
pal_mid    <- "#2E75B6"
pal_green  <- "#375E23"
pal_3      <- c("Large (149)" = pal_blue, "Medium (146)" = pal_mid, "Small (70)" = pal_orange)

Ethical note: The dataset contains no personally identifiable information (PII). All records are flight-level aggregates — no passenger names, booking references, or employee IDs are present. Route-level performance statistics are standard internal reporting at the airline and are published here solely for academic analysis under Lagos Business School supervision.


4 Data Description

Code
cat("Dimensions:", nrow(df), "rows ×", ncol(df), "columns\n\n")
Dimensions: 2153 rows × 20 columns
Code
# Missing values displayed as a formatted two-column table
colSums(is.na(df)) |>
  as.data.frame() |>
  setNames("Missing Values") |>
  rownames_to_column("Variable") |>
  kable(
    caption   = "Table: Missing Values by Column (0 = complete)",
    col.names = c("Variable", "Missing Values"),
    align     = c("l", "r")
  ) |>
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed"),
    full_width = FALSE, position = "left"
  )
Table: Missing Values by Column (0 = complete)
Variable Missing Values
flt_no 0
dep_port 0
arr_port 0
route 0
months 0
capacity 0
seat_sold 0
overbook 0
flown 0
no_show 0
compensate 0
date 0
month_num 0
month_lbl 0
ac_type 0
empty_seats 0
ob_rate 0
ns_rate 0
net_exposure 0
denied_flag 0
Code
skim(df |> select(capacity, overbook, flown, no_show, compensate,
                  empty_seats, ob_rate, ns_rate, net_exposure, month_num))
Data summary
Name select(…)
Number of rows 2153
Number of columns 10
_______________________
Column type frequency:
numeric 10
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
capacity 0 1 122.86 36.02 70.00 70.00 146.00 149.00 149.00 ▃▁▁▁▇
overbook 0 1 3.85 2.78 1.00 2.00 3.00 5.00 38.00 ▇▁▁▁▁
flown 0 1 120.76 36.21 60.00 70.00 144.00 146.00 149.00 ▃▁▁▁▇
no_show 0 1 5.56 4.25 0.00 2.00 4.00 8.00 20.00 ▇▃▂▁▁
compensate 0 1 0.39 1.33 0.00 0.00 0.00 0.00 21.00 ▇▁▁▁▁
empty_seats 0 1 2.10 3.47 0.00 0.00 0.00 5.00 10.00 ▇▁▁▁▁
ob_rate 0 1 0.03 0.02 0.01 0.01 0.03 0.04 0.30 ▇▁▁▁▁
ns_rate 0 1 0.05 0.04 0.00 0.01 0.03 0.07 0.24 ▇▃▁▁▁
net_exposure 0 1 -1.71 3.93 -10.00 -5.00 0.00 0.00 21.00 ▃▇▁▁▁
month_num 0 1 6.44 3.06 1.00 4.00 6.00 9.00 12.00 ▆▇▆▇▆
Code
p_hist <- function(var, label, fill_col) {
  ggplot(df, aes(x = .data[[var]])) +
    geom_histogram(bins = 30, fill = fill_col, colour = "white", alpha = 0.85) +
    scale_y_continuous(labels = comma) +
    labs(x = label, y = "Count") +
    theme_minimal(base_size = 11)
}

p1 <- p_hist("overbook",    "Overbook (seats)", pal_blue)
p2 <- p_hist("no_show",     "No-Shows (pax)",   pal_mid)
p3 <- p_hist("compensate",  "Compensate (pax)", pal_orange)
p4 <- ggplot(df, aes(x = ac_type, fill = ac_type)) +
        geom_bar(show.legend = FALSE) +
        scale_fill_manual(values = pal_3) +
        scale_y_continuous(labels = comma) +
        labs(x = "Aircraft Type", y = "Departures") +
        theme_minimal(base_size = 11)

(p1 | p2) / (p3 | p4) +
  plot_annotation(
    title    = "Figure 1: Distributions of Core Variables and Aircraft Mix",
    subtitle = "Overbook is right-skewed; Compensate is zero-inflated; three aircraft types operate",
    theme    = theme(plot.title = element_text(face = "bold", size = 13))
  )

Figure 1: Distributions of key numeric variables

What the data shows — in plain language: On a typical flight, the airline sells roughly 4 extra seats beyond what the aircraft can physically carry. On those same flights, around 6 passengers on average simply do not show up — which means the buffer of absent passengers is usually large enough to absorb the extra bookings without anyone being turned away. However, on about 1 in 7 flights (294 out of 2,153), more passengers arrived than there were seats, meaning some travellers had to be compensated and denied boarding. On the other side, nearly 3 in every 10 flights (625 out of 2,153) still had empty seats at departure despite overbooking — meaning the airline left revenue on the table even while trying to protect against no-shows. The fleet operates three aircraft sizes — small (70 seats, 682 flights), medium (146 seats, 802 flights), and large (149 seats, 669 flights) — in roughly equal proportions across the network.


5 Technique 1 — Exploratory Data Analysis (EDA)

5.1 Theory Recap

Exploratory Data Analysis (EDA) is the foundational, non-negotiable first step in any data investigation. As Adi (2026, Chapter 4) explains, EDA encompasses four activities carried out before any modelling begins: (1) descriptive statistics — computing measures of central tendency (mean, median), dispersion (standard deviation, IQR), and shape (skewness, kurtosis) for each numeric variable; (2) missing-value profiling — identifying which columns contain gaps and whether missingness is random or systematic; (3) outlier detection — flagging observations that fall beyond the IQR fence (Q3 + 1.5 × IQR) and deciding whether they represent genuine extreme events or data errors; and (4) distributional shape assessment — using histograms and density plots to determine whether variables are normally distributed, skewed, zero-inflated, or multimodal.

A critical lesson from Anscombe’s Quartet (Adi, 2026, Chapter 4) is that four entirely different datasets can share identical means, variances, and Pearson correlations, yet display radically different scatterplot patterns. This demonstrates that numerical summaries alone are insufficient — visual inspection is not optional. EDA findings directly govern which statistical tests and models are valid downstream: skewed or zero-inflated variables violate OLS normality assumptions and require non-parametric tests or specialised GLMs.

Business justification: EDA is the mandatory first step before any overbooking policy recommendation. It reveals whether the data is clean, identifies operational anomalies, and determines whether the overbook rate is systematically above or below the no-show buffer across different route and aircraft segments. At a Nigerian domestic airline, a single high-overbook departure (e.g., a mis-priced group block) can distort network-level averages if not identified and understood before building a predictive model.

Code
# ── Data quality issue 1: Verify the no-show absorption rule ─────────────
# Expected: Compensate > 0 if and only if Overbook > NoShow
rule_check <- df |>
  mutate(
    rule_holds = (compensate > 0) == (net_exposure > 0)
  ) |>
  summarise(
    total_flights         = n(),
    compensate_events     = sum(compensate > 0),
    net_positive_exposure = sum(net_exposure > 0),
    rule_consistent       = sum(rule_holds),
    rule_violation_pct    = round((1 - mean(rule_holds)) * 100, 2)
  )

rule_check |>
  kable(
    caption   = "Table 2: Absorption Rule Verification (Compensate ↔ Overbook > NoShow)",
    col.names = c("Total Flights","Compensate Events","Net Exposure > 0",
                  "Rule Consistent","Rule Violation %")
  ) |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)
Table 2: Absorption Rule Verification (Compensate ↔︎ Overbook > NoShow)
Total Flights Compensate Events Net Exposure > 0 Rule Consistent Rule Violation %
2153 294 294 2153 0
Code
# ── Data quality issue 2: Empty seats on overbooked flights ──────────────
# If overbook > 0 AND empty_seats > 0, the overbook failed to fill the plane
empty_ob <- df |>
  filter(empty_seats > 0) |>
  summarise(
    flights_with_empty = n(),
    pct_of_total       = round(n() / nrow(df) * 100, 1),
    avg_empty          = round(mean(empty_seats), 2),
    max_empty          = max(empty_seats),
    revenue_exposure   = paste0("Avg ", round(mean(empty_seats), 1), " empty seats per affected flight")
  )

empty_ob |>
  kable(
    caption   = "Table 3: Flights with Empty Seats Despite Overbooking — Revenue Leakage",
    col.names = c("Flights with Empty Seats","% of Total","Avg Empty","Max Empty","Revenue Note")
  ) |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)
Table 3: Flights with Empty Seats Despite Overbooking — Revenue Leakage
Flights with Empty Seats % of Total Avg Empty Max Empty Revenue Note
625 29 7.22 10 Avg 7.2 empty seats per affected flight
Code
# ── Outlier detection on Overbook ────────────────────────────────────────
Q1  <- quantile(df$overbook, 0.25)
Q3  <- quantile(df$overbook, 0.75)
IQR_val     <- Q3 - Q1
upper_fence <- Q3 + 1.5 * IQR_val

outliers_ob <- df |> filter(overbook > upper_fence)
cat("Overbook IQR upper fence:", upper_fence,
    "| Outlier flights:", nrow(outliers_ob),
    "(", round(nrow(outliers_ob)/nrow(df)*100, 1), "%)\n")
Overbook IQR upper fence: 9.5 | Outlier flights: 79 ( 3.7 %)
Code
# Top 10 highest-overbook flights
df |>
  arrange(desc(overbook)) |>
  select(date, route, ac_type, capacity, seat_sold, overbook,
         no_show, empty_seats, compensate) |>
  slice_head(n = 10) |>
  kable(
    caption   = "Table 4: Top 10 Highest-Overbook Flights",
    col.names = c("Date","Route","Aircraft","Capacity","Sold","Overbook",
                  "NoShow","Empty Seats","Compensate")
  ) |>
  kable_styling(bootstrap_options = c("striped","hover","condensed"), full_width = FALSE)
Table 4: Top 10 Highest-Overbook Flights
Date Route Aircraft Capacity Sold Overbook NoShow Empty Seats Compensate
2025-07-20 ABVLOS Medium (146) 146 184 38 17 0 21
2025-11-16 LOSABV Medium (146) 146 167 21 9 0 12
2025-09-18 BNILOS Small (70) 70 91 21 20 0 1
2025-10-10 PHCLOS Large (149) 149 169 20 11 0 9
2025-12-21 ABVLOS Medium (146) 146 165 19 4 0 15
2025-04-13 ABVLOS Medium (146) 146 164 18 18 0 0
2025-12-13 ABVPHC Medium (146) 146 164 18 8 0 10
2025-08-03 ABVLOS Large (149) 149 166 17 12 0 5
2025-07-10 LOSABV Medium (146) 146 163 17 15 0 2
2025-12-01 LOSABV Medium (146) 146 163 17 14 0 3
Code
# ── Monthly performance summary ──────────────────────────────────────────
df |>
  group_by(month_lbl) |>
  summarise(
    Flights         = n(),
    Avg_Overbook    = round(mean(overbook), 2),
    Avg_NoShow      = round(mean(no_show), 2),
    Avg_Compensate  = round(mean(compensate), 3),
    Avg_EmptySeats  = round(mean(empty_seats), 2),
    Comp_Rate       = paste0(round(mean(compensate > 0) * 100, 1), "%")
  ) |>
  kable(
    caption   = "Table 5: Monthly Performance Summary",
    col.names = c("Month","Flights","Avg Overbook","Avg NoShow",
                  "Avg Compensate","Avg Empty Seats","% Flights with Compensation")
  ) |>
  kable_styling(bootstrap_options = c("striped","hover","condensed"), full_width = FALSE)
Table 5: Monthly Performance Summary
Month Flights Avg Overbook Avg NoShow Avg Compensate Avg Empty Seats % Flights with Compensation
Jan 95 2.92 4.80 0.326 2.21 14.7%
Feb 154 3.39 5.09 0.253 1.95 9.7%
Mar 128 3.23 5.09 0.258 2.12 9.4%
Apr 311 3.56 5.35 0.328 2.12 11.6%
May 221 3.20 5.02 0.199 2.02 11.8%
Jun 178 3.68 5.37 0.264 1.96 11.8%
Jul 237 4.04 5.92 0.409 2.29 13.1%
Aug 255 3.46 5.40 0.278 2.22 12.9%
Sep 208 3.90 5.73 0.308 2.13 13.9%
Oct 106 4.42 6.44 0.387 2.41 13.2%
Nov 98 5.05 6.23 0.816 2.00 19.4%
Dec 162 6.10 6.66 1.130 1.69 27.2%
Code
import pandas as pd
import numpy as np

df_py = pd.read_csv("Overbook_Jay_new.csv", parse_dates=["Sch Dep Dt"])
df_py.columns = [c.lower().replace(" ", "_") for c in df_py.columns]
df_py = df_py.rename(columns={"noshow": "no_show", "sch_dep_dt": "date"})

# Derived columns
df_py["empty_seats"]  = df_py["capacity"] - df_py["flown"]
df_py["ob_rate"]      = df_py["overbook"] / df_py["capacity"]
df_py["net_exposure"] = df_py["overbook"] - df_py["no_show"]

print("=== Shape:", df_py.shape)
=== Shape: (2153, 15)
Code
print("\n=== Missing values:\n", df_py.isnull().sum().to_string())

=== Missing values:
 flt_no          0
dep_port        0
arr_port        0
route           0
date            0
months          0
capacity        0
seat_sold       0
overbook        0
flown           0
no_show         0
compensate      0
empty_seats     0
ob_rate         0
net_exposure    0
Code
print("\n=== Numeric summary:")

=== Numeric summary:
Code
cols = ["capacity","overbook","flown","no_show","compensate","empty_seats","ob_rate"]
print(df_py[cols].describe().round(3).to_string())
       capacity  overbook     flown   no_show  compensate  empty_seats   ob_rate
count  2153.000  2153.000  2153.000  2153.000    2153.000     2153.000  2153.000
mean    122.858     3.849   120.761     5.559       0.386        2.097     0.033
std      36.021     2.782    36.206     4.247       1.327        3.471     0.023
min      70.000     1.000    60.000     0.000       0.000        0.000     0.007
25%      70.000     2.000    70.000     2.000       0.000        0.000     0.014
50%     146.000     3.000   144.000     4.000       0.000        0.000     0.029
75%     149.000     5.000   146.000     8.000       0.000        5.000     0.043
max     149.000    38.000   149.000    20.000      21.000       10.000     0.300
Code
print("\n=== Aircraft type distribution:")

=== Aircraft type distribution:
Code
print(df_py["capacity"].value_counts().sort_index().to_string())
capacity
70.0     682
146.0    802
149.0    669
Code
print("\n=== Compensation events:", (df_py["compensate"] > 0).sum(),
      "out of", len(df_py), "flights")

=== Compensation events: 294 out of 2153 flights
Code
print("=== Flights with empty seats:", (df_py["empty_seats"] > 0).sum())
=== Flights with empty seats: 625

EDA findings:

  1. Data quality confirmed — no missing values. All 2,153 rows are complete across all 12 source variables. Because the dataset contains the full population of departures within the study period, traditional sampling-selection concerns are reduced.

  2. The absorption rule holds perfectly. Compensation events occur if and only if Overbook > NoShow (i.e., net_exposure > 0). This is not an accounting artefact — it reflects the airline’s confirmed policy: every seat sold beyond no-show absorption results in a denied boarding with compensation. The rule holds in 100% of rows.

  3. Revenue leakage from empty seats (625 flights, 29%). Despite overbooking on every departure, 625 flights still landed with empty seats — meaning the current overbook factor was insufficient to compensate for the actual no-show volume on those days. This is the underperformance side of the trade-off and a direct revenue loss signal.

  4. Outlier overbook flights exist but are operationally explainable. Flights where overbook > 12 (IQR upper fence) most commonly result from an aircraft downgrade — a last-minute swap to a smaller aircraft after tickets have already been sold on the larger one. Because the new capacity is lower, all previously booked passengers instantly become “over” the limit, producing a sudden overbooking spike that is involuntary and unrelated to the airline’s deliberate yield policy. These flights are retained as genuine observations (the event did occur and triggered real compensation) but are flagged in regression diagnostics because their outsized leverage can distort the model if left unexplained.


6 Technique 2 — Data Visualisation

6.1 Theory Recap

Data visualisation translates statistical summaries into perceptual signals that decision-makers can understand and act on. Adi (2026, Chapter 5) grounds visualisation in the grammar of graphics framework, in which every chart is decomposed into discrete, composable layers: (1) data — the underlying data frame; (2) aesthetics — mappings from variables to visual properties such as x-position, y-position, colour, shape, and size; (3) geometric objects (geoms) — the visual marks used, such as bars, lines, points, or tiles; (4) scales — controlling how aesthetic values are translated to visible outputs (e.g., a gradient colour scale); and (5) facets — small-multiples that repeat the same chart across subgroups.

Effective chart selection is governed by the comparison being made (Adi, 2026, Chapter 5): time trends call for line charts; univariate distributions call for histograms or violin plots; bivariate relationships between numeric variables call for scatter plots; cross-tabulations of two categorical variables call for heatmaps or grouped bar charts. Colour should encode meaning consistently and sparingly — using the same hue for the same variable across all plots in a document. Storytelling with data requires that all charts in a section serve a single, coherent narrative, building cumulatively toward one actionable conclusion rather than displaying every available variable.

Business justification: Revenue Management and Operations leadership are not data scientists. A well-chosen visualisation removes ambiguity from a statistical summary and makes seasonal and route-level patterns immediately actionable. The five plots below tell one coherent story: where, when, and on which aircraft the balance between overbooking and no-show absorption breaks down — and what that means for the compensation budget.

Code
p_v1 <- df |>
  group_by(month_lbl) |>
  summarise(
    Avg_Overbook   = mean(overbook),
    Avg_NoShow     = mean(no_show),
    Avg_EmptySeats = mean(empty_seats),
    .groups        = "drop"
  ) |>
  pivot_longer(c(Avg_Overbook, Avg_NoShow, Avg_EmptySeats),
               names_to = "Metric", values_to = "Value") |>
  mutate(Metric = case_match(Metric,
    "Avg_Overbook"   ~ "Avg Overbooked Seats",
    "Avg_NoShow"     ~ "Avg No-Shows",
    "Avg_EmptySeats" ~ "Avg Empty Seats at Departure"
  )) |>
  ggplot(aes(x = month_lbl, y = Value, colour = Metric, group = Metric)) +
  geom_line(linewidth = 1.2) +
  geom_point(size = 3) +
  scale_colour_manual(values = c(
    "Avg Overbooked Seats"          = pal_blue,
    "Avg No-Shows"                  = pal_green,
    "Avg Empty Seats at Departure"  = pal_orange
  )) +
  labs(
    title    = "Plot 1: Monthly Overbooking, No-Show & Empty-Seat Trends (2025)",
    subtitle = "When No-Shows > Overbook, empty seats result; when Overbook > No-Shows, compensation occurs",
    x = "Month", y = "Average per Flight", colour = NULL
  ) +
  theme_minimal(base_size = 11) +
  theme(legend.position = "bottom", plot.title = element_text(face = "bold"))
p_v1

Plot 1: Monthly Overbooking and No-Show Trend
Code
p_v2 <- df |>
  group_by(route) |>
  summarise(
    Comp_Rate    = mean(compensate > 0) * 100,
    Avg_Overbook = mean(overbook),
    Avg_NoShow   = mean(no_show),
    n            = n(),
    .groups      = "drop"
  ) |>
  arrange(desc(Comp_Rate)) |>
  ggplot(aes(x = reorder(route, Comp_Rate), y = Comp_Rate, fill = Avg_Overbook)) +
  geom_col(show.legend = TRUE) +
  geom_text(aes(label = paste0(round(Comp_Rate, 1), "%")),
            hjust = -0.15, size = 3.2, colour = "grey30") +
  coord_flip() +
  scale_fill_gradient(low = "#D6E8F5", high = pal_blue, name = "Avg\nOverbook") +
  scale_y_continuous(limits = c(0, 35), labels = function(x) paste0(x, "%")) +
  labs(
    title    = "Plot 2: Compensation Rate by Route (shaded by average overbook count)",
    subtitle = "Some routes exceed 20% compensation frequency — a policy redesign priority",
    x = "Route", y = "% of Flights with Compensation Event"
  ) +
  theme_minimal(base_size = 11) +
  theme(plot.title = element_text(face = "bold"))
p_v2

Plot 2: Compensation Rate by Route
Code
p_v3 <- ggplot(df, aes(x = no_show, y = overbook, colour = denied_flag, shape = ac_type)) +
  geom_jitter(alpha = 0.40, size = 1.8, width = 0.25, height = 0.25) +
  geom_abline(intercept = 0, slope = 1, linetype = "dashed",
              colour = "grey30", linewidth = 0.9) +
  scale_colour_manual(values = c("Compensated" = pal_orange, "No Compensation" = pal_mid),
                      name = "Outcome") +
  scale_shape_manual(values = c("Large (149)" = 16, "Medium (146)" = 17, "Small (70)" = 15),
                     name = "Aircraft") +
  annotate("text", x = 15, y = 3, label = "← No-shows absorb overbook\n(empty seats possible)",
           size = 3, colour = pal_mid, hjust = 0) +
  annotate("text", x = 1, y = 20, label = "Overbook > No-shows\n→ Compensation →",
           size = 3, colour = pal_orange, hjust = 0) +
  labs(
    title    = "Plot 3: The Overbooking Trade-Off — No-Shows vs Overbooked Seats",
    subtitle = "Points above the diagonal → overbook exceeds no-shows → passenger compensation required",
    x = "No-Shows per Departure", y = "Overbooked Seats per Departure"
  ) +
  theme_minimal(base_size = 11) +
  theme(plot.title = element_text(face = "bold"))
p_v3

Plot 3: Overbook vs No-Show trade-off by aircraft type
Code
p_v4 <- ggplot(df, aes(x = ac_type, y = ob_rate * 100, fill = ac_type)) +
  geom_violin(trim = FALSE, alpha = 0.65) +
  geom_boxplot(width = 0.10, fill = "white", outlier.colour = pal_orange, outlier.size = 1.5) +
  scale_fill_manual(values = pal_3) +
  scale_y_continuous(labels = function(x) paste0(x, "%")) +
  labs(
    title    = "Plot 4: Overbooking Rate (%) Distribution by Aircraft Type",
    subtitle = "Small aircraft are booked at a higher percentage rate than large aircraft",
    x = "Aircraft Type", y = "Overbooking Rate (% of Capacity)"
  ) +
  theme_minimal(base_size = 11) +
  theme(legend.position = "none", plot.title = element_text(face = "bold"))
p_v4

Plot 4: Overbook rate distribution by aircraft type
Code
# Note on the previous version: Avg_NetExposure (Overbook − NoShow) is negative for every
# month × aircraft cell because the fleet-wide mean no-show (5.56) exceeds the mean overbook
# (3.85), so all group averages sit below zero and only the green end of a diverging scale
# appears. To make the heatmap operationally meaningful and visually correct, we use the
# compensation rate (% of flights with Compensate > 0) as the fill — a strictly positive
# metric that clearly encodes risk in warm colours.
p_v5 <- df |>
  group_by(month_lbl, ac_type) |>
  summarise(
    Avg_NetExposure = round(mean(net_exposure), 1),
    Comp_Rate       = mean(compensate > 0) * 100,
    .groups         = "drop"
  ) |>
  ggplot(aes(x = month_lbl, y = ac_type, fill = Comp_Rate)) +
  geom_tile(colour = "white", linewidth = 0.6) +
  geom_text(aes(label = paste0(round(Comp_Rate, 1), "%")),
            size = 3.2, fontface = "bold", colour = "white") +
  scale_fill_gradient(low = "#FDE8D8", high = pal_orange,
                      name = "% Flights\nCompensated") +
  labs(
    title    = "Plot 5: Compensation Rate (%) by Month & Aircraft Type",
    subtitle = "Darker orange = higher share of flights where passengers were compensated for denied boarding",
    x = "Month", y = "Aircraft Type"
  ) +
  theme_minimal(base_size = 11) +
  theme(plot.title = element_text(face = "bold"))
p_v5

Plot 5: Compensation Rate Heatmap — Month × Aircraft Type

Visualisation narrative: The five charts collectively map the full overbooking trade-off. Plot 1 shows that monthly no-show volumes consistently exceed average overbook counts — which explains the 625 empty-seat departures — but with month-specific variation. Plot 2 identifies specific routes where the compensation rate exceeds 20%, making them priority targets for a revised overbook policy. Plot 3 is operationally the most critical: it maps the exact threshold where no-shows fail to absorb the overbook, and where compensation becomes inevitable. Plot 4 shows that small aircraft are booked at a proportionally higher rate, despite their narrower absolute buffer. Plot 5 maps the compensation rate (% of flights with a denied-boarding event) across every month × aircraft-type combination, immediately highlighting which cells — by season and fleet type — carry the highest compensation risk and should receive the first overbook-limit revisions.

Chart type justification: The line chart (Plot 1) was preferred over a bar chart to emphasise temporal trends and the convergence/divergence between the two series. The jitter scatter with diagonal reference (Plot 3) was preferred over a histogram because it simultaneously shows the overbook level, the no-show level, and the compensation outcome for every flight. The diverging heatmap (Plot 5) was preferred over small-multiple bars because it encodes both the direction and magnitude of net exposure across two categorical dimensions simultaneously.


7 Technique 3 — Hypothesis Testing

7.1 Theory Recap

Hypothesis testing is a formal decision-making framework that quantifies whether an observed pattern is likely to have arisen by chance. Adi (2026, Chapter 6) outlines the standard procedure in five steps: (1) state hypotheses — a null hypothesis H₀ (no effect) and an alternative hypothesis H₁ (effect exists); (2) check assumptions — verify whether the data meets the requirements of the chosen test (e.g., normality, homogeneity of variance); (3) compute a test statistic and its sampling distribution under H₀; (4) compare the p-value to α = 0.05 — if p < 0.05, reject H₀; and (5) report effect size to assess practical (not just statistical) significance.

When normality assumptions are violated — as they routinely are with skewed count data — non-parametric alternatives are preferred (Adi, 2026, Chapter 6). The Kruskal-Wallis test is a rank-based analogue of one-way ANOVA that tests whether medians differ across three or more independent groups without assuming any particular distributional shape. If significant, pairwise Wilcoxon tests with Bonferroni correction identify which specific group pairs differ. For relationships between two categorical variables, the chi-squared test of independence examines whether the row variable and column variable are associated, based on the difference between observed and expected cell frequencies. Effect size metrics — such as epsilon-squared (ε²) for Kruskal-Wallis and Cramér’s V for chi-squared — translate statistical significance into a measure of practical magnitude, which is essential for deciding whether a policy change is warranted.

Business justification: Observed differences in compensation rates across aircraft types and months could be due to chance variation in no-show behaviour rather than systemic overbooking policy differences. Hypothesis testing provides defensible statistical evidence to justify a formal policy change — specifically, the adoption of separate overbook factors by aircraft type and separate monthly overbook limits — before taking those recommendations to the Revenue Management team.

7.2 Hypothesis 1 — Does the net exposure (Overbook − NoShow) differ significantly across aircraft types?

\[H_0: \tilde{\mu}_{Small} = \tilde{\mu}_{Medium} = \tilde{\mu}_{Large} \quad \text{(median net exposure is equal across aircraft types)}\] \[H_1: \text{At least one aircraft type has a different median net exposure}\]

Code
# Step 1: Check normality per group (Anderson-Darling, robust for large n)
for (grp in levels(df$ac_type)) {
  x <- df |> filter(ac_type == grp) |> pull(net_exposure)
  cat(grp, "— Anderson-Darling:\n")
  print(ad.test(x))
  cat("\n")
}
Small (70) — Anderson-Darling:

    Anderson-Darling normality test

data:  x
A = 90.484, p-value < 2.2e-16


Medium (146) — Anderson-Darling:

    Anderson-Darling normality test

data:  x
A = 76.577, p-value < 2.2e-16


Large (149) — Anderson-Darling:

    Anderson-Darling normality test

data:  x
A = 73.696, p-value < 2.2e-16
Code
# Step 2: Non-normal data → Kruskal-Wallis (non-parametric ANOVA)
kw1 <- kruskal.test(net_exposure ~ ac_type, data = df)
print(kw1)

    Kruskal-Wallis rank sum test

data:  net_exposure by ac_type
Kruskal-Wallis chi-squared = 15.776, df = 2, p-value = 0.0003752
Code
# Step 3: Post-hoc pairwise comparisons (Wilcoxon, Bonferroni-adjusted)
pw1 <- pairwise.wilcox.test(df$net_exposure, df$ac_type,
                             p.adjust.method = "bonferroni", exact = FALSE)
pw1$p.value |>
  as.data.frame() |>
  rownames_to_column("Type A") |>
  pivot_longer(-`Type A`, names_to = "Type B", values_to = "p_adj") |>
  filter(!is.na(p_adj)) |>
  kable(caption = "Table 6: Pairwise Aircraft Type Comparisons — Net Exposure",
        col.names = c("Aircraft A","Aircraft B","Bonferroni-adjusted p"),
        digits = 4) |>
  kable_styling(bootstrap_options = c("striped","hover"), full_width = FALSE)
Table 6: Pairwise Aircraft Type Comparisons — Net Exposure
Aircraft A Aircraft B Bonferroni-adjusted p
Medium (146) Small (70) 0.0007
Large (149) Small (70) 1.0000
Large (149) Medium (146) 0.0069
Code
# Step 4: Effect size (epsilon-squared for Kruskal-Wallis)
effectsize::rank_epsilon_squared(net_exposure ~ ac_type, data = df)
Epsilon2 (rank) |       95% CI
------------------------------
7.33e-03        | [0.00, 1.00]

- One-sided CIs: upper bound fixed at [1.00].
Code
# Summary statistics per aircraft type
df |>
  group_by(ac_type) |>
  summarise(
    n              = n(),
    Median_NetExp  = round(median(net_exposure), 2),
    Mean_NetExp    = round(mean(net_exposure), 2),
    Pct_Comp       = paste0(round(mean(compensate > 0)*100, 1), "%")
  ) |>
  kable(caption = "Table 7: Net Exposure Summary by Aircraft Type",
        col.names = c("Aircraft Type","n","Median Net Exposure","Mean Net Exposure",
                      "% Flights with Compensation")) |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)
Table 7: Net Exposure Summary by Aircraft Type
Aircraft Type n Median Net Exposure Mean Net Exposure % Flights with Compensation
Small (70) 682 0 -1.94 9.5%
Medium (146) 802 0 -1.30 17.8%
Large (149) 669 0 -1.97 12.9%

Interpretation: The test result is statistically significant (p < 0.05), which means we reject H0 and accept H1.

In plain language: the three aircraft types do not share the same compensation risk — each one behaves differently. The small 70-seat aircraft, the 146-seat, and the 149-seat aircraft each have a distinct pattern of how often sold seats exceed actual passengers, and those differences are too consistent to be down to luck. The follow-up pairwise tests confirm which specific pairs are driving that difference.

What this means for the business: Right now, if the same overbook number is applied across all aircraft, the policy is misfiring on at least one fleet type — either leaving too many empty seats or triggering too many compensation payouts. Each aircraft size needs its own separate overbook limit, set according to the no-show history for that fleet. A one-size-fits-all rule is statistically indefensible.

7.3 Hypothesis 2 — Is the compensation rate (% of flights with Compensate > 0) independent of month?

\[H_0: \text{The probability of a compensation event is equal across all months}\] \[H_1: \text{The probability of a compensation event differs by month}\]

Code
# Step 1: Build a contingency table: Month × Compensation (Yes/No)
cont_tbl <- table(df$month_lbl, df$compensate > 0)
colnames(cont_tbl) <- c("No Compensation","Compensated")

cont_tbl |>
  as.data.frame.matrix() |>
  rownames_to_column("Month") |>
  kable(caption = "Table 8: Contingency Table — Month × Compensation Event") |>
  kable_styling(bootstrap_options = c("striped","hover","condensed"), full_width = FALSE)
Table 8: Contingency Table — Month × Compensation Event
Month No Compensation Compensated
Jan 81 14
Feb 139 15
Mar 116 12
Apr 275 36
May 195 26
Jun 157 21
Jul 206 31
Aug 222 33
Sep 179 29
Oct 92 14
Nov 79 19
Dec 118 44
Code
# Step 2: Chi-squared test of independence
chi2 <- chisq.test(cont_tbl)
print(chi2)

    Pearson's Chi-squared test

data:  cont_tbl
X-squared = 34.417, df = 11, p-value = 0.000309
Code
# Step 3: Effect size — Cramér's V
cramers_v(cont_tbl)
Cramer's V (adj.) |       95% CI
--------------------------------
0.10              | [0.00, 1.00]

- One-sided CIs: upper bound fixed at [1.00].

Interpretation: The test result is statistically significant (p < 0.05), which means we reject H0 and accept H1.

In plain language: the time of year genuinely affects how often a flight ends up compensating a passenger. Compensation events do not spread evenly across the calendar — some months produce significantly more denied-boarding situations than others, and the data confirms this is a real pattern, not random fluctuation.

What this means for the business: A flat overbook policy that applies the same limit in January and in August is not appropriate. The Revenue Management team should introduce a monthly overbook calendar — raising the limit in months where no-shows are historically high (so spare capacity is rarely a problem) and lowering it in months where no-shows are low and compensation events cluster. This single adjustment could meaningfully reduce compensation payouts without giving up revenue.


8 Technique 4 — Correlation Analysis

8.1 Theory Recap

Correlation analysis quantifies the direction and strength of the relationship between two variables. Adi (2026, Chapter 8) introduces three correlation coefficients, each suited to different data types and assumptions. Pearson’s r measures the linear association between two continuous variables and assumes bivariate normality; it ranges from −1 (perfect negative linear relationship) to +1 (perfect positive), with 0 indicating no linear association. Spearman’s ρ (rho) is a rank-based version of Pearson’s r that measures monotonic association — it is robust to non-normal distributions, outliers, and ordinal data, making it the preferred coefficient when data are skewed or contain extreme values, as with count variables in this dataset. Kendall’s τ (tau) is based on concordant and discordant pairs and is preferred for small samples or data with many tied ranks.

Partial correlation extends bivariate correlation by estimating the relationship between two variables after statistically controlling for one or more confounders — for example, the correlation between overbook count and compensation after removing the shared influence of no-show volume. A correlation matrix simultaneously displays all pairwise correlations across selected variables; a heatmap visualisation allows the analyst to spot strong relationships (dark cells) and concerning collinearity among predictors at a glance, with non-significant cells blanked out. The most important conceptual principle, emphasised by Adi (2026, Chapter 8), is that correlation does not imply causation: a high ρ can reflect a common cause, reverse causation, or a spurious coincidence. Causal inference requires experimental design or quasi-experimental methods, not correlation alone.

Business justification: Before building a predictive model, I need to determine whether it is the overbook rate or the no-show rate that is the stronger driver of compensation — and whether the two are so correlated that including both in a regression would inflate standard errors through collinearity. At station level, this distinction matters practically: if overbook count is the dominant driver, the policy lever is Revenue Management’s booking-limit decision; if no-show count dominates, the lever is Revenue Management’s no-show forecasting model and the ticketing policy (e.g., stricter cancellation fees that reduce unannounced no-shows) — neither of which is controlled at station level. Correlation analysis answers this question quantitatively.

Code
num_df <- df |>
  select(overbook, no_show, compensate, empty_seats, ob_rate, ns_rate,
         net_exposure, capacity, month_num) |>
  rename(
    Overbook       = overbook,
    `No-Show`      = no_show,
    Compensate     = compensate,
    `Empty Seats`  = empty_seats,
    `OB Rate`      = ob_rate,
    `NS Rate`      = ns_rate,
    `Net Exposure` = net_exposure,
    Capacity       = capacity,
    Month          = month_num
  )

# Spearman correlation (robust to non-normality and outliers)
cor_mat  <- cor(num_df, method = "spearman", use = "complete.obs")
cor_pmat <- cor_pmat(num_df, method = "spearman")

round(cor_mat, 3) |>
  kable(caption = "Table 9: Spearman Correlation Matrix") |>
  kable_styling(bootstrap_options = c("striped","hover","condensed"), font_size = 12)
Table 9: Spearman Correlation Matrix
Overbook No-Show Compensate Empty Seats OB Rate NS Rate Net Exposure Capacity Month
Overbook 1.000 0.537 0.323 -0.042 0.838 0.397 0.183 0.194 0.209
No-Show 0.537 1.000 -0.190 0.736 0.438 0.920 -0.660 0.133 0.102
Compensate 0.323 -0.190 1.000 -0.249 0.292 -0.253 0.663 0.040 0.090
Empty Seats -0.042 0.736 -0.249 1.000 -0.048 0.732 -0.890 0.010 -0.010
OB Rate 0.838 0.438 0.292 -0.048 1.000 0.510 0.173 -0.301 0.135
NS Rate 0.397 0.920 -0.253 0.732 0.510 1.000 -0.686 -0.202 0.042
Net Exposure 0.183 -0.660 0.663 -0.890 0.173 -0.686 1.000 0.011 0.050
Capacity 0.194 0.133 0.040 0.010 -0.301 -0.202 0.011 1.000 0.100
Month 0.209 0.102 0.090 -0.010 0.135 0.042 0.050 0.100 1.000
Code
ggcorrplot(
  cor_mat,
  hc.order   = TRUE,
  type       = "lower",
  p.mat      = cor_pmat,
  lab        = TRUE,
  lab_size   = 3.5,
  sig.level  = 0.05,
  insig      = "blank",
  colors     = c(pal_orange, "white", pal_blue),
  title      = "Figure 2: Spearman Correlation Matrix — Overbooking Variables (p < 0.05 shown)",
  ggtheme    = theme_minimal(base_size = 11)
) +
  theme(plot.title = element_text(face = "bold", size = 12))

Figure 2: Correlation Heatmap with Significance Masking

Discussion of the 3 strongest correlations:

  1. Net Exposure ↔︎ Compensate (strongest positive). Net Exposure = Overbook − NoShow is by construction the immediate trigger for compensation. A strong positive correlation is expected and confirmed. Business implication: Managing net exposure — either by reducing the overbook count or by accurately forecasting no-show volumes — is the single most effective lever for reducing compensation events.

  2. Overbook ↔︎ Compensate (positive). Higher overbook counts are associated with more compensation events. However, this correlation is moderated by the no-show level: on flights with high no-show volumes, even large overbook counts produce no compensation. Business implication: Overbook count alone cannot be used as a compensation predictor without conditioning on the no-show buffer. This motivates the regression model in Technique 5.

  3. No-Show ↔︎ Empty Seats (positive). More no-shows are associated with more empty seats at departure. This captures the other side of the trade-off: when no-shows exceed the overbook buffer, the aircraft departs with vacant capacity. Business implication: Routes where no-show rates are persistently high and not matched by a sufficient overbook factor are generating systematic revenue leakage. These routes should receive increased overbook limits.

Correlation ≠ causation: The positive correlation between overbook count and compensation does not imply that overbooking alone causes denial events — no-show behaviour is the mediating variable. A controlled trial (applying a zero-overbook policy on a subset of routes for one quarter) would be required to isolate the causal effect of the booking-limit decision. The correlation analysis motivates the regression model but cannot replace it.


9 Technique 5 — Linear Regression

9.1 Theory Recap

Ordinary Least Squares (OLS) linear regression estimates the linear relationship between a continuous outcome variable (Y) and one or more predictor variables (X₁, X₂, …, Xₖ) by finding the coefficient values that minimise the sum of squared residuals. Adi (2026, Chapter 9) covers the full OLS pipeline in four stages.

Specification defines the model equation: Y = β₀ + β₁X₁ + β₂X₂ + … + βₖXₖ + ε, where β₀ is the intercept (the predicted Y when all predictors are zero), each βᵢ is the expected change in Y for a one-unit increase in Xᵢ holding all other predictors constant, and ε is the error term. Categorical predictors enter the model as dummy variables, with one category serving as the baseline reference level whose effect is absorbed into the intercept.

Estimation and inference involves computing OLS coefficient estimates, their standard errors, t-statistics (βᵢ / SE(βᵢ)), and two-sided p-values. A p-value below 0.05 indicates that the coefficient is statistically significantly different from zero. The 95% confidence interval for each coefficient provides the range of plausible values consistent with the data.

Model fit is assessed via R² (the proportion of variance in Y explained by the predictors) and adjusted R² (which penalises for the number of predictors added). The residual standard error (σ̂) measures the typical prediction error in the same units as Y. The F-statistic tests whether the model as a whole explains significantly more variance than an intercept-only baseline.

Diagnostics verify the four OLS assumptions (Adi, 2026, Chapter 9): (1) linearity — residuals vs. fitted plot should show a horizontal band around zero; (2) homoscedasticity — constant residual variance across the fitted value range; (3) normality of residuals — Q-Q plot should follow a straight diagonal; and (4) absence of influential outliers — Cook’s Distance plot identifies observations that disproportionately determine the coefficient estimates. Variance Inflation Factors (VIF > 5) signal problematic multicollinearity among predictors.

Business justification: As Airport Operations Manager, every compensation event I manage at the gate generates a cascading sequence of costs — passenger re-accommodation, ground delay, complaint records, and sometimes social media exposure. The regression model does not serve as a real-time gate tool: no-show counts are only finalised at check-in counter closure, and once they are known the exact number of passengers requiring compensation is determined directly by the absorption rule (compensate = max(overbook − no_show, 0)), making a regression prediction redundant at that point. The model’s operational value lies upstream, in the policy-setting decision: the regression coefficients quantify — for Revenue Management — the expected compensation burden generated by each additional oversold seat, controlling for aircraft type and seasonal no-show variation. The outputs (Table 12 and Figure 5) allow the overbook limit to be set in advance at the level where expected compensation cost is minimised relative to lost-revenue risk, before the flight ever opens for sale.

9.2 Model specification

The outcome is compensate (passengers requiring compensation per departure). Predictors are:

  • overbook — the primary policy variable
  • no_show — the natural absorber of overbook exposure
  • ac_type — aircraft type (proxy for route scale and booking-management practice)
  • month_num — captures seasonal no-show variation

seat_sold and flown are excluded due to algebraic collinearity with overbook and capacity.

Code
mod <- lm(
  compensate ~ overbook + no_show + ac_type + month_num,
  data = df
)

tidy(mod, conf.int = TRUE) |>
  mutate(
    sig = case_when(
      p.value < 0.001 ~ "***",
      p.value < 0.01  ~ "**",
      p.value < 0.05  ~ "*",
      p.value < 0.1   ~ ".",
      TRUE            ~ ""
    )
  ) |>
  kable(
    caption   = "Table 10: OLS Regression Coefficients — Passenger Compensation",
    col.names = c("Term","Estimate","Std Error","t-statistic","p-value",
                  "95% CI Lower","95% CI Upper","Sig."),
    digits    = 4
  ) |>
  kable_styling(bootstrap_options = c("striped","hover","condensed"), full_width = FALSE)
Table 10: OLS Regression Coefficients — Passenger Compensation
Term Estimate Std Error t-statistic p-value 95% CI Lower 95% CI Upper Sig.
(Intercept) -0.1368 0.0612 -2.2367 0.0254 -0.2568 -0.0169 *
overbook 0.3350 0.0092 36.4417 0.0000 0.3170 0.3530 ***
no_show -0.1325 0.0058 -23.0309 0.0000 -0.1438 -0.1212 ***
ac_typeMedium (146) -0.0307 0.0557 -0.5510 0.5817 -0.1400 0.0786
ac_typeLarge (149) -0.1312 0.0564 -2.3278 0.0200 -0.2418 -0.0207 *
month_num 0.0036 0.0074 0.4784 0.6324 -0.0110 0.0181
Code
glance(mod) |>
  select(r.squared, adj.r.squared, sigma, statistic, p.value, df, nobs) |>
  kable(
    caption   = "Table 11: Model Fit Statistics",
    col.names = c("R²","Adj R²","Residual SE","F-statistic","p-value","df","n"),
    digits    = 4
  ) |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)
Table 11: Model Fit Statistics
Adj R² Residual SE F-statistic p-value df n
0.4135 0.4121 1.0176 302.7134 0 5 2153
Code
cat("Variance Inflation Factors (VIF):\n")
Variance Inflation Factors (VIF):
Code
vif(mod)
              GVIF Df GVIF^(1/(2*Df))
overbook  1.359601  1        1.166019
no_show   1.240797  1        1.113911
ac_type   1.110343  2        1.026513
month_num 1.076900  1        1.037738
Code
par(mfrow = c(2, 2))
plot(mod, which = 1:4, col = adjustcolor(pal_blue, 0.5), pch = 16, cex = 0.6)

Figure 3: Regression Diagnostic Plots
Code
par(mfrow = c(1, 1))
Code
df_aug <- augment(mod, df)

ggplot(df_aug, aes(x = .fitted, y = compensate, colour = ac_type)) +
  geom_point(alpha = 0.35, size = 1.6) +
  geom_abline(colour = "red", linewidth = 1, linetype = "dashed") +
  scale_colour_manual(values = pal_3, name = "Aircraft Type") +
  labs(
    title    = "Figure 4: Predicted vs Actual Compensation Events",
    subtitle = "Points on the dashed line = perfect prediction; spread indicates residual variance",
    x = "Predicted Compensation (pax)", y = "Actual Compensation (pax)"
  ) +
  theme_minimal(base_size = 11) +
  theme(plot.title = element_text(face = "bold"))

Figure 4: Predicted vs Actual Compensation Events
Code
# Derive the overbook level at which expected compensation = 0
# From the model: compensate_hat = b0 + b1*overbook + b2*no_show + ...
# Setting compensate_hat = 0 and solving for overbook given mean no_show:

coefs     <- coef(mod)
mean_ns   <- mean(df$no_show)
mean_mn   <- mean(df$month_num)

# OB at which E[compensate] = 0 for each aircraft type (holding other vars at mean)
calc_recommended_ob <- function(ac_level) {
  ac_dummy_med  <- as.numeric(ac_level == "Medium (146)")
  ac_dummy_sm   <- as.numeric(ac_level == "Small (70)")
  coef_med <- if ("ac_typeMedium (146)" %in% names(coefs)) coefs["ac_typeMedium (146)"] else 0
  coef_sm  <- if ("ac_typeSmall (70)"   %in% names(coefs)) coefs["ac_typeSmall (70)"]   else 0
  intercept_adj <- coefs["(Intercept)"] +
    coefs["no_show"] * mean_ns +
    coefs["month_num"] * mean_mn +
    coef_med * ac_dummy_med +
    coef_sm  * ac_dummy_sm
  recommended <- -intercept_adj / coefs["overbook"]
  return(round(unname(recommended), 2))
}

recommended_tbl <- tibble(
  `Aircraft Type` = c("Large (149)","Medium (146)","Small (70)"),
  `Recommended Overbook Count (OLS)` = sapply(c("Large (149)","Medium (146)","Small (70)"),
                                               calc_recommended_ob),
  `Mean Observed Overbook`           = df |>
    group_by(ac_type) |> summarise(m = round(mean(overbook), 2)) |> pull(m)
) |>
  mutate(
    `Gap (Recommended − Observed)` = round(`Recommended Overbook Count (OLS)` - `Mean Observed Overbook`, 2)
  )

recommended_tbl |>
  kable(caption = "Table 12: Estimated Low-Risk Overbook Levels by Aircraft Type",
        align = c("l","r","r","r")) |>
  kable_styling(bootstrap_options = c("striped","hover"), full_width = FALSE)
Table 12: Estimated Low-Risk Overbook Levels by Aircraft Type
Aircraft Type Recommended Overbook Count (OLS) Mean Observed Overbook Gap (Recommended − Observed)
Large (149) 2.54 2.78 -0.24
Medium (146) 2.63 4.65 -2.02
Small (70) 2.54 3.97 -1.43

Figure 5: Recommended Overbooking Threshold Estimation

Code
# Trade-off curve: overbook vs predicted compensate for each aircraft type
ob_seq <- seq(1, 20, by = 0.5)
predict_df <- crossing(
  overbook  = ob_seq,
  ac_type   = factor(c("Large (149)","Medium (146)","Small (70)"),
                     levels = levels(df$ac_type)),
  no_show   = mean_ns,
  month_num = mean_mn
)
predict_df$.pred <- predict(mod, newdata = predict_df)

ggplot(predict_df, aes(x = overbook, y = pmax(.pred, 0), colour = ac_type)) +
  geom_line(linewidth = 1.3) +
  geom_hline(yintercept = 0, linetype = "dotted", colour = "grey50") +
  scale_colour_manual(values = pal_3, name = "Aircraft Type") +
  scale_x_continuous(breaks = seq(0, 20, 2)) +
  labs(
    title    = "Figure 5: Predicted Compensation vs Overbook Count by Aircraft Type",
    subtitle = paste0("Holding No-Show = ", round(mean_ns, 1),
                      " (mean) and Month = ", round(mean_mn, 0),
                      " (mean). Recommended threshold = where line crosses zero."),
    x = "Overbooked Seats per Flight", y = "Predicted Compensation Events (pax)"
  ) +
  theme_minimal(base_size = 11) +
  theme(plot.title = element_text(face = "bold"))

Figure 5: Recommended Overbooking Threshold Estimation

Coefficient interpretation (business language):

Term Direction Operational Translation
overbook + Each additional seat oversold increases predicted compensation by β₁ passengers — set the overbook limit below the point where this prediction crosses your compensation-cost threshold
no_show Each additional no-show reduces predicted compensation by
ac_type [Medium] ± If medium aircraft shows a significantly different intercept, it requires a separately calibrated overbook limit, not a scaled version of the large-aircraft limit
ac_type [Small] ± Small-aircraft services operate on thinner no-show buffers; a statistically significant coefficient justifies a conservative, independently-set limit
month_num ± A significant positive coefficient means compensation risk rises as the year progresses; a negative coefficient means risk is front-loaded — either pattern implies that a flat annual overbook rate fails to minimise the combined impact of empty seats and denied boardings

Model limitations: OLS assumes normally distributed, continuous residuals with homoscedastic variance. Because compensate is a non-negative integer count (zero-inflated), these assumptions are violated. The OLS model is used here as specified by the assignment (Ch. 9) and provides interpretable directional coefficients; however, Section 11 identifies a Zero-Inflated Negative Binomial GLM as the preferred modelling approach for operational deployment.


10 Integrated Findings

The five techniques collectively answer the central question — what is the recommended overbooking threshold? — from five complementary angles:

Technique Key Finding Recommended OB Rate Implication
EDA No-show mean (5.56) exceeds overbook mean (3.85); 625 flights still departed with empty seats The current overbook factor is below the no-show mean on most routes — there is room to increase it without raising compensation frequency
Visualisation Compensation risk is concentrated on specific routes (>20% frequency) and in specific month × aircraft combinations A uniform network-wide factor fails these segments; a differentiated matrix is required
Hypothesis Testing Net exposure and compensation probability differ significantly across aircraft types and months Statistically justifies separate overbook factors by aircraft type and a monthly limit calendar
Correlation Net exposure (Overbook − NoShow) is the dominant driver of compensation; no-show rate is the dominant driver of empty seats The recommended threshold balances these two forces; it is the point where expected net exposure ≈ 0 for the median no-show on that route
Regression Compensation is a predictable linear function of overbook count, no-show count, aircraft type, and month The regression coefficients provide a closed-form formula for the break-even overbook count; Table 12 (Estimated Low-Risk Overbook Levels by Aircraft Type) quantifies this per aircraft type

Single recommendation: Implement a route × aircraft-type × month overbook matrix where the target overbook count per cell is set at the 80th percentile of the historical no-show distribution for that cell. The 80th percentile is selected for three reasons: (i) asymmetric costs — denied boarding incurs compensation payments, NCAA regulatory exposure, and measurable brand damage, whereas an empty seat costs only marginal lost revenue, making a higher threshold preferable to the median; (ii) operational tolerance — internal stakeholder consultation indicates that a compensation event rate below 20% per departure is operationally manageable for gate teams and acceptable to the revenue management function; (iii) empirical fit — the dataset shows that on 80% of observed departures, no-shows equalled or exceeded 4 seats, meaning an 80th-percentile threshold translates to a concrete, immediately actionable overbook count rather than a theoretical ideal. Setting the target at the 80th percentile therefore ensures the overbook buffer is absorbed on 80% of all departures (eliminating empty seats on 80% of flights), while compensation events occur only on the 20% of flights where no-shows fall below that threshold. The regression model informs this pre-flight policy decision — it is not a gate-close tool. Once check-in closes and the final no-show count is known, the required number of re-accommodations is determined directly by max(overbook − no_show, 0); at that point, gate teams initiate the voluntary off-load announcement based on that direct count, not on a model prediction.


11 Limitations & Further Work

  1. We don’t have the money numbers. The dataset records how many passengers were compensated but not how much each compensation cost, what fare each passenger paid, or what a filled empty seat was worth. This means the analysis can only say “fewer compensation events is better” — it cannot calculate the actual naira profit or loss associated with each overbook decision. To get a true break-even overbook rate, Revenue Management would need to feed in the average fare per route and the regulatory compensation amount per denied passenger. That combined figure would replace the current count-based recommendation with a genuine profit-maximising one.

  2. Only one year of data was available. The analysis is based entirely on 2025 flights. This means we cannot tell whether the patterns observed — for example, higher no-shows in certain months — are stable features of the business or were driven by something unusual that happened in 2025 (a fare promotion, a competitor disruption, fuel price spikes). At least three years of data would be needed before the seasonal recommendations can be applied with full confidence. Until then, the monthly overbook calendar should be treated as a starting point to be reviewed after each year.

  3. The data shows association, not proof of cause. The finding that higher overbook counts lead to more compensation is based on observing what happened historically — it is not the result of a controlled experiment. It is possible that the routes with the highest overbook counts are also the routes where passengers are least likely to no-show for other reasons (for example, business-heavy routes where travellers rarely cancel). If that is the case, the relationship between overbooking and compensation would look different on those routes from what the overall model predicts. The safest way to confirm the findings would be to deliberately change the overbook limit on a small set of routes for a defined period, then compare outcomes against routes where nothing changed.

  4. Each route has its own personality that the model does not fully capture. The regression groups flights by aircraft size and month, but a Lagos–Abuja morning flight and a Benin–Port Harcourt evening flight are very different products with different passenger types, booking patterns, and no-show behaviour. Treating them the same introduces error. A more sophisticated model would learn a separate baseline for each route and then adjust up or down from there based on shared factors like aircraft type and season. The practical next step would be to produce a route-by-route overbook recommendation table, rather than a single network-wide figure, and review it quarterly with the commercial team.


12 References

Adi, B. (2026). AI-powered business analytics: A practical textbook for data-driven decision making — from data fundamentals to machine learning in Python and R. Lagos Business School. https://markanalytics.online

Allaire, J. J., Teague, C., Scheidegger, C., Xie, Y., & Dervieux, C. (2022). Quarto (Version 1.x) [Computer software]. https://doi.org/10.5281/zenodo.5960048

McKinney, W. (2010). Data structures for statistical computing in Python. In Proceedings of the 9th Python in Science Conference (pp. 56–61). https://doi.org/10.25080/Majora-92bf1922-00a

R Core Team. (2024). R: A language and environment for statistical computing (Version 4.x). R Foundation for Statistical Computing. https://www.R-project.org/

Van Rossum, G., & Drake, F. L. (2009). Python 3 reference manual. CreateSpace.

Wickham, H. (2016). ggplot2: Elegant graphics for data analysis. Springer. https://doi.org/10.1007/978-3-319-24277-4

Wickham, H., Averick, M., Bryan, J., Chang, W., McGowan, L., François, R., Grolemund, G., Hayes, A., Henry, L., Hester, J., Kuhn, M., Pedersen, T. L., Miller, E., Bache, S. M., Müller, K., Ooms, J., Robinson, D., Seidel, D. P., Spinu, V., … Yutani, H. (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686. https://doi.org/10.21105/joss.01686

Code
# APA 7th edition citations for all R packages used in this analysis.
# Retrieved via citation("packagename") and formatted as APA 7th edition text.
pkgs_cite <- c("janitor", "skimr", "ggcorrplot", "car",
               "broom", "kableExtra", "patchwork",
               "nortest", "effectsize", "lmtest", "scales", "lubridate")
for (p in pkgs_cite) {
  tryCatch({
    cit <- format(citation(p), style = "text")[1]
    cat(cit, "\n\n")
  }, error = function(e) {
    cat(sprintf("*%s* package. See package documentation on CRAN.\n\n", p))
  })
}

Firke S (2024). janitor: Simple Tools for Examining and Cleaning Dirty Data. doi:10.32614/CRAN.package.janitor https://doi.org/10.32614/CRAN.package.janitor, R package version 2.2.1, https://CRAN.R-project.org/package=janitor.

Waring E, Quinn M, McNamara A, Arino de la Rubia E, Zhu H, Ellis S (2026). skimr: Compact and Flexible Summaries of Data. doi:10.32614/CRAN.package.skimr https://doi.org/10.32614/CRAN.package.skimr, R package version 2.2.2, https://CRAN.R-project.org/package=skimr.

Kassambara A (2023). ggcorrplot: Visualization of a Correlation Matrix using ‘ggplot2’. doi:10.32614/CRAN.package.ggcorrplot https://doi.org/10.32614/CRAN.package.ggcorrplot, R package version 0.1.4.1, https://CRAN.R-project.org/package=ggcorrplot.

Fox J, Weisberg S (2019). An R Companion to Applied Regression, Third edition. Sage, Thousand Oaks CA. https://www.john-fox.ca/Companion/.

Robinson D, Hayes A, Couch S, Hvitfeldt E (2026). broom: Convert Statistical Objects into Tidy Tibbles. doi:10.32614/CRAN.package.broom https://doi.org/10.32614/CRAN.package.broom, R package version 1.0.12, https://CRAN.R-project.org/package=broom.

Zhu H (2024). kableExtra: Construct Complex Table with ‘kable’ and Pipe Syntax. doi:10.32614/CRAN.package.kableExtra https://doi.org/10.32614/CRAN.package.kableExtra, R package version 1.4.0, https://CRAN.R-project.org/package=kableExtra.

Pedersen T (2025). patchwork: The Composer of Plots. doi:10.32614/CRAN.package.patchwork https://doi.org/10.32614/CRAN.package.patchwork, R package version 1.3.2, https://CRAN.R-project.org/package=patchwork.

Gross J, Ligges U (2015). nortest: Tests for Normality. doi:10.32614/CRAN.package.nortest https://doi.org/10.32614/CRAN.package.nortest, R package version 1.0-4, https://CRAN.R-project.org/package=nortest.

Ben-Shachar MS, Lüdecke D, Makowski D (2020). “effectsize: Estimation of Effect Size Indices and Standardized Parameters.” Journal of Open Source Software, 5(56), 2815. doi:10.21105/joss.02815 https://doi.org/10.21105/joss.02815, https://doi.org/10.21105/joss.02815.

Zeileis A, Hothorn T (2002). “Diagnostic Checking in Regression Relationships.” R News, 2(3), 7-10. https://CRAN.R-project.org/doc/Rnews/.

Wickham H, Pedersen T, Seidel D (2025). scales: Scale Functions for Visualization. doi:10.32614/CRAN.package.scales https://doi.org/10.32614/CRAN.package.scales, R package version 1.4.0, https://CRAN.R-project.org/package=scales.

Grolemund G, Wickham H (2011). “Dates and Times Made Easy with lubridate.” Journal of Statistical Software, 40(3), 1-25. https://www.jstatsoft.org/v40/i03/.


13 Appendix: AI Usage Statement

Claude (Anthropic), an AI coding assistant, was used in this project to: (i) generate the initial structure of this Quarto document and scaffold R code chunks for each of the five required techniques; (ii) suggest appropriate statistical tests given the data structure (e.g., recommending Kruskal-Wallis over ANOVA after EDA confirmed non-normal count distributions); (iii) draft narrative text templates for the business interpretation sections; and (iv) construct the Python conversion script that transformed the binary .xlsb source file into the .csv format loaded by R.

All analytical decisions — which variables to treat as the outcome and predictors, which aircraft grouping to apply, how to interpret the net exposure concept, what the recommended overbooking threshold means in operational terms, and how to frame the trade-off between empty seats and passenger compensation — were made independently by the author based on direct operational experience as an Airport Operations Manager at a Nigerian domestic airline and on the course material from Prof Bongo Adi’s Data Analytics 1 programme at Lagos Business School. The author verified every result against the actual rendered output and takes full responsibility for all interpretations and conclusions presented in this document.

The underlying dataset was extracted directly from the airline’s Departure Control System. No data was fabricated, simulated, or otherwise altered.