Overbooking Analytics: Exploratory and Inferential Analysis of a Nigerian Domestic Airline’s Flight Operations (2025)
Author
Ogechi Agu-Uzochukwu
Published
May 18, 2026
1 Executive Summary
A Nigerian domestic airline practises deliberate overbooking — selling more seats than aircraft capacity — to offset revenue losses from passenger no-shows. The core trade-off is precise: flying with empty seats costs the airline lost fare revenue; denying a passenger boarding costs compensation, regulatory liability, and brand damage. Neither extreme is acceptable. The recommended overbooking threshold sits at the point where no-show absorption and overbook exposure are in balance.
This study analyses 2,153 domestic flight departures across the full calendar year 2025, covering 16 routes between seven airports: Abuja (ABV), Lagos (LOS), Port Harcourt (PHC), Benin (BNI), Jos (JOS), Asaba (ABB), and Warri (QRW). Three aircraft types are operated — 70-seat, 146-seat, and 149-seat — each with distinct booking dynamics. Five analytical techniques (EDA, Visualisation, Hypothesis Testing, Correlation, and Linear Regression) are applied to determine the structural drivers of overbooking efficiency and passenger compensation.
Key findings:(i) the airline currently overbooks at a mean rate of 3.85 seats per flight, but no-shows average 5.56, meaning the overbook buffer is consistently absorbed on most flights; (ii) on 294 flights (13.7% of all departures), overbooking exceeded no-shows, triggering passenger compensation; (iii) 625 flights (29%) still carried empty seats despite overbooking, indicating the current overbook factor is below the no-show mean on many routes; (iv) aircraft type and month are significantly associated with compensation event frequency, based on non-parametric and chi-squared hypothesis testing.
Recommendation: The airline should adopt a route × aircraft-type × month overbook rate matrix where the target overbook count is set at the 80th percentile of the historical no-show distribution for that cell. The 80th percentile is chosen because it represents a deliberate asymmetry: the cost of one denied boarding (compensation, regulatory liability, and reputational damage) materially exceeds the cost of one empty seat (foregone marginal revenue only), so the threshold is set above the median to absorb no-shows on most departures while accepting a controlled residual risk of 20% rather than 50%. This eliminates empty-seat losses on 80% of departures while accepting a controlled compensation event rate of approximately 20%.
2 Professional Disclosure
Role: Airport Operations Manager at a Nigerian domestic airline — responsible for station operations, ensuring safe and on-time flight departures, customer service efficiency, and passenger satisfaction. The role controls station operational costs, manages flight delays, cancellations, and operational disruptions, resolves customer complaints, and oversees the ground operations team.
Organisation: A Nigerian domestic airline operating scheduled passenger services across seven airports in the domestic aviation sector. The organisation’s core business is time-critical passenger movement on high-frequency short-haul routes, where seat yield, on-time performance, and passenger experience are the primary operational and commercial metrics.
Technique relevance to daily work:
#
Technique
Why it matters in this role
1
EDA
I review daily departure data to identify patterns in overbooking, no-shows, and compensation events. EDA surfaces distributional anomalies — e.g., atypically high overbook counts on specific routes — that would be buried in weekly summary reports
2
Visualisation
Station performance dashboards are presented to the Head of Ground Operations every Monday. Clear visualisations of the overbook-vs-no-show balance and compensation frequency translate operational data into actions that non-technical managers can implement immediately
3
Hypothesis Testing
When compensation incidents spike on a route, I need statistical evidence to determine whether the pattern is a genuine policy failure or random variation before escalating a formal policy-change request to Revenue Management
4
Correlation
When compensation events spike, I need to identify which variable is the dominant driver before escalating. If overbook count drives compensation, the case goes to Revenue Management to revise the booking limit. If no-show variability drives it, the case goes to Revenue Management’s demand forecasting function to recalibrate the no-show model. Correlation analysis tells me which argument to make and to whom
5
Linear Regression
Every compensation event I manage at the gate generates costs I am accountable for: passenger re-accommodation, delay, complaints, and regulatory exposure. When I present a policy-change request to Revenue Management, anecdote is insufficient — I need a quantified argument. A regression model that relates compensation frequency to the overbook rate, aircraft type, and month gives me that argument: the coefficients show exactly how many additional compensation events each extra oversold seat produces, by aircraft type and by season. Table 12 and Figure 5 express this as a concrete overbook-rate matrix that Revenue Management can act on directly
3 Data Collection & Sampling
Source: Airline departure control system (DCS).
Collection method: Extracted DCS data for the period and saved it in Excel format.
Sampling frame: All domestic scheduled departures in calendar year 2025, across seven airports and 16 origin–destination routes.
# Source file: Overbook_Jay_new.xlsb extracted from DCS and saved as Excel,# then converted to CSV using pyxlsb before loading into Rdf_raw <-read_csv("Overbook_Jay_new.csv",col_types =cols(`Flt No`=col_double(),`Dep Port`=col_character(),`Arr Port`=col_character(),`Route`=col_character(),`Sch Dep Dt`=col_date(format ="%Y-%m-%d"),`Months`=col_double(),`Capacity`=col_double(),`Seat Sold`=col_double(),`Overbook`=col_double(),`Flown`=col_double(),`NoShow`=col_double(),`Compensate`=col_double() ))
Ethical note: The dataset contains no personally identifiable information (PII). All records are flight-level aggregates — no passenger names, booking references, or employee IDs are present. Route-level performance statistics are standard internal reporting at the airline and are published here solely for academic analysis under Lagos Business School supervision.
Key observations: The overbook distribution is right-skewed (most flights oversell 2–5 seats; a small number oversell 20+). Compensate (denied boarding with compensation) is heavily zero-inflated — 1,859 of 2,153 flights (86.3%) had zero compensation events. No-show counts are right-skewed with a mode around 4–6, a mean of 5.56, and a long right tail extending to 20 — consistent with count data exhibiting overdispersion rather than a normal distribution. The fleet splits across three aircraft types: Small 70-seat (682 flights), Medium 146-seat (802 flights), and Large 149-seat (669 flights).
Business justification: EDA is the mandatory first step before any overbooking policy recommendation. It reveals whether the data is clean, identifies operational anomalies, and determines whether the overbook rate is systematically above or below the no-show buffer across different route and aircraft segments. At a Nigerian domestic airline, a single high-overbook departure (e.g., a mis-priced group block) can distort network-level averages if not identified and understood before building a predictive model.
print("=== Flights with empty seats:", (df_py["empty_seats"] >0).sum())
=== Flights with empty seats: 625
EDA findings:
Data quality confirmed — no missing values. All 2,153 rows are complete across all 12 source variables. Because the dataset contains the full population of departures within the study period, traditional sampling-selection concerns are reduced.
The absorption rule holds perfectly. Compensation events occur if and only if Overbook > NoShow (i.e., net_exposure > 0). This is not an accounting artefact — it reflects the airline’s confirmed policy: every seat sold beyond no-show absorption results in a denied boarding with compensation. The rule holds in 100% of rows.
Revenue leakage from empty seats (625 flights, 29%). Despite overbooking on every departure, 625 flights still landed with empty seats — meaning the current overbook factor was insufficient to compensate for the actual no-show volume on those days. This is the underperformance side of the trade-off and a direct revenue loss signal.
Outlier overbook flights exist but are operationally explainable. Flights where overbook > 12 (IQR upper fence) represent charter-augmented or group-movement departures. They are retained as genuine observations but noted in regression diagnostics.
6 Technique 2 — Data Visualisation
Book reference: Chapter 5 — Grammar of graphics, chart selection, storytelling with data.
Business justification: Revenue Management and Operations leadership are not data scientists. A well-chosen visualisation removes ambiguity from a statistical summary and makes seasonal and route-level patterns immediately actionable. The five plots below tell one coherent story: where, when, and on which aircraft the balance between overbooking and no-show absorption breaks down — and what that means for the compensation budget.
p_v2 <- df |>group_by(route) |>summarise(Comp_Rate =mean(compensate >0) *100,Avg_Overbook =mean(overbook),Avg_NoShow =mean(no_show),n =n(),.groups ="drop" ) |>arrange(desc(Comp_Rate)) |>ggplot(aes(x =reorder(route, Comp_Rate), y = Comp_Rate, fill = Avg_Overbook)) +geom_col(show.legend =TRUE) +geom_text(aes(label =paste0(round(Comp_Rate, 1), "%")),hjust =-0.15, size =3.2, colour ="grey30") +coord_flip() +scale_fill_gradient(low ="#D6E8F5", high = pal_blue, name ="Avg\nOverbook") +scale_y_continuous(limits =c(0, 35), labels =function(x) paste0(x, "%")) +labs(title ="Plot 2: Compensation Rate by Route (shaded by average overbook count)",subtitle ="Some routes exceed 20% compensation frequency — a policy redesign priority",x ="Route", y ="% of Flights with Compensation Event" ) +theme_minimal(base_size =11) +theme(plot.title =element_text(face ="bold"))p_v2
Plot 2: Compensation Rate by Route
Code
p_v3 <-ggplot(df, aes(x = no_show, y = overbook, colour = denied_flag, shape = ac_type)) +geom_jitter(alpha =0.40, size =1.8, width =0.25, height =0.25) +geom_abline(intercept =0, slope =1, linetype ="dashed",colour ="grey30", linewidth =0.9) +scale_colour_manual(values =c("Compensated"= pal_orange, "No Compensation"= pal_mid),name ="Outcome") +scale_shape_manual(values =c("Large (149)"=16, "Medium (146)"=17, "Small (70)"=15),name ="Aircraft") +annotate("text", x =15, y =3, label ="← No-shows absorb overbook\n(empty seats possible)",size =3, colour = pal_mid, hjust =0) +annotate("text", x =1, y =20, label ="Overbook > No-shows\n→ Compensation →",size =3, colour = pal_orange, hjust =0) +labs(title ="Plot 3: The Overbooking Trade-Off — No-Shows vs Overbooked Seats",subtitle ="Points above the diagonal → overbook exceeds no-shows → passenger compensation required",x ="No-Shows per Departure", y ="Overbooked Seats per Departure" ) +theme_minimal(base_size =11) +theme(plot.title =element_text(face ="bold"))p_v3
Plot 3: Overbook vs No-Show trade-off by aircraft type
Code
p_v4 <-ggplot(df, aes(x = ac_type, y = ob_rate *100, fill = ac_type)) +geom_violin(trim =FALSE, alpha =0.65) +geom_boxplot(width =0.10, fill ="white", outlier.colour = pal_orange, outlier.size =1.5) +scale_fill_manual(values = pal_3) +scale_y_continuous(labels =function(x) paste0(x, "%")) +labs(title ="Plot 4: Overbooking Rate (%) Distribution by Aircraft Type",subtitle ="Small aircraft are booked at a higher percentage rate than large aircraft",x ="Aircraft Type", y ="Overbooking Rate (% of Capacity)" ) +theme_minimal(base_size =11) +theme(legend.position ="none", plot.title =element_text(face ="bold"))p_v4
Plot 4: Overbook rate distribution by aircraft type
Code
p_v5 <- df |>group_by(month_lbl, ac_type) |>summarise(Avg_NetExposure =mean(net_exposure),Comp_Rate =mean(compensate >0) *100,.groups ="drop" ) |>ggplot(aes(x = month_lbl, y = ac_type, fill = Avg_NetExposure)) +geom_tile(colour ="white", linewidth =0.6) +geom_text(aes(label =round(Avg_NetExposure, 1)), size =3.2, fontface ="bold",colour =ifelse(TRUE, "white", "black")) +scale_fill_gradient2(low = pal_green, mid ="white", high = pal_orange,midpoint =0, name ="Avg Net\nExposure") +labs(title ="Plot 5: Average Net Exposure (Overbook − No-Show) by Month & Aircraft Type",subtitle ="Orange = more overbook than no-shows (denial risk); Green = no-shows exceed overbook (empty seats)",x ="Month", y ="Aircraft Type" ) +theme_minimal(base_size =11) +theme(plot.title =element_text(face ="bold"))p_v5
Plot 5: Net Exposure Heatmap — Month x Aircraft Type
Visualisation narrative: The five charts collectively map the full overbooking trade-off. Plot 1 shows that monthly no-show volumes consistently exceed average overbook counts — which explains the 625 empty-seat departures — but with month-specific variation. Plot 2 identifies specific routes where the compensation rate exceeds 20%, making them priority targets for a revised overbook policy. Plot 3 is operationally the most critical: it maps the exact threshold where no-shows fail to absorb the overbook, and where compensation becomes inevitable. Plot 4 shows that small aircraft are booked at a proportionally higher rate, despite their narrower absolute buffer. Plot 5 provides the month × aircraft matrix that a policy redesign would directly use.
Chart type justification: The line chart (Plot 1) was preferred over a bar chart to emphasise temporal trends and the convergence/divergence between the two series. The jitter scatter with diagonal reference (Plot 3) was preferred over a histogram because it simultaneously shows the overbook level, the no-show level, and the compensation outcome for every flight. The diverging heatmap (Plot 5) was preferred over small-multiple bars because it encodes both the direction and magnitude of net exposure across two categorical dimensions simultaneously.
Business justification: Observed differences in compensation rates across aircraft types and months could be due to chance variation in no-show behaviour rather than systemic overbooking policy differences. Hypothesis testing provides defensible statistical evidence to justify a formal policy change — specifically, the adoption of separate overbook factors by aircraft type and separate monthly overbook limits — before taking those recommendations to the Revenue Management team.
7.1 Hypothesis 1 — Does the net exposure (Overbook − NoShow) differ significantly across aircraft types?
\[H_0: \tilde{\mu}_{Small} = \tilde{\mu}_{Medium} = \tilde{\mu}_{Large} \quad \text{(median net exposure is equal across aircraft types)}\]\[H_1: \text{At least one aircraft type has a different median net exposure}\]
Code
# Step 1: Check normality per group (Anderson-Darling, robust for large n)for (grp inlevels(df$ac_type)) { x <- df |>filter(ac_type == grp) |>pull(net_exposure)cat(grp, "— Anderson-Darling:\n")print(ad.test(x))cat("\n")}
Small (70) — Anderson-Darling:
Anderson-Darling normality test
data: x
A = 90.484, p-value < 2.2e-16
Medium (146) — Anderson-Darling:
Anderson-Darling normality test
data: x
A = 76.577, p-value < 2.2e-16
Large (149) — Anderson-Darling:
Anderson-Darling normality test
data: x
A = 73.696, p-value < 2.2e-16
Code
# Step 2: Non-normal data → Kruskal-Wallis (non-parametric ANOVA)kw1 <-kruskal.test(net_exposure ~ ac_type, data = df)print(kw1)
Kruskal-Wallis rank sum test
data: net_exposure by ac_type
Kruskal-Wallis chi-squared = 15.776, df = 2, p-value = 0.0003752
Table 6: Pairwise Aircraft Type Comparisons — Net Exposure
Aircraft A
Aircraft B
Bonferroni-adjusted p
Medium (146)
Small (70)
0.0007
Large (149)
Small (70)
1.0000
Large (149)
Medium (146)
0.0069
Code
# Step 4: Effect size (epsilon-squared for Kruskal-Wallis)effectsize::rank_epsilon_squared(net_exposure ~ ac_type, data = df)
Epsilon2 (rank) | 95% CI
------------------------------
7.33e-03 | [0.00, 1.00]
- One-sided CIs: upper bound fixed at [1.00].
Code
# Summary statistics per aircraft typedf |>group_by(ac_type) |>summarise(n =n(),Median_NetExp =round(median(net_exposure), 2),Mean_NetExp =round(mean(net_exposure), 2),Pct_Comp =paste0(round(mean(compensate >0)*100, 1), "%") ) |>kable(caption ="Table 7: Net Exposure Summary by Aircraft Type",col.names =c("Aircraft Type","n","Median Net Exposure","Mean Net Exposure","% Flights with Compensation")) |>kable_styling(bootstrap_options ="striped", full_width =FALSE)
Table 7: Net Exposure Summary by Aircraft Type
Aircraft Type
n
Median Net Exposure
Mean Net Exposure
% Flights with Compensation
Small (70)
682
0
-1.94
9.5%
Medium (146)
802
0
-1.30
17.8%
Large (149)
669
0
-1.97
12.9%
Interpretation: A statistically significant Kruskal-Wallis result (p < 0.05) confirms that net exposure — and therefore compensation risk — is not uniform across aircraft types. The post-hoc pairwise tests reveal which specific pairs differ. Business implication: Applying a single network-wide overbook count across all three aircraft types is statistically unjustified. Each type should carry its own overbook limit calibrated to its historical no-show distribution. In particular, if small-aircraft routes show a significantly different net exposure profile, the policy applied to 70-seat services must be independently derived.
7.2 Hypothesis 2 — Is the compensation rate (% of flights with Compensate > 0) independent of month?
\[H_0: \text{The probability of a compensation event is equal across all months}\]\[H_1: \text{The probability of a compensation event differs by month}\]
Cramer's V (adj.) | 95% CI
--------------------------------
0.10 | [0.00, 1.00]
- One-sided CIs: upper bound fixed at [1.00].
Interpretation: A chi-squared test on the contingency table tests whether month and the occurrence of a compensation event are independent. If p < 0.05, the month in which a flight operates significantly predicts whether a compensation event is likely. Business implication: A significant result supports the adoption of a monthly overbook calendar — the Revenue Management team should increase overbook limits in months where compensation events are rare (no-show rates are high) and reduce limits in months where compensation events cluster (no-show rates are low relative to the current overbook factor).
8 Technique 4 — Correlation Analysis
Book reference: Chapter 8 — Pearson, Spearman, Kendall; partial correlation; correlation vs. causation.
Business justification: Before building a predictive model, I need to determine whether it is the overbook rate or the no-show rate that is the stronger driver of compensation — and whether the two are so correlated that including both in a regression would inflate standard errors through collinearity. At station level, this distinction matters practically: if overbook count is the dominant driver, the policy lever is Revenue Management’s booking-limit decision; if no-show count dominates, the lever is Revenue Management’s no-show forecasting model and the ticketing policy (e.g., stricter cancellation fees that reduce unannounced no-shows) — neither of which is controlled at station level. Correlation analysis answers this question quantitatively.
Figure 2: Correlation Heatmap with Significance Masking
Discussion of the 3 strongest correlations:
Net Exposure ↔︎ Compensate (strongest positive).Net Exposure = Overbook − NoShow is by construction the immediate trigger for compensation. A strong positive correlation is expected and confirmed. Business implication: Managing net exposure — either by reducing the overbook count or by accurately forecasting no-show volumes — is the single most effective lever for reducing compensation events.
Overbook ↔︎ Compensate (positive). Higher overbook counts are associated with more compensation events. However, this correlation is moderated by the no-show level: on flights with high no-show volumes, even large overbook counts produce no compensation. Business implication: Overbook count alone cannot be used as a compensation predictor without conditioning on the no-show buffer. This motivates the regression model in Technique 5.
No-Show ↔︎ Empty Seats (positive). More no-shows are associated with more empty seats at departure. This captures the other side of the trade-off: when no-shows exceed the overbook buffer, the aircraft departs with vacant capacity. Business implication: Routes where no-show rates are persistently high and not matched by a sufficient overbook factor are generating systematic revenue leakage. These routes should receive increased overbook limits.
Correlation ≠ causation: The positive correlation between overbook count and compensation does not imply that overbooking alone causes denial events — no-show behaviour is the mediating variable. A controlled trial (applying a zero-overbook policy on a subset of routes for one quarter) would be required to isolate the causal effect of the booking-limit decision. The correlation analysis motivates the regression model but cannot replace it.
9 Technique 5 — Linear Regression
Book reference: Chapter 9 (OLS) — coefficients, diagnostics, interpretation.
Business justification: As Airport Operations Manager, every compensation event I manage at the gate generates a cascading sequence of costs — passenger re-accommodation, ground delay, complaint records, and sometimes social media exposure. The regression model does not serve as a real-time gate tool: no-show counts are only finalised at check-in counter closure, and once they are known the exact number of passengers requiring compensation is determined directly by the absorption rule (compensate = max(overbook − no_show, 0)), making a regression prediction redundant at that point. The model’s operational value lies upstream, in the policy-setting decision: the regression coefficients quantify — for Revenue Management — the expected compensation burden generated by each additional oversold seat, controlling for aircraft type and seasonal no-show variation. The outputs (Table 12 and Figure 5) allow the overbook limit to be set in advance at the level where expected compensation cost is minimised relative to lost-revenue risk, before the flight ever opens for sale.
9.1 Model specification
The outcome is compensate (passengers requiring compensation per departure). Predictors are:
overbook — the primary policy variable
no_show — the natural absorber of overbook exposure
ac_type — aircraft type (proxy for route scale and booking-management practice)
month_num — captures seasonal no-show variation
seat_sold and flown are excluded due to algebraic collinearity with overbook and capacity.
Each additional seat oversold increases predicted compensation by β₁ passengers — set the overbook limit below the point where this prediction crosses your compensation-cost threshold
no_show
−
Each additional no-show reduces predicted compensation by
ac_type [Medium]
±
If medium aircraft shows a significantly different intercept, it requires a separately calibrated overbook limit, not a scaled version of the large-aircraft limit
ac_type [Small]
±
Small-aircraft services operate on thinner no-show buffers; a statistically significant coefficient justifies a conservative, independently-set limit
month_num
±
A significant positive coefficient means compensation risk rises as the year progresses; a negative coefficient means risk is front-loaded — either pattern implies that a flat annual overbook rate fails to minimise the combined impact of empty seats and denied boardings
Model limitations: OLS assumes normally distributed, continuous residuals with homoscedastic variance. Because compensate is a non-negative integer count (zero-inflated), these assumptions are violated. The OLS model is used here as specified by the assignment (Ch. 9) and provides interpretable directional coefficients; however, Section 11 identifies a Zero-Inflated Negative Binomial GLM as the preferred modelling approach for operational deployment.
10 Integrated Findings
The five techniques collectively answer the central question — what is the recommended overbooking threshold? — from five complementary angles:
Technique
Key Finding
Recommended OB Rate Implication
EDA
No-show mean (5.56) exceeds overbook mean (3.85); 625 flights still departed with empty seats
The current overbook factor is below the no-show mean on most routes — there is room to increase it without raising compensation frequency
Visualisation
Compensation risk is concentrated on specific routes (>20% frequency) and in specific month × aircraft combinations
A uniform network-wide factor fails these segments; a differentiated matrix is required
Hypothesis Testing
Net exposure and compensation probability differ significantly across aircraft types and months
Statistically justifies separate overbook factors by aircraft type and a monthly limit calendar
Correlation
Net exposure (Overbook − NoShow) is the dominant driver of compensation; no-show rate is the dominant driver of empty seats
The recommended threshold balances these two forces; it is the point where expected net exposure ≈ 0 for the median no-show on that route
Regression
Compensation is a predictable linear function of overbook count, no-show count, aircraft type, and month
The regression coefficients provide a closed-form formula for the break-even overbook count; Table 12 (Estimated Low-Risk Overbook Levels by Aircraft Type) quantifies this per aircraft type
Single recommendation: Implement a route × aircraft-type × month overbook matrix where the target overbook count per cell is set at the 80th percentile of the historical no-show distribution for that cell. The 80th percentile is selected for three reasons: (i)asymmetric costs — denied boarding incurs compensation payments, NCAA regulatory exposure, and measurable brand damage, whereas an empty seat costs only marginal lost revenue, making a higher threshold preferable to the median; (ii)operational tolerance — internal stakeholder consultation indicates that a compensation event rate below 20% per departure is operationally manageable for gate teams and acceptable to the revenue management function; (iii)empirical fit — the dataset shows that on 80% of observed departures, no-shows equalled or exceeded 4 seats, meaning an 80th-percentile threshold translates to a concrete, immediately actionable overbook count rather than a theoretical ideal. Setting the target at the 80th percentile therefore ensures the overbook buffer is absorbed on 80% of all departures (eliminating empty seats on 80% of flights), while compensation events occur only on the 20% of flights where no-shows fall below that threshold. The regression model informs this pre-flight policy decision — it is not a gate-close tool. Once check-in closes and the final no-show count is known, the required number of re-accommodations is determined directly by max(overbook − no_show, 0); at that point, gate teams initiate the voluntary off-load announcement based on that direct count, not on a model prediction.
11 Limitations & Further Work
OLS for zero-inflated count data. Both overbook and compensate are non-negative integers. OLS does not respect the integer lower bound, can predict negative values, and its residuals violate normality assumptions under zero-inflation. A Zero-Inflated Negative Binomial GLM — which separately models the probability of any compensation event and the count conditional on an event occurring — would materially improve out-of-sample prediction accuracy for the rare but high-impact flights where compensate > 5.
No revenue or cost data. The dataset does not contain the fare paid per passenger, the cost of compensation under Nigerian Civil Aviation Authority regulations, or the marginal cost of carrying an empty seat. The analysis minimises compensation event count, not net revenue impact. Integrating booking-system fare data and compensation cost rates would enable a full expected-profit model where the recommended overbook rate maximises E[Revenue from oversold seats] − E[Compensation cost].
Single year of data. The analysis covers January–December 2025 only. Seasonal patterns (e.g., the peak-month no-show surge) cannot be distinguished from year-specific events (fuel crises, public-holiday calendar shifts, competitor entry). Three or more years of data would enable reliable seasonal decomposition and trend identification.
No causal identification. The correlation between overbook count and compensation is observational. Routes with high overbook counts may also operate in high-demand corridors where passengers are less likely to no-show (business travellers vs. leisure), confounding the comparison. A difference-in-differences design — comparing compensation rates before and after a deliberate policy change on a subset of routes — would isolate the causal effect of the overbook limit.
Route-level heterogeneity not fully modelled. The regression uses aircraft type and month as group-level predictors, but individual routes have distinct demand profiles, no-show cultures, and booking lead times. A multilevel (mixed-effects) model with route as a random effect would better partition within-route and between-route variation, producing route-specific overbook recommendations with appropriate uncertainty intervals.
12 References
Adi, B. (2026). AI-powered business analytics: A practical textbook for data-driven decision making — from data fundamentals to machine learning in Python and R. Lagos Business School / markanalytics.online. https://markanalytics.online
Agu-Uzochukwu, O. (2025). Nigerian domestic airline flight overbooking and demand data — Overbook Report 2025 [Dataset]. Revenue Management Department, Nigerian Domestic Airline, Lagos, Nigeria. Data available on request from the author.
Allaire, J. J., Teague, C., Scheidegger, C., Xie, Y., & Dervieux, C. (2022). Quarto (Version 1.x) [Computer software]. https://doi.org/10.5281/zenodo.5960048
R Core Team. (2024). R: A language and environment for statistical computing (Version 4.x). R Foundation for Statistical Computing. https://www.R-project.org/
Van Rossum, G., & Drake, F. L. (2009). Python 3 reference manual. CreateSpace.
Wickham, H., Averick, M., Bryan, J., Chang, W., McGowan, L., François, R., Grolemund, G., Hayes, A., Henry, L., Hester, J., Kuhn, M., Pedersen, T. L., Miller, E., Bache, S. M., Müller, K., Ooms, J., Robinson, D., Seidel, D. P., Spinu, V., … Yutani, H. (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686. https://doi.org/10.21105/joss.01686
Wickham, H. (2016). ggplot2: Elegant graphics for data analysis. Springer. https://doi.org/10.1007/978-3-319-24277-4
Claude (Anthropic), an AI coding assistant, was used in this project to: (i) generate the initial structure of this Quarto document and scaffold R code chunks for each of the five required techniques; (ii) suggest appropriate statistical tests given the data structure (e.g., recommending Kruskal-Wallis over ANOVA after EDA confirmed non-normal count distributions); (iii) draft narrative text templates for the business interpretation sections; and (iv) construct the Python conversion script that transformed the binary .xlsb source file into the .csv format loaded by R.
All analytical decisions — which variables to treat as the outcome and predictors, which aircraft grouping to apply, how to interpret the net exposure concept, what the recommended overbooking threshold means in operational terms, and how to frame the trade-off between empty seats and passenger compensation — were made independently by the author based on direct operational experience as an Airport Operations Manager at a Nigerian domestic airline and on the course material from Prof Bongo Adi’s Data Analytics 1 programme at Lagos Business School. The author verified every result against the actual rendered output and takes full responsibility for all interpretations and conclusions presented in this document.
The underlying dataset was extracted directly from the airline’s Departure Control System. No data was fabricated, simulated, or otherwise altered.