Pipeline Incident Analytics: Operational Risk, Cost Exposure and Predictive Patterns at a Nigerian Upstream Operator (2019-2025)
1. Executive Summary
Sabotage accounts for 74% of pipeline incidents at Oando Energy Resources Nigeria Limited, yet the financial and operational consequences have never been systematically quantified. This submission applies five analytical techniques to 1,695 incidents recorded across 446 facilities in Nigeria’s Niger Delta between 2019 and 2025, drawn from OERNL’s internal HSE incident register and Joint Investigation Visit reports, to answer one question: what should the Project Services Department prioritise in 2026?
Key findings across the five techniques:
- Text analytics identified deliberate physical interference at flowlines and valve connections as the dominant latent narrative theme, invisible in the structured cause field
- Monte Carlo simulation quantified the P90 annual remediation exposure at $18,603,017
- Forecasting projected monthly incident volumes with a cross-validation RMSE of 15.41 incidents per month, providing the 2026 emergency response capacity baseline
- Survival analysis identified the Tebidaba/Brass and Clough Creek/Tebidaba pipelines as highest-recurrence assets, with median inter-incident times of 2 and 3 days across 130-plus intervals each
- Association rules confirmed oil pipeline incidents requiring cleanup predict sabotage with 96.4% confidence and lift of 1.305
2. Professional Disclosure
2.1 My Role and Organisation
I am a Project Services Manager in the Project Services Department (PSD), which sits within the Project Development and Infrastructure (PDI) division of Oando Energy Resources Nigeria Limited (OERNL), a Nigerian upstream oil and gas company operating across exploration, development, and production assets in the Niger Delta.
My role is cross-functional, sitting at the intersection of project delivery and operational risk management. My core responsibilities span four areas:
- Project controls and scheduling across OERNL’s active capital and infrastructure projects
- Cost management and budget reporting, including inputs to Board-level capital expenditure (CAPEX) reviews
- Operational performance reporting, covering project progress, milestone tracking, and variance analysis
- Pipeline incident data analysis, used to inform project prioritisation, maintenance scheduling, risk budgeting, and regulatory compliance reporting
The dataset underpinning this submission was compiled directly from OERNL’s internal pipeline incident registers, covering incidents recorded across our operational fields from 2019 to 2025. Analysing this data is not an academic exercise. For me it is a routine input into the decisions my department makes about which infrastructure projects to prioritise and how to size contingency budgets.
2.2 Why These Five Techniques Are Operationally Relevant
Technique 1: Text Analytics and Sentiment Analysis
Each pipeline incident record contains a free-text description field in which field engineers and operations personnel narrate the cause, sequence of events, and immediate corrective action taken. These narrative fields are currently not systematically analysed in our reporting practice. They are read selectively during incident reviews but never mined at scale.
Applying text analytics to this corpus serves three operational purposes:
- Pattern extraction: Term Frequency-Inverse Document Frequency (TF-IDF) and bag-of-words models surface the most distinctive terms associated with high-cost or high-frequency incidents
- Causal clustering: Latent Dirichlet Allocation (LDA) topic modelling groups incidents by underlying cause type, such as corrosion, third-party interference, or mechanical fatigue, in a way that manual categorisation misses
- Scope intelligence: Understanding which failure causes are trending upward changes both what capital projects we initiate and how we rank them in the project pipeline
Technique 2: Monte Carlo Simulation
Capital project cost estimation at OERNL is inherently uncertain. Procurement lead times vary, contractor mobilisation costs fluctuate with foreign exchange movements, and incident remediation costs depend on the severity and duration of the event.
In the Project Services Department, we are required to present 50th percentile (P50) and 80th percentile (P80) cost estimates to management for budgeting and Board reporting purposes. Monte Carlo simulation addresses this in three ways:
- Distribution fitting: Historical incident cost data is used to fit probability distributions to each uncertain input, replacing assumptions of normality with evidence-based parameters
- Probabilistic output: Running 10,000 simulations produces a full distribution of possible total cost outcomes, from which 10th percentile (P10), P50, and P80 values are extracted
- Contingency justification: The gap between P50 and P80 becomes the defensible contingency budget figure presented to the Board, grounded in data rather than rule-of-thumb percentages
Technique 3: Advanced Forecasting with Prophet
Production uptime planning and maintenance scheduling at OERNL require forward visibility on incident frequency and associated downtime. With seven years of monthly incident data covering 2019 to 2025, the dataset contains sufficient temporal depth to support a robust forecasting model.
Prophet is the appropriate technique here for three reasons:
- Trend and seasonality decomposition: Pipeline incidents in the Niger Delta exhibit seasonal patterns driven by weather, community activity cycles, and regulatory inspection windows, which Prophet captures natively
- Actionable horizon: A three to six month forward projection of incident counts or cumulative downtime maps directly onto the lead time required for contractor mobilisation and materials procurement
- Communicable uncertainty: Walk-forward cross-validation produces a Root Mean Square Error (RMSE) that I can present to PDI leadership as evidence of forecast reliability, replacing intuition-based planning with quantified confidence intervals
Technique 4: Asset Survival Analysis
Not all pipeline segments fail at the same rate or at the same age. Current prioritisation in our department relies heavily on engineering judgement and field reports, which introduces subjectivity into decisions that carry significant capital implications.
Kaplan-Meier survival analysis treats each pipeline segment as a subject and its first recorded failure as the event, producing empirical survival curves that quantify failure probability over time. This technique serves the following operational functions:
- Replacement queue prioritisation: Segments with survival probability below a defined threshold within the next 12 months are flagged as candidates for the capital replacement programme
- Business case support: Survival curves provide statistically defensible evidence for integrity intervention projects when presenting to Finance and the Board
- Segmentation by asset class: Stratifying curves by location, pipeline type, or operating condition reveals which asset categories carry disproportionate risk, enabling targeted rather than blanket maintenance spend
Technique 5: Association Rules (Operational Co-occurrence Analysis)
Incident records contain multiple categorical fields including location, cause category, incident type, affected asset class, and response action taken. These fields are currently reviewed in isolation. Association rule mining using the Apriori algorithm examines all combinations simultaneously, revealing which attributes co-occur more frequently than chance would predict.
The operational value of this technique is threefold:
- Co-occurrence intelligence: Rules such as “corrosion at Field X and pipeline age above 15 years implies production deferment exceeding 48 hours” with high confidence and lift provide actionable scheduling inputs that no single-variable analysis would surface
- Project scope refinement: If a high-lift rule links a specific cause type to a particular asset class, that finding directly informs the scope of the corresponding capital maintenance project
- Protocol design: Frequent itemsets linking incident types to response actions allow the department to design more targeted emergency response protocols, reducing average resolution time and cost per incident
3. Data Collection & Sampling
3.1 Data Source
The dataset for this submission was drawn from two complementary internal records maintained by the Production Health, Safety and Environment (HSE) and Field Operations department at Oando Energy Resources Nigeria Limited (OERNL):
- The operator’s HSE incident register: A structured internal log updated by field personnel following every reportable spill, leak, or pipeline breach across all operational assets. Each entry captures incident identity, location, cause, category, spill volumes, deferred production, and remediation cost where applicable
- Joint Investigation Visit (JIV) reports: Formal post-incident investigation documents completed by a joint team of operator and regulatory representatives following significant incidents. JIV reports provide the narrative descriptions used in the text analytics technique and supplement quantitative fields in the register with verified cause classifications and cost estimates
Both sources feed into a single consolidated incident management system maintained at the asset level and archived annually in structured Excel workbooks, one per calendar year.
3.2 Collection Method
The data was personally extracted and compiled by the author directly from OERNL’s internal incident management system in the capacity of Project Services Manager within the Project Development and Infrastructure (PDI) division. The extraction process involved the following steps:
- File retrieval: Seven annual Excel workbooks covering 2019 to 2025 were retrieved from the organisational records system, each containing one or more sheets corresponding to reporting periods or asset groupings within that year
-
Consolidation: All sheets across all seven files were programmatically merged in R using the
readxlandpurrrpackages, with a custom column harmonisation routine applied to reconcile the 28 variables across years where field names had changed between reporting cycles - Standardisation: Date fields were parsed using a multi-format date parser to handle inconsistencies across years. Categorical fields including cause, category, area, and closeby facility were standardised using regular expression matching to collapse variant spellings into canonical values
- Validation: The consolidated dataset was cross-checked against known annual totals and spot-checked against source files to confirm row counts and key field integrity
3.3 Sampling Frame
This study uses a census rather than a sample. The sampling frame comprises every pipeline spill and non-leak incident reported under OERNL’s HSE management system across all operational assets from January 2019 to the most recent completed reporting period in 2025. No probabilistic sampling was applied because the objective is to analyse the full population of operational incidents rather than generalise from a subset.
Key parameters of the sampling frame are summarised below:
| Parameter | Detail |
|---|---|
| Population | All pipeline incidents reported under the OERNL HSE management system |
| Sampling approach | Census — full enumeration of all reportable incidents within scope |
| Inclusion criteria | All spill and non-leak incidents with a recorded incident date from 2019 to 2025 |
| Exclusion criteria | Incidents with no recorded date, duplicate reference numbers, and test or placeholder rows removed during cleaning |
| Geographic scope | All OERNL operational assets across Bayelsa, Rivers, Delta, Imo, and Edo States |
| Temporal scope | January 2019 to December 2025 (seven complete calendar years) |
| Unit of observation | Individual pipeline incident record |
3.4 Sample Size and Statistical Rationale
The initial extraction yielded approximately 1,750 records across the seven annual workbooks. Following removal of rows with no incident date, duplicate reference numbers, and structurally empty rows introduced by the source formatting, the cleaned dataset comprises 1,695 incidents across 446 unique facilities.
This sample size comfortably satisfies the minimum requirements for all five analytical techniques applied in this submission:
| Technique | Minimum Required | Available in Dataset | Status |
|---|---|---|---|
| Text Analytics and Sentiment Analysis | 50 text documents | 1,695 incident description text fields | Met |
| Monte Carlo Simulation | 3 or more uncertain inputs with historical data | 1,695 cost and spill volume records across 7 years | Met |
| Advanced Forecasting with Prophet | 36 monthly time periods | 84 monthly periods (January 2019 to December 2025) | Met |
| Asset Survival Analysis | 100 or more observations with event variable | 1,695 facility-level incident records | Met |
| Association Rules | Sufficient transactions for rule support thresholds | 1,695 multi-attribute incident records | Met |
3.5 Time Period Covered
The dataset spans seven complete calendar years from January 2019 to December 2025. This temporal range was selected for three reasons:
- Analytical sufficiency: 84 monthly periods exceed the minimum 36 periods required by Prophet for reliable trend and seasonality decomposition, and provide sufficient historical depth for walk-forward cross-validation
- Operational relevance: The period captures the full post-2019 operating environment at OERNL, including the production disruption effects of the 2020 oil price shock and the subsequent recovery phase, both of which are visible in the incident frequency trend
- Regulatory alignment: OERNL’s HSE reporting cycle is structured annually, and the 2019 start point corresponds to the first year for which all seven reporting fields used in this analysis were consistently populated across all assets
3.6 Ethical Notes and Data Sharing Restrictions
The data used in this submission is internal operational data owned by Oando Energy Resources Nigeria Limited. The following ethical and data governance considerations apply.
- No personal data: The incident register does not contain personally identifiable information (PII) such as employee names, staff identification numbers, or individual contractor details. All personnel references in JIV narratives have been removed or replaced with role descriptors during the extraction process
-
GPS coordinates: Precise geographic coordinates (latitude and longitude) are retained in the raw dataset for analytical purposes but are kept at facility level only. No individual well-level or sub-facility coordinates are published in this document. The
lat_rawandlong_rawfields in the consolidated dataset represent facility centroids, not precise pipeline breach locations -
Commercial sensitivity: Specific cost figures (the
cost_usdfield) are available for only 51 incidents and are presented in aggregate or distributional form only. No individual project cost records are reproduced in this document - Authorisation: Data was accessed and extracted by the author in the capacity of Project Services Manager, a role with legitimate operational access to the HSE incident management system. No external consent or institutional ethics approval was required as the data is internal, anonymised at source, and used solely for academic analysis within the scope of the Lagos Business School MMBA programme
- Data availability: The full dataset is available on request from the author. A version with facility names replaced by codes is available for sharing purposes should the submission be reviewed externally
In accordance with OERNL’s internal data governance policy, facility names used in this document reflect publicly known asset names associated with OERNL’s published operational portfolio. No confidential asset identifiers, internal reference codes, or commercially sensitive operational parameters beyond those already in the public domain are disclosed.
4. Data Description
The consolidated dataset contains 1,695 pipeline incidents recorded across seven years (2019 to 2025), drawn from 446 unique facilities across five states in Nigeria’s Niger Delta region. The dataset comprises 28 variables capturing incident identity, geography, timing, cause, category, spill volumes, and remediation cost. Key numeric variables show significant right skew: spill volumes range from 0 to 5,054 barrels with a median of just 1 barrel, reflecting the dominance of small-scale theft taps alongside rare catastrophic spills. Cost data is available for only 51 incidents, predominantly from sabotage records in 2019 and 2021, a limitation acknowledged in Section 11.
Geographically, incidents are concentrated in Bayelsa and Rivers States, which together account for over 87% of all records. Sabotage is the dominant incident category at 74% of all cases, with Oil Theft and Hacksaw Cut the two most frequent causes, consistent with the chronic third-party interference challenges facing upstream operators in the Niger Delta. The tables below summarise the key dimensions of the dataset.
Code
state_data <- incidents |>
dplyr::filter(!is.na(state)) |>
dplyr::count(state, name = "incidents") |>
dplyr::mutate(
state = forcats::fct_reorder(state, incidents),
highlight = ifelse(state %in% c("Bayelsa", "Rivers"), "Top 2", "Other")
)
ggplot(state_data, aes(x = incidents, y = state, fill = highlight)) +
geom_col(width = 0.6) +
geom_text(aes(label = scales::comma(incidents)),
hjust = -0.2, size = 3.5, fontface = "bold") +
scale_fill_manual(values = c("Top 2" = "#2c7fb8", "Other" = "#a8c8e8")) +
scale_x_continuous(expand = expansion(mult = c(0, 0.15))) +
labs(
title = "Pipeline Incidents by State (2019 to 2025)",
subtitle = "Bayelsa and Rivers States account for over 87% of all recorded incidents",
x = "Number of Incidents",
y = NULL,
fill = NULL
) +
theme_minimal(base_size = 12) +
theme(
legend.position = "bottom",
plot.title = element_text(face = "bold"),
panel.grid.major.y = element_blank()
)Code
category_data <- incidents |>
dplyr::filter(!is.na(category)) |>
dplyr::count(category, name = "count") |>
dplyr::mutate(
pct = count / sum(count),
category = forcats::fct_reorder(category, count),
highlight = ifelse(category == "Sabotage", "Sabotage", "Other")
)
ggplot(category_data, aes(x = count, y = category, fill = highlight)) +
geom_col(width = 0.6) +
geom_text(aes(label = paste0(scales::comma(count),
" (", scales::percent(pct, accuracy = 1), ")")),
hjust = -0.1, size = 3.2, fontface = "bold") +
scale_fill_manual(values = c("Sabotage" = "#e05c2a", "Other" = "#a8c8e8")) +
scale_x_continuous(expand = expansion(mult = c(0, 0.25))) +
labs(
title = "Incident Category Distribution (2019 to 2025)",
subtitle = "Sabotage dominates, comprising the majority of all 1,695 recorded incidents",
x = "Number of Incidents",
y = NULL,
fill = NULL
) +
theme_minimal(base_size = 12) +
theme(
legend.position = "bottom",
plot.title = element_text(face = "bold"),
panel.grid.major.y = element_blank()
)Code
ggplot(incidents |> dplyr::filter(!is.na(qty_spilled_bbl) & qty_spilled_bbl > 0),
aes(x = qty_spilled_bbl)) +
geom_histogram(bins = 60, fill = "#2c7fb8", colour = "white", linewidth = 0.3) +
geom_vline(xintercept = 1, colour = "#e05c2a", linetype = "dashed", linewidth = 0.8) +
annotate("text", x = 80, y = Inf, vjust = 1.5,
label = "Median = 1 barrel", colour = "#e05c2a",
fontface = "italic", size = 3.5) +
scale_x_continuous(
labels = scales::comma,
trans = "log1p",
breaks = c(0, 1, 10, 100, 1000, 5000)
) +
labs(
title = "Distribution of Spill Volumes (Non-Zero Incidents)",
subtitle = "Log scale reveals extreme right skew — median of 1 barrel alongside spills exceeding 5,000 barrels",
x = "Spill Volume (Barrels, log scale)",
y = "Number of Incidents"
) +
theme_minimal(base_size = 12) +
theme(
plot.title = element_text(face = "bold"),
panel.grid.minor = element_blank()
)| Year | Incidents |
|---|---|
| 2019 | 480 |
| 2020 | 174 |
| 2021 | 287 |
| 2022 | 232 |
| 2023 | 159 |
| 2024 | 238 |
| 2025 | 125 |
| Category | Incidents |
|---|---|
| Sabotage | 1252 |
| Operational Control | 121 |
| Other | 51 |
| Category | Cause | Incidents |
|---|---|---|
| Sabotage | Oil Theft | 561 |
| Sabotage | Hacksaw Cut | 421 |
| Sabotage | Vandalization | 89 |
| Operational Control | Equipment Failure | 77 |
| Sabotage | Drilled Hole | 70 |
| Sabotage | Use Of Explosive | 48 |
| Operational Control | Corrosion | 38 |
| Sabotage | Oiltheft | 19 |
| Other | False Alarm | 16 |
| Other | No Spill | 16 |
| Sabotage | Sabotage | 14 |
| Other | Others | 7 |
| Sabotage | Hacksaw Cut & Explosive | 6 |
| Other | Facility Submerged | 5 |
| Sabotage | Others | 4 |
| State | Incidents |
|---|---|
| Bayelsa | 905 |
| Rivers | 644 |
| Delta | 63 |
| Imo | 32 |
| Facility Type | Incidents |
|---|---|
| Pipeline (Oil) | 440 |
| Flowline | 265 |
| Well Head | 24 |
| Pipeline (Gas) | 12 |
| Manifold | 5 |
| Tank | 4 |
| Flowstation | 3 |
| Sbm | 2 |
| Area | Incidents |
|---|---|
| LAR | 826 |
| SAR | 818 |
| Area | Closeby Facility | Incidents |
|---|---|---|
| SAR | TEBIDABA | 424 |
| LAR | OB/OB | 389 |
| LAR | OSHIE | 203 |
| SAR | CLOUGH CREEK | 135 |
| SAR | BRASS TERMINAL | 86 |
| LAR | KWALE/OKPAI | 49 |
| SAR | OGBOINBIRI | 47 |
| SAR | OBAMA | 39 |
| LAR | IDU | 29 |
| LAR | EOC/EBOCHA | 25 |
| LAR | AKRI | 18 |
| LAR | IRRI | 7 |
| SAR | BENIBOYE | 2 |
| LAR | OZOCHI RISER | 1 |
| LAR | PHC BASE | 1 |
4.1 Geographic Distribution of Incidents
The geographic concentration of incidents is one of the most operationally significant features of the dataset. With 88.1% of records carrying facility-level coordinates, the dataset supports direct spatial visualisation of where incidents occur, how they cluster by category, and whether the geographic pattern has shifted between 2019 and 2025. The interactive map below allows each incident to be examined individually. The faceted static map provides a year-by-year comparison of the spatial distribution across the full seven-year period.
Code
parse_dms <- function(x) {
x <- as.character(x)
x <- stringr::str_trim(x)
hemi <- stringr::str_extract(x, "^[NSEW]")
degrees <- as.numeric(stringr::str_extract(x, "(?<=[NSEW]\\s)\\d+(?=°)"))
minutes <- as.numeric(stringr::str_extract(x, "(?<=°\\s)\\d+(?=')"))
seconds <- as.numeric(stringr::str_extract(x, "(?<='\\s)\\d+\\.?\\d*(?='')"))
decimal <- degrees + minutes / 60 + seconds / 3600
decimal <- ifelse(hemi %in% c("S", "W"), -decimal, decimal)
decimal
}
incidents_mapped <- incidents %>%
filter(!is.na(lat_raw), !is.na(long_raw)) %>%
mutate(
lat = parse_dms(lat_raw),
long = parse_dms(long_raw)
) %>%
filter(
!is.na(lat), !is.na(long),
lat > 3 & lat < 7,
long > 4 & long < 9
)Code
library(leaflet)
library(leaflet.extras)
cat_levels <- incidents_mapped %>%
count(category, sort = TRUE) %>%
pull(category)
pal <- colorFactor(
palette = c("#e05c2a", "#2c7fb8", "#27ae60", "#9b59b6",
"#e67e22", "#1abc9c", "#e74c3c", "#95a5a6"),
domain = cat_levels
)
incidents_mapped <- incidents_mapped %>%
mutate(
radius = case_when(
is.na(qty_spilled_bbl) | qty_spilled_bbl == 0 ~ 4,
qty_spilled_bbl < 10 ~ 5,
qty_spilled_bbl < 100 ~ 7,
qty_spilled_bbl < 500 ~ 10,
TRUE ~ 14
),
popup_text = glue::glue(
"<b>Facility:</b> {facility}<br>",
"<b>Date:</b> {format(date_incident, '%d %b %Y')}<br>",
"<b>Category:</b> {category}<br>",
"<b>Cause:</b> {cause}<br>",
"<b>Spill Volume:</b> {ifelse(is.na(qty_spilled_bbl), 'Not recorded',
paste0(round(qty_spilled_bbl, 1), ' bbls'))}<br>",
"<b>State:</b> {state}<br>",
"<b>Area:</b> {area}"
)
)
leaflet(incidents_mapped) %>%
addProviderTiles(providers$CartoDB.Positron) %>%
addCircleMarkers(
lng = ~long,
lat = ~lat,
radius = ~radius,
color = ~pal(category),
fillColor = ~pal(category),
fillOpacity = 0.7,
stroke = TRUE,
weight = 0.8,
opacity = 0.9,
popup = ~popup_text,
clusterOptions = markerClusterOptions(
showCoverageOnHover = FALSE,
zoomToBoundsOnClick = TRUE,
maxClusterRadius = 40
)
) %>%
addLegend(
position = "bottomright",
pal = pal,
values = ~category,
title = "Incident Category",
opacity = 0.9
) %>%
addScaleBar(position = "bottomleft") %>%
setView(lng = 6.2, lat = 4.9, zoom = 8)Geographic distribution of pipeline incidents (2019 to 2025) — click any point for full incident details
Code
incidents_mapped %>%
filter(!is.na(year)) %>%
ggplot(aes(x = long, y = lat, colour = category)) +
geom_point(alpha = 0.5, size = 0.9) +
facet_wrap(~ year, ncol = 4) +
scale_colour_manual(
values = c(
"Sabotage" = "#e05c2a",
"Operational Control" = "#2c7fb8",
"TBD" = "#95a5a6"
),
na.value = "#bdc3c7"
) +
coord_fixed(ratio = 1) +
labs(
title = "Geographic Distribution of Pipeline Incidents by Year (2019 to 2025)",
subtitle = "Each point represents one incident — colour indicates category",
x = "Longitude",
y = "Latitude",
colour = "Category"
) +
theme_minimal(base_size = 10) +
theme(
plot.title = element_text(face = "bold"),
legend.position = "bottom",
panel.grid.minor = element_blank(),
axis.text = element_text(size = 7)
)4.2 Analytical Framework and Research Question
4.2.1 The Central Analytical Question
The OERNL pipeline incident dataset represents seven years of operational history across the Niger Delta, covering 1,695 recorded incidents at 446 unique facilities. In its raw form, this data is already used for regulatory reporting and periodic HSE reviews. What it has not been used for is systematic, evidence-based planning that moves beyond counting incidents and begins to answer the question of what the organisation should do differently in the year ahead.
As Project Services Manager, the decisions I face heading into the 2026 annual planning cycle span budget sizing, maintenance prioritisation, emergency response capacity, and security targeting. Each of these decisions is currently made on the basis of experience, engineering judgement, and partial data. This submission tests whether a structured analytical approach to the same dataset can produce more defensible, more precise, and more actionable answers.
The central question driving this analysis is:
This question is not resolvable by a single technique. It requires five distinct analytical lenses, each addressing a specific dimension of the 2026 planning problem. The answers are not independent. They are designed to converge in Section 10 into a single, integrated operational recommendation.
4.2.2 Technique-to-Decision Mapping
The five sub-questions that structure this analysis each correspond to a concrete planning decision facing the Project Services Department. The table below makes this mapping explicit and defines the standard against which all outputs in Sections 5 to 9 should be read.
| Section | Technique | Sub-Question | 2026 Operational Decision |
|---|---|---|---|
| Section 5 | Text Analytics and Sentiment Analysis | (a) | Surface latent root-cause themes in incident narrative fields that are missed by the structured cause field. Outputs feed directly into the operational risk register prioritisation for 2026 |
| Section 6 | Monte Carlo Simulation | (b) | Recommend the cleanup and remediation budget reserve at the 90th percentile (P90) confidence level, including production deferral exposure, for inclusion in the 2026 capital plan |
| Section 7 | Advanced Forecasting with Prophet | (c) | Generate a six-month forward forecast of monthly incident volume to size the 2026 emergency response capacity — manpower, containment boom inventory, and recovery vessel deployment |
| Section 8 | Asset Survival Analysis | (d) | Identify the top quartile of facilities by hazard rate (shortest expected time to next incident) for prioritised integrity inspection scheduling in the 2026 maintenance programme |
| Section 9 | Association Rules | (e) | Find cause, location, and facility category rule combinations with high lift on Sabotage classification. Outputs feed the security and community engagement targeting strategy |
4.2.3 Structure of the Analytical Sections
The five technique sections that follow are written as a professional analytical report, not as a sequence of independent exercises. Each section is organised into four parts to support comparability across techniques and to ensure that every output is connected to a business decision:
Technique overview — a concise explanation of the method, its statistical foundations, and the conditions under which it is appropriately applied. This part establishes that the technique is being applied correctly and with an understanding of its assumptions and limitations.
Operational justification — an explanation of why this technique is the right tool for the specific business question it addresses, given the structure and characteristics of the OERNL incident dataset. This part demonstrates that the choice of technique was deliberate and data-driven, not arbitrary.
Analysis and outputs — fully reproducible R and Python implementations presented in parallel tabsets, with all charts, model outputs, and diagnostic results rendered inline. Code is visible and structured for reproducibility and all outputs are captioned and interpreted.
Management interpretation — a plain-language reading of the key findings, written for a non-technical operations manager or Board member. Every result is restated in terms of what it means for a specific 2026 planning decision, without reference to statistical jargon. This is the part that connects analytical output to organisational action.
The integrated findings in Section 10 draw all five outputs together into a single prioritised recommendation for the 2026 operating plan, addressing the central analytical question in full.
5. Technique 1: Text Analytics and Sentiment Analysis
5.1 Technique Overview
Text analytics, as described by Marr (2017), is the process of extracting value from large quantities of unstructured text data, transforming narrative content that does not fit neatly into rows and columns into structured, queryable intelligence. The technique encompasses five core tasks: text categorisation, text clustering, concept extraction, sentiment analysis, and document summarisation. In this analysis, three of these tasks are applied:
- Term Frequency-Inverse Document Frequency (TF-IDF) for concept extraction
- Latent Dirichlet Allocation (LDA) for topic clustering
- AFINN and VADER (Valence Aware Dictionary and Sentiment Reasoner) lexicon scoring for sentiment analysis
The statistical foundations of these methods rest on two principles. TF-IDF weights a term by how frequently it appears in a document relative to how rarely it appears across all documents — terms that are distinctive to a subset of records score highly, while common words score near zero. LDA is a generative probabilistic model that assumes each document is a mixture of latent topics, and each topic is a distribution over words; the algorithm infers both distributions simultaneously from the observed word counts. Sentiment scoring assigns polarity values to words or phrases using pre-built lexicons. AFINN assigns integer scores from negative five to positive five to over 2,400 English words, while VADER is specifically calibrated for short, informal text and returns a compound score between negative one and positive one.
The key assumption shared by all three methods is the bag-of-words representation: word order is discarded and only word frequency is retained. This is a simplification that loses grammatical structure but is computationally tractable and well-suited to the short, formulaic language typical of incident reports.
5.2 Operational Justification
The OERNL incident dataset contains a free-text description field populated by field engineers and operations personnel at the time of each incident. This field is not currently analysed systematically. It is reviewed selectively during post-incident investigations but never mined across the full 1,695-record corpus.
Three specific gaps in the existing structured fields make text analytics necessary here:
- The cause field is a post-hoc classification, assigned after investigation and subject to categorisation bias. The description field captures the raw narrative before any classification decision is made, and may therefore contain causal signals that the structured field obscures or aggregates away
- LDA can surface latent themes that do not correspond to any existing category in the cause or category fields. Operational failures that are described differently across years but share an underlying mechanism are only visible at the corpus level
- Sentiment trends in incident language provide a proxy for the severity and urgency of incidents over time that no numeric field in the dataset captures directly. A shift toward more negative, blame-heavy language may signal deteriorating operational conditions before they appear in cost or volume statistics
The output of this section feeds directly into sub-question (a) of the central analytical question: what latent root-cause themes should reprioritise the 2026 operational risk register?
5.3 Analysis and Outputs
Code
# ── Step 1: Build tidy token dataset ─────────────────────────────
incident_tokens <- incidents %>%
filter(!is.na(description), nchar(description) > 15) %>%
mutate(doc_id = row_number()) %>%
dplyr::select(doc_id, year, category, description) %>%
unnest_tokens(word, description) %>%
anti_join(stop_words, by = "word") %>%
filter(!str_detect(word, "^\\d+$"))
cat("Total tokens after cleaning:", nrow(incident_tokens), "\n")Total tokens after cleaning: 5436
Code
cat("Unique documents with descriptions:", n_distinct(incident_tokens$doc_id), "\n")Unique documents with descriptions: 861
Code
# ── Step 2: TF-IDF — top distinctive terms by category ───────────
tfidf <- incident_tokens %>%
count(category, word, sort = TRUE) %>%
bind_tf_idf(word, category, n) %>%
arrange(desc(tf_idf))
top_tfidf <- tfidf %>%
filter(!category %in% c("TBD", "Unknown")) %>%
group_by(category) %>%
slice_max(tf_idf, n = 8, with_ties = FALSE) %>%
ungroup()
top_tfidf %>%
mutate(word = reorder_within(word, tf_idf, category)) %>%
ggplot(aes(tf_idf, word, fill = category)) +
geom_col(show.legend = FALSE) +
facet_wrap(~ category, scales = "free", ncol = 2) +
scale_y_reordered() +
scale_fill_brewer(palette = "Set2") +
labs(
title = "Most Distinctive Terms in Incident Descriptions by Category",
subtitle = "TF-IDF weighting surfaces terms unique to each incident category",
x = "TF-IDF Score",
y = NULL
) +
theme_minimal(base_size = 11) +
theme(plot.title = element_text(face = "bold"))Code
# ── Step 3: Document-term matrix for LDA ─────────────────────────
doc_term <- incident_tokens %>%
count(doc_id, word) %>%
cast_dtm(doc_id, word, n)
cat("DTM dimensions:", dim(doc_term), "\n")DTM dimensions: 861 282
Code
# ── Step 4: Coherence-based K selection ──────────────────────────
k_search <- FindTopicsNumber(
doc_term,
topics = 2:8,
metrics = c("CaoJuan2009", "Deveaud2014"),
control = list(seed = 42),
verbose = FALSE
)
FindTopicsNumber_plot(k_search)Code
# ── Step 5: Fit LDA at optimal K ─────────────────────────────────
best_k <- 5
lda_model <- LDA(doc_term, k = best_k, control = list(seed = 42))
topic_terms <- tidy(lda_model, matrix = "beta") %>%
group_by(topic) %>%
slice_max(beta, n = 10, with_ties = FALSE) %>%
ungroup()
topic_labels <- c(
"1" = "Topic 1: Pipeline Breach\n& Physical Damage",
"2" = "Topic 2: Production\nDeferral Impact",
"3" = "Topic 3: Third-Party\nInterference",
"4" = "Topic 4: Spill Response\n& Remediation",
"5" = "Topic 5: Corrosion\n& Integrity Failure"
)
topic_terms %>%
mutate(
topic = factor(paste0("Topic ", topic), levels = paste0("Topic ", 1:best_k)),
term = reorder_within(term, beta, topic)
) %>%
ggplot(aes(beta, term, fill = factor(topic))) +
geom_col(show.legend = FALSE) +
facet_wrap(~ topic, scales = "free", ncol = 3) +
scale_y_reordered() +
scale_fill_brewer(palette = "Set2") +
labs(
title = "LDA Topic Model: Top Terms per Topic",
subtitle = glue::glue("K = {best_k} topics selected via CaoJuan2009 and Deveaud2014 coherence metrics"),
x = "Per-Topic Word Probability (Beta)",
y = NULL
) +
theme_minimal(base_size = 10) +
theme(plot.title = element_text(face = "bold"))Code
# ── Step 6: Sentiment trend via AFINN ────────────────────────────
sentiment_by_year <- incidents %>%
filter(!is.na(description), nchar(description) > 15) %>%
mutate(sentiment = syuzhet::get_sentiment(description, method = "afinn")) %>%
group_by(year) %>%
summarise(
mean_sent = mean(sentiment, na.rm = TRUE),
median_sent = median(sentiment, na.rm = TRUE),
n = n(),
.groups = "drop"
)
ggplot(sentiment_by_year, aes(x = year)) +
geom_hline(yintercept = 0, linetype = "dashed", colour = "grey60") +
geom_line(aes(y = mean_sent), linewidth = 1.1, colour = "#2c7fb8") +
geom_point(aes(y = mean_sent, size = n), colour = "#2c7fb8") +
geom_text(aes(y = mean_sent, label = round(mean_sent, 2)),
vjust = -1, size = 3.2, colour = "#2c7fb8", fontface = "bold") +
scale_x_continuous(breaks = 2019:2025) +
scale_size_continuous(name = "Incident count", range = c(2, 6)) +
labs(
title = "Annual Mean Sentiment Tone of Incident Descriptions (AFINN)",
subtitle = "Negative scores indicate stronger blame, urgency, or severity language in field narratives",
y = "Mean AFINN Sentiment Score",
x = NULL
) +
theme_minimal(base_size = 12) +
theme(
plot.title = element_text(face = "bold"),
panel.grid.minor = element_blank()
)Code
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.ticker as mticker
import seaborn as sns
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
incidents_py = r.incidents_py
docs = (incidents_py
.dropna(subset=["description"])
.query("description.str.len() > 15", engine="python")
.copy())
print(f"Documents available for text analysis: {len(docs):,}")Documents available for text analysis: 861
Code
# ── Step 1: TF-IDF vectorisation ─────────────────────────────────
vec = TfidfVectorizer(
stop_words = "english",
ngram_range = (1, 2),
max_df = 0.90,
min_df = 3
)
X = vec.fit_transform(docs["description"].astype(str))
feature_names = vec.get_feature_names_out()
mean_tfidf = pd.Series(X.mean(axis=0).A1, index=feature_names)
top20 = mean_tfidf.sort_values(ascending=False).head(20).reset_index()
top20.columns = ["term", "mean_tfidf"]
fig, ax = plt.subplots(figsize=(9, 6))
sns.barplot(data=top20, x="mean_tfidf", y="term",
palette="Blues_r", ax=ax)
ax.set_title("Top 20 Distinctive Terms Across All Incident Descriptions\n(TF-IDF, unigrams and bigrams)",
fontsize=12, fontweight="bold")
ax.set_xlabel("Mean TF-IDF Score")
ax.set_ylabel(None)
plt.tight_layout()
plt.show()Code
# ── Step 2: LDA topic model ───────────────────────────────────────
lda_py = LatentDirichletAllocation(
n_components = 5,
random_state = 42,
max_iter = 20
)
lda_py.fit(X)LatentDirichletAllocation(max_iter=20, n_components=5, random_state=42)In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
Parameters
Code
fig, axes = plt.subplots(2, 3, figsize=(14, 8))
axes = axes.flatten()
topic_names = [
"Pipeline Breach & Physical Damage",
"Production Deferral Impact",
"Third-Party Interference",
"Spill Response & Remediation",
"Corrosion & Integrity Failure"
]
for i, (comp, ax) in enumerate(zip(lda_py.components_, axes[:5])):
top_idx = comp.argsort()[-10:][::-1]
top_terms = [feature_names[j] for j in top_idx]
top_vals = comp[top_idx]
sns.barplot(x=top_vals, y=top_terms, palette="Set2", ax=ax)
ax.set_title(f"Topic {i+1}: {topic_names[i]}",
fontsize=9, fontweight="bold")
ax.set_xlabel("Component Weight")
ax.set_ylabel(None)
axes[5].set_visible(False)
plt.suptitle("LDA Topic Model: Top Terms per Topic (Python, sklearn)",
fontsize=12, fontweight="bold", y=1.01)
plt.tight_layout()
plt.show()Code
# ── Step 3: VADER sentiment trend ────────────────────────────────
sia = SentimentIntensityAnalyzer()
docs = docs.assign(
compound = docs["description"].astype(str).map(
lambda s: sia.polarity_scores(s)["compound"]
)
)
trend = (docs.groupby("year")["compound"]
.agg(["mean", "count"])
.reset_index()
.rename(columns={"mean": "mean_compound", "count": "n"}))
fig, ax = plt.subplots(figsize=(9, 5))
ax.axhline(0, linestyle="--", color="grey", linewidth=0.8)
ax.plot(trend["year"], trend["mean_compound"],
marker="o", linewidth=1.8, color="#2c7fb8")
for _, row in trend.iterrows():
ax.annotate(f"{row['mean_compound']:.2f}",
(row["year"], row["mean_compound"]),
textcoords="offset points", xytext=(0, 8),
fontsize=9, color="#2c7fb8", fontweight="bold")Text(0, 8, '-0.13')
Text(0, 8, '-0.09')
Text(0, 8, '-0.15')
Text(0, 8, '-0.19')
Text(0, 8, '-0.17')
Text(0, 8, '-0.13')
Text(0, 8, '-0.05')
Code
ax.set_title("Annual Mean VADER Sentiment Score of Incident Descriptions",
fontsize=12, fontweight="bold")Text(0.5, 1.0, 'Annual Mean VADER Sentiment Score of Incident Descriptions')
Code
ax.set_xlabel(None)Text(0.5, 0, '')
Code
ax.set_ylabel("Mean VADER Compound Score")Text(0, 0.5, 'Mean VADER Compound Score')
Code
ax.xaxis.set_major_locator(mticker.MultipleLocator(1))
plt.tight_layout()
plt.show()Code
print("\nSentiment trend by year:")
Sentiment trend by year:
Code
print(trend.to_string(index=False)) year mean_compound n
2019 -0.131294 305
2020 -0.091726 58
2021 -0.148253 156
2022 -0.189747 119
2023 -0.174168 66
2024 -0.125047 124
2025 -0.053721 33
5.4 Management Interpretation
What the TF-IDF analysis reveals
The TF-IDF analysis surfaces the words that most distinctively characterise each incident category. For the Sabotage category, terms relating to physical breach, cutting, and tapping dominate. This distinction matters for response planning:
- Language associated with mechanically sophisticated interventions implies organised activity requiring security engagement
- Language associated with opportunistic theft taps implies community-level economic desperation requiring social intervention
The structured cause field collapses both patterns into the same category. The text field separates them, providing the basis for a more targeted response strategy in 2026.
What the topic model reveals
The five topics extracted by the LDA model do not map neatly onto the five categories in the structured cause field. This is the key finding of this section. Three observations are worth highlighting:
- The model identifies a topic characterised by spill response and remediation language that cuts across multiple cause categories. Incidents of very different types trigger the same operational response vocabulary, suggesting that response protocols are cause-agnostic in practice even when the underlying causes differ significantly
- The third-party interference topic clusters language around physical installation and cutting activity at valve points and flowlines — a pattern consistent with the dominant sabotage mechanism identified in Section 4 but now confirmed from unstructured text independently of the structured cause field
- For the 2026 risk register, response capacity planning should be sized by expected response burden per topic cluster, not by incident count alone
What the sentiment trend reveals
The AFINN and VADER sentiment scores measure the emotional tone of the language field engineers use when describing incidents. A consistently negative score across all years is expected given that these are incident reports. What matters operationally is the direction of the trend over time:
- A worsening trend toward more negative scores signals that field personnel are describing incidents in increasingly severe or urgent terms
- This may be an early indicator of deteriorating operational conditions that has not yet surfaced in the quantitative fields such as cost or volume
- The direction and magnitude of the trend should inform whether the 2026 emergency response budget should be held flat or increased relative to 2025
Recommendation feeding into Section 10
The text analysis supports three specific inputs to the 2026 operational risk register:
- The LDA topic most associated with third-party interference should be mapped to the facilities with the highest topic prevalence scores, and those facilities elevated in the security engagement programme
- The response-and-remediation topic should be used to size boom and vessel inventory independent of cause category
- Any year-on-year worsening in sentiment score should trigger a formal review of emergency response protocols before the annual operating plan is finalised
6. Technique 2: Monte Carlo Simulation
6.1 Technique Overview
Monte Carlo simulation is a mathematical risk assessment technique that approximates the probability of uncertain outcomes by running thousands of computerised simulations of random variables (Marr, 2017). Rather than producing a single-point estimate, which implies a false precision, the technique substitutes probability distributions for each uncertain input and repeats the calculation thousands of times, each run drawing randomly from those distributions. The result is not one answer but a full distribution of possible outcomes, from which percentile values can be extracted to communicate risk in probabilistic terms.
The technique is especially well-suited to the oil and gas sector, where cost estimation is inherently uncertain and the consequences of under-budgeting are severe (Marr, 2017). Three probability distributions are applied in this analysis, each chosen to match the statistical properties of the underlying data:
- Poisson distribution for annual incident count: The Poisson distribution models the number of events occurring in a fixed time interval when events are independent and arrive at a constant average rate. Annual pipeline incident counts satisfy these conditions: incidents across different facilities are broadly independent, and the historical mean count provides a stable rate parameter (lambda)
- Lognormal distribution for per-incident spill volume: Spill volumes are strictly positive and exhibit strong right skew, with a median of 1 barrel but a maximum exceeding 5,000 barrels. The lognormal distribution, obtained by fitting a normal distribution to the log-transformed volumes, accommodates this skew naturally and prevents the simulation from generating negative or implausibly large values
- Gamma distribution for per-incident remediation cost: Remediation costs are also strictly positive and right-skewed, but with a heavier tail than spill volumes. The Gamma distribution, parameterised by shape and rate, is the standard choice for modelling positive continuous quantities with this profile and is widely used in insurance and project risk modelling
The simulation runs 10,000 iterations. In each iteration:
- An annual incident count is drawn from the Poisson distribution
- For each simulated incident, a spill volume and a remediation cost are independently drawn from their respective fitted distributions
- The annual totals are summed
The resulting 10,000 annual cost estimates form an empirical distribution from which 10th percentile (P10), 50th percentile (P50), and 90th percentile (P90) values are extracted.
6.2 Operational Justification
The Project Services Department is required to submit risk-adjusted cost estimates to management for inclusion in the annual operating plan and Board reporting pack. The current practice relies on historical averages and engineering judgement to size the remediation budget reserve — a single-point estimate that provides no information about the probability of that estimate being correct or the magnitude of potential overrun.
Monte Carlo simulation addresses this gap directly in three ways:
- Evidence-based distributions: Rather than assuming costs follow a normal distribution, the simulation fits distributions to the actual historical data. A lognormal is used for spill volumes and a Gamma for remediation costs, ensuring the model reflects the true shape of OERNL’s cost exposure including the heavy right tail driven by rare catastrophic spills
- Defensible budget reserve: The P90 output answers the specific question management needs before the plan is submitted: what single budget figure, if held in reserve, would be sufficient in 90 out of every 100 years? This is a more honest and more useful number than a point estimate
- Tornado sensitivity analysis: By re-running the simulation with each input held fixed at its mean while the others vary, a tornado chart identifies which uncertain input drives the most variance in the outcome. This tells the department where tighter data collection or risk controls would have the greatest impact on forecast accuracy
This section addresses sub-question (b) of the central analytical question: what cleanup and remediation budget reserve should be recommended at P90 confidence level for the 2026 capital plan?
6.3 Analysis and Outputs
Code
# ── Step 1: Extract fitted input data ────────────────────────────
qty <- incidents$qty_spilled_bbl[
!is.na(incidents$qty_spilled_bbl) & incidents$qty_spilled_bbl > 0
]
cost <- incidents$cost_usd[
!is.na(incidents$cost_usd) & incidents$cost_usd > 0
]
incidents_per_year <- incidents %>%
count(year) %>%
pull(n)
# ── Step 2: Fit distributions ─────────────────────────────────────
fit_qty <- fitdist(log(qty), "norm")
fit_cost <- fitdist(cost, "gamma", lower = c(0, 0))
lambda <- mean(incidents_per_year)
# ── Step 2b: Distribution fitting summary table ───────────────────
tibble::tibble(
Parameter = c(
"Non-zero spill volumes for fitting",
"Non-zero cost records for fitting",
"Poisson lambda (mean annual incidents)",
"Lognormal mean (log scale)",
"Lognormal SD (log scale)",
"Gamma shape",
"Gamma rate"
),
Value = c(
length(qty),
length(cost),
round(lambda, 1),
round(fit_qty$estimate["mean"], 4),
round(fit_qty$estimate["sd"], 4),
round(fit_cost$estimate["shape"], 4),
round(fit_cost$estimate["rate"], 6)
)
) %>%
kableExtra::kable(caption = "Distribution Fitting Summary — Monte Carlo Inputs") %>%
kableExtra::kable_styling(
bootstrap_options = c("striped", "condensed"),
full_width = FALSE
) %>%
kableExtra::column_spec(1, bold = TRUE)| Parameter | Value |
|---|---|
| Non-zero spill volumes for fitting | 996.000000 |
| Non-zero cost records for fitting | 51.000000 |
| Poisson lambda (mean annual incidents) | 242.100000 |
| Lognormal mean (log scale) | 0.468300 |
| Lognormal SD (log scale) | 2.653100 |
| Gamma shape | 28.802300 |
| Gamma rate | 0.000407 |
Code
# ── Step 3: Plot fitted distributions against empirical data ──────
par(mfrow = c(1, 2))
denscomp(fit_qty, main = "Spill Volume: Lognormal Fit (log scale)")
denscomp(fit_cost, main = "Remediation Cost: Gamma Fit")Code
par(mfrow = c(1, 1))
# ── Step 4: Run Monte Carlo simulation — 10,000 iterations ───────
set.seed(42)
N <- 10000
sim <- replicate(N, {
k <- rpois(1, lambda = lambda)
if (k == 0) return(c(spill_bbl = 0, cost_usd = 0))
spills <- exp(rnorm(k,
mean = fit_qty$estimate["mean"],
sd = fit_qty$estimate["sd"]))
costs <- rgamma(k,
shape = fit_cost$estimate["shape"],
rate = fit_cost$estimate["rate"])
c(spill_bbl = sum(spills), cost_usd = sum(costs))
}) %>%
t() %>%
as_tibble()
# ── Step 5: P10 / P50 / P90 summary table ────────────────────────
q <- quantile(sim$cost_usd, c(0.10, 0.50, 0.90))
percentile_table <- tibble(
Percentile = c("P10 (Optimistic)", "P50 (Most Likely)", "P90 (Conservative Reserve)"),
Annual_Cost = scales::dollar(q),
Interpretation = c(
"In 1-in-10 low-incident years the total bill stays below this figure",
"Half of all simulated years produce costs below this — the central estimate",
"Hold this amount in reserve: costs exceed this in only 1 in 10 years"
)
)
kableExtra::kable(percentile_table,
col.names = c("Percentile", "Annual Cost (USD)", "Plain-Language Meaning"),
caption = "Table MC1: Monte Carlo Simulation — P10 / P50 / P90 Cost Outcomes") %>%
kableExtra::kable_styling(
bootstrap_options = c("striped", "hover", "condensed"),
full_width = TRUE
) %>%
kableExtra::column_spec(3, italic = TRUE)| Percentile | Annual Cost (USD) | Plain-Language Meaning |
|---|---|---|
| P10 (Optimistic) | $15,710,962 | In 1-in-10 low-incident years the total bill stays below this figure |
| P50 (Most Likely) | $17,136,316 | Half of all simulated years produce costs below this — the central estimate |
| P90 (Conservative Reserve) | $18,603,017 | Hold this amount in reserve: costs exceed this in only 1 in 10 years |
Code
# ── Step 6: Histogram of simulated annual costs ───────────────────
ggplot(sim, aes(x = cost_usd)) +
geom_histogram(bins = 80, fill = "#2c7fb8", colour = "white",
linewidth = 0.2) +
geom_vline(xintercept = q["10%"], linetype = "dashed",
colour = "#27ae60", linewidth = 0.9) +
geom_vline(xintercept = q["50%"], linetype = "dashed",
colour = "#e67e22", linewidth = 0.9) +
geom_vline(xintercept = q["90%"], linetype = "dashed",
colour = "#e05c2a", linewidth = 1.2) +
annotate("text", x = q["10%"], y = Inf, vjust = 1.5, hjust = -0.1,
label = paste0("P10\n", scales::dollar(q["10%"])),
colour = "#27ae60", size = 3.2, fontface = "bold") +
annotate("text", x = q["50%"], y = Inf, vjust = 1.5, hjust = -0.1,
label = paste0("P50\n", scales::dollar(q["50%"])),
colour = "#e67e22", size = 3.2, fontface = "bold") +
annotate("text", x = q["90%"], y = Inf, vjust = 1.5, hjust = -0.1,
label = paste0("P90\n", scales::dollar(q["90%"])),
colour = "#e05c2a", size = 3.2, fontface = "bold") +
scale_x_continuous(labels = scales::dollar_format()) +
labs(
title = "Monte Carlo Simulation: Distribution of Simulated Annual Remediation Costs",
subtitle = glue::glue("10,000 simulation runs | Lambda = {round(lambda,1)} incidents/year | P90 reserve shown in orange-red"),
x = "Simulated Annual Remediation Cost (USD)",
y = "Number of Simulation Runs"
) +
theme_minimal(base_size = 12) +
theme(
plot.title = element_text(face = "bold"),
panel.grid.minor = element_blank()
)Code
# ── Step 7: Tornado chart — sensitivity analysis ──────────────────
set.seed(42)
run_fixed <- function(fix_qty = FALSE, fix_cost = FALSE,
fix_lambda = FALSE) {
replicate(N, {
k <- if (fix_lambda) round(lambda) else rpois(1, lambda = lambda)
if (k == 0) return(0)
sp <- if (fix_qty) rep(exp(fit_qty$estimate["mean"]), k)
else exp(rnorm(k, fit_qty$estimate["mean"],
fit_qty$estimate["sd"]))
co <- if (fix_cost) rep(fit_cost$estimate["shape"] /
fit_cost$estimate["rate"], k)
else rgamma(k, shape = fit_cost$estimate["shape"],
rate = fit_cost$estimate["rate"])
sum(co)
}) %>% quantile(0.90)
}
base_p90 <- q["90%"]
p90_fix_qty <- run_fixed(fix_qty = TRUE)
p90_fix_cost <- run_fixed(fix_cost = TRUE)
p90_fix_lam <- run_fixed(fix_lambda = TRUE)
tornado_df <- tibble(
Input = c("Per-incident spill volume\n(Lognormal)",
"Per-incident remediation cost\n(Gamma)",
"Annual incident count\n(Poisson)"),
P90_reduction = c(
base_p90 - p90_fix_qty,
base_p90 - p90_fix_cost,
base_p90 - p90_fix_lam
)
) %>%
arrange(desc(P90_reduction)) %>%
mutate(Input = forcats::fct_reorder(Input, P90_reduction))
ggplot(tornado_df,
aes(x = P90_reduction, y = Input,
fill = P90_reduction > 0)) +
geom_col(width = 0.5) +
geom_text(aes(label = scales::dollar(P90_reduction, accuracy = 1)),
hjust = -0.15, fontface = "bold", size = 3.5) +
scale_fill_manual(values = c("TRUE" = "#e05c2a", "FALSE" = "#2c7fb8")) +
scale_x_continuous(labels = scales::dollar_format(),
expand = expansion(mult = c(0, 0.2))) +
labs(
title = "Tornado Chart: Reduction in P90 Cost When Each Input is Held Fixed",
subtitle = "The largest bar identifies the input driving the most uncertainty in the P90 reserve estimate",
x = "Reduction in P90 Annual Cost (USD)",
y = NULL
) +
theme_minimal(base_size = 12) +
theme(
legend.position = "none",
plot.title = element_text(face = "bold"),
panel.grid.major.y = element_blank()
)Code
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.ticker as mticker
from scipy import stats
inc = r.incidents_py
qty = pd.to_numeric(inc["qty_spilled_bbl"], errors="coerce").dropna()
qty = qty[qty > 0].values
cost = pd.to_numeric(inc["cost_usd"], errors="coerce").dropna()
cost = cost[cost > 0].values
lam = inc.groupby("year").size().mean()
print(f"Non-zero spill volumes: {len(qty):,}")Non-zero spill volumes: 996
Code
print(f"Non-zero cost records: {len(cost):,}")Non-zero cost records: 51
Code
print(f"Mean annual incidents (lambda): {lam:.1f}")Mean annual incidents (lambda): 242.1
Code
# ── Step 1: Fit distributions ─────────────────────────────────────
mu_log = np.log(qty).mean()
sd_log = np.log(qty).std()
shape, loc_g, scale_g = stats.gamma.fit(cost, floc=0)
print(f"\nLognormal fit — mu: {mu_log:.3f}, sigma: {sd_log:.3f}")
Lognormal fit — mu: 0.468, sigma: 2.653
Code
print(f"Gamma fit — shape: {shape:.3f}, scale: {scale_g:.3f}")Gamma fit — shape: 28.830, scale: 2455.202
Code
# ── Step 2: Plot fitted distributions ────────────────────────────
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
x_qty = np.linspace(qty.min(), np.percentile(qty, 99), 300)
axes[0].hist(qty, bins=60, density=True, color="#a8c8e8",
edgecolor="white", alpha=0.7, label="Observed")(array([1.13343347e-02, 2.38366660e-04, 8.34283310e-05, 5.95916650e-05,
3.57549990e-05, 1.19183330e-05, 3.57549990e-05, 2.38366660e-05,
0.00000000e+00, 1.19183330e-05, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 1.19183330e-05, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 1.19183330e-05, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 1.19183330e-05]), array([1.00000000e-03, 8.42423167e+01, 1.68483633e+02, 2.52724950e+02,
3.36966267e+02, 4.21207583e+02, 5.05448900e+02, 5.89690217e+02,
6.73931533e+02, 7.58172850e+02, 8.42414167e+02, 9.26655483e+02,
1.01089680e+03, 1.09513812e+03, 1.17937943e+03, 1.26362075e+03,
1.34786207e+03, 1.43210338e+03, 1.51634470e+03, 1.60058602e+03,
1.68482733e+03, 1.76906865e+03, 1.85330997e+03, 1.93755128e+03,
2.02179260e+03, 2.10603392e+03, 2.19027523e+03, 2.27451655e+03,
2.35875787e+03, 2.44299918e+03, 2.52724050e+03, 2.61148182e+03,
2.69572313e+03, 2.77996445e+03, 2.86420577e+03, 2.94844708e+03,
3.03268840e+03, 3.11692972e+03, 3.20117103e+03, 3.28541235e+03,
3.36965367e+03, 3.45389498e+03, 3.53813630e+03, 3.62237762e+03,
3.70661893e+03, 3.79086025e+03, 3.87510157e+03, 3.95934288e+03,
4.04358420e+03, 4.12782552e+03, 4.21206683e+03, 4.29630815e+03,
4.38054947e+03, 4.46479078e+03, 4.54903210e+03, 4.63327342e+03,
4.71751473e+03, 4.80175605e+03, 4.88599737e+03, 4.97023868e+03,
5.05448000e+03]), <BarContainer object of 60 artists>)
Code
axes[0].plot(x_qty,
stats.lognorm.pdf(x_qty, s=sd_log,
scale=np.exp(mu_log)),
color="#2c7fb8", linewidth=2, label="Lognormal fit")[<matplotlib.lines.Line2D object at 0x00000173F656AE50>]
Code
axes[0].set_title("Spill Volume: Lognormal Fit", fontweight="bold")Text(0.5, 1.0, 'Spill Volume: Lognormal Fit')
Code
axes[0].set_xlabel("Barrels")Text(0.5, 0, 'Barrels')
Code
axes[0].legend()<matplotlib.legend.Legend object at 0x00000173F2C6C5D0>
Code
x_cost = np.linspace(cost.min(), np.percentile(cost, 99), 300)
axes[1].hist(cost, bins=40, density=True, color="#a8c8e8",
edgecolor="white", alpha=0.7, label="Observed")(array([1.30718954e-05, 0.00000000e+00, 0.00000000e+00, 1.30718954e-05,
0.00000000e+00, 0.00000000e+00, 7.84313725e-05, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 2.87581699e-04, 0.00000000e+00, 0.00000000e+00,
5.22875817e-05, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
6.53594771e-05, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 7.84313725e-05, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 2.61437908e-05, 0.00000000e+00,
0.00000000e+00, 1.30718954e-05, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 3.92156863e-05]), array([ 45000., 46500., 48000., 49500., 51000., 52500., 54000.,
55500., 57000., 58500., 60000., 61500., 63000., 64500.,
66000., 67500., 69000., 70500., 72000., 73500., 75000.,
76500., 78000., 79500., 81000., 82500., 84000., 85500.,
87000., 88500., 90000., 91500., 93000., 94500., 96000.,
97500., 99000., 100500., 102000., 103500., 105000.]), <BarContainer object of 40 artists>)
Code
axes[1].plot(x_cost,
stats.gamma.pdf(x_cost, a=shape, loc=0, scale=scale_g),
color="#e05c2a", linewidth=2, label="Gamma fit")[<matplotlib.lines.Line2D object at 0x00000173F67DEBD0>]
Code
axes[1].set_title("Remediation Cost: Gamma Fit", fontweight="bold")Text(0.5, 1.0, 'Remediation Cost: Gamma Fit')
Code
axes[1].set_xlabel("USD")Text(0.5, 0, 'USD')
Code
axes[1].legend()<matplotlib.legend.Legend object at 0x00000173F6B1B2D0>
Code
plt.suptitle("Fitted Probability Distributions for Monte Carlo Inputs",
fontsize=12, fontweight="bold")Text(0.5, 0.98, 'Fitted Probability Distributions for Monte Carlo Inputs')
Code
plt.tight_layout()
plt.show()Code
# ── Step 3: Monte Carlo simulation — 10,000 runs ─────────────────
rng = np.random.default_rng(42)
N = 10_000
sim_cost = []
sim_spill = []
for _ in range(N):
k = rng.poisson(lam)
if k == 0:
sim_cost.append(0)
sim_spill.append(0)
else:
spills = stats.lognorm.rvs(s=sd_log, scale=np.exp(mu_log),
size=k, random_state=rng)
costs = stats.gamma.rvs(a=shape, loc=0, scale=scale_g,
size=k, random_state=rng)
sim_cost.append(costs.sum())
sim_spill.append(spills.sum())
sim_cost = pd.Series(sim_cost)
sim_spill = pd.Series(sim_spill)
# ── Step 4: P10 / P50 / P90 table ────────────────────────────────
p10, p50, p90 = sim_cost.quantile([0.10, 0.50, 0.90])
summary = pd.DataFrame({
"Percentile" : ["P10 (Optimistic)", "P50 (Most Likely)",
"P90 (Reserve)"],
"Annual Cost" : [f"${p10:,.0f}", f"${p50:,.0f}", f"${p90:,.0f}"]
})
print("\nMonte Carlo P10 / P50 / P90 Results:")
Monte Carlo P10 / P50 / P90 Results:
Code
print(summary.to_string(index=False)) Percentile Annual Cost
P10 (Optimistic) $15,732,964
P50 (Most Likely) $17,114,734
P90 (Reserve) $18,542,653
Code
# ── Step 5: Histogram with percentile markers ─────────────────────
fig, ax = plt.subplots(figsize=(10, 5))
ax.hist(sim_cost, bins=80, color="#2c7fb8", edgecolor="white",
linewidth=0.2, label="Simulated annual cost")(array([ 1., 0., 0., 0., 0., 0., 0., 1., 4., 10., 11.,
8., 14., 9., 18., 21., 27., 24., 36., 48., 59., 64.,
89., 102., 113., 130., 144., 183., 202., 211., 228., 238., 287.,
308., 334., 295., 339., 358., 349., 332., 403., 361., 377., 340.,
377., 315., 302., 301., 287., 276., 251., 224., 235., 186., 161.,
148., 131., 124., 109., 80., 63., 67., 51., 45., 37., 37.,
20., 20., 15., 13., 12., 9., 7., 7., 4., 1., 3.,
3., 0., 1.]), array([12943131.41294139, 13044877.11588526, 13146622.81882913,
13248368.52177301, 13350114.22471688, 13451859.92766075,
13553605.63060463, 13655351.3335485 , 13757097.03649237,
13858842.73943624, 13960588.44238012, 14062334.14532399,
14164079.84826786, 14265825.55121174, 14367571.25415561,
14469316.95709948, 14571062.66004336, 14672808.36298723,
14774554.0659311 , 14876299.76887498, 14978045.47181885,
15079791.17476272, 15181536.8777066 , 15283282.58065047,
15385028.28359434, 15486773.98653822, 15588519.68948209,
15690265.39242596, 15792011.09536983, 15893756.79831371,
15995502.50125758, 16097248.20420145, 16198993.90714533,
16300739.6100892 , 16402485.31303307, 16504231.01597695,
16605976.71892082, 16707722.42186469, 16809468.12480856,
16911213.82775244, 17012959.53069631, 17114705.23364019,
17216450.93658406, 17318196.63952793, 17419942.3424718 ,
17521688.04541568, 17623433.74835955, 17725179.45130343,
17826925.1542473 , 17928670.85719117, 18030416.56013504,
18132162.26307892, 18233907.96602279, 18335653.66896667,
18437399.37191054, 18539145.07485441, 18640890.77779828,
18742636.48074216, 18844382.18368603, 18946127.8866299 ,
19047873.58957378, 19149619.29251765, 19251364.99546152,
19353110.6984054 , 19454856.40134927, 19556602.10429314,
19658347.80723701, 19760093.51018089, 19861839.21312476,
19963584.91606864, 20065330.61901251, 20167076.32195638,
20268822.02490025, 20370567.72784413, 20472313.430788 ,
20574059.13373187, 20675804.83667575, 20777550.53961962,
20879296.24256349, 20981041.94550737, 21082787.64845124]), <BarContainer object of 80 artists>)
Code
for val, label, col in zip([p10, p50, p90],
["P10", "P50", "P90"],
["#27ae60", "#e67e22", "#e05c2a"]):
ax.axvline(val, linestyle="--", color=col, linewidth=1.5)
ax.text(val, ax.get_ylim()[1] * 0.92,
f"{label}\n${val:,.0f}",
color=col, fontsize=8, fontweight="bold", ha="left")<matplotlib.lines.Line2D object at 0x00000173F6A6E610>
Text(15732963.99526468, 389.298, 'P10\n$15,732,964')
<matplotlib.lines.Line2D object at 0x00000173F6C39DD0>
Text(17114734.003167838, 389.298, 'P50\n$17,114,734')
<matplotlib.lines.Line2D object at 0x00000173F6CD4210>
Text(18542653.299791407, 389.298, 'P90\n$18,542,653')
Code
ax.xaxis.set_major_formatter(mticker.FuncFormatter(
lambda x, _: f"${x:,.0f}"
))
ax.set_title(
"Monte Carlo Simulation: Simulated Annual Remediation Cost Distribution\n(10,000 runs, Python)",
fontsize=11, fontweight="bold")Text(0.5, 1.0, 'Monte Carlo Simulation: Simulated Annual Remediation Cost Distribution\n(10,000 runs, Python)')
Code
ax.set_xlabel("Annual Remediation Cost (USD)")Text(0.5, 0, 'Annual Remediation Cost (USD)')
Code
ax.set_ylabel("Simulation Runs")Text(0, 0.5, 'Simulation Runs')
Code
plt.tight_layout()
plt.show()Code
# ── Step 6: Tornado chart ─────────────────────────────────────────
def run_p90_fixed(fix_qty=False, fix_cost=False,
fix_lam=False, seed=99):
rng2 = np.random.default_rng(seed)
results = []
for _ in range(N):
k = round(lam) if fix_lam else rng2.poisson(lam)
if k == 0:
results.append(0)
continue
sp = (np.full(k, np.exp(mu_log)) if fix_qty
else stats.lognorm.rvs(s=sd_log,
scale=np.exp(mu_log),
size=k,
random_state=rng2))
co = (np.full(k, shape * scale_g) if fix_cost
else stats.gamma.rvs(a=shape, loc=0, scale=scale_g,
size=k, random_state=rng2))
results.append(co.sum())
return pd.Series(results).quantile(0.90)
p90_base = p90
p90_fix_qty = run_p90_fixed(fix_qty=True)
p90_fix_cost = run_p90_fixed(fix_cost=True)
p90_fix_lam = run_p90_fixed(fix_lam=True)
tornado = pd.DataFrame({
"Input": ["Per-incident spill volume\n(Lognormal)",
"Per-incident remediation cost\n(Gamma)",
"Annual incident count\n(Poisson)"],
"P90_reduction": [p90_base - p90_fix_qty,
p90_base - p90_fix_cost,
p90_base - p90_fix_lam]
}).sort_values("P90_reduction", ascending=True)
fig, ax = plt.subplots(figsize=(9, 4))
colors = ["#e05c2a" if v > 0 else "#2c7fb8"
for v in tornado["P90_reduction"]]
ax.barh(tornado["Input"], tornado["P90_reduction"],
color=colors, height=0.5)<BarContainer object of 3 artists>
Code
for i, (val, _) in enumerate(
zip(tornado["P90_reduction"], tornado["Input"])):
ax.text(val + abs(tornado["P90_reduction"].max()) * 0.02,
i, f"${val:,.0f}", va="center",
fontsize=9, fontweight="bold")Text(-34849.575106033604, 0, '$-57,700')
Text(20013.861448368505, 1, '$-2,837')
Text(1165388.6445052088, 2, '$1,142,538')
Code
ax.set_title("Tornado Chart: Which Input Drives the Most P90 Uncertainty?",
fontsize=11, fontweight="bold")Text(0.5, 1.0, 'Tornado Chart: Which Input Drives the Most P90 Uncertainty?')
Code
ax.set_xlabel("Reduction in P90 Annual Cost (USD)")Text(0.5, 0, 'Reduction in P90 Annual Cost (USD)')
Code
ax.xaxis.set_major_formatter(
mticker.FuncFormatter(lambda x, _: f"${x:,.0f}"))
plt.tight_layout()
plt.show()6.4 Management Interpretation
What the simulation is doing in plain terms
Think of the simulation as running 10,000 imaginary versions of the next year at OERNL. In each version, the number of incidents, the volume of each spill, and the cost of each cleanup are all allowed to vary randomly but within the bounds that history says is realistic. Some of those imaginary years are quiet, with few incidents and low costs. Others are expensive, with many incidents and at least one large spill. When all 10,000 years are lined up from cheapest to most expensive, the bill at any chosen probability level can be read directly from the distribution.
What the three numbers mean
The simulation produces three numbers that each serve a different planning purpose:
- P10 (Optimistic): The cost in a genuinely good year where incidents are fewer than average and none are catastrophic. Costs fall below this level in only one in ten simulated years. It is not a sensible number to budget against because it assumes conditions better than what history typically delivers
- P50 (Most Likely): The most honest central estimate. Half of all simulated years cost less than this and half cost more. If a single number is required for a planning document, this is the right one to use
- P90 (Conservative Reserve): The recommended budget reserve. Holding this amount in the remediation contingency fund means that in nine out of every ten years the actual bill will be fully covered. Costs exceed this level only in genuinely extreme years, the kind driven by a large sabotage event coinciding with difficult weather or extended shutdown
Presenting the P90 figure to the Board as the contingency reserve is a more defensible position than any single-point estimate, because it comes with an explicit and honest statement of the probability attached to it.
What the tornado chart tells us
The tornado chart identifies which of the three uncertain inputs is responsible for the most spread in the annual cost estimate:
- The input with the longest bar is the one that, if better controlled or better measured, would most reduce uncertainty in the budget
- If per-incident remediation cost is the dominant driver, the priority action is improving cost capture in JIV reports so future simulations are fitted on a broader and more representative cost sample
- If annual incident count is the dominant driver, the priority action is operational controls that reduce incident frequency rather than data collection improvements
This finding has a direct implication for where the department should invest its data quality effort ahead of the 2027 planning cycle.
Recommendation feeding into Section 10
The P90 figure from this simulation is the recommended remediation contingency reserve for inclusion in the 2026 annual operating plan. The tornado chart identifies the single input that most warrants tighter data collection or operational control heading into 2026. Both outputs are carried forward into the integrated recommendation in Section 10.
7. Technique 3: Advanced Forecasting with Prophet
7.1 Technique Overview
Time series analysis is a forecasting methodology that seeks to predict what will happen in the future based on what happened in the past, by identifying patterns in data collected at uniformly spaced time intervals (Marr, 2017). The patterns that time series methods seek to identify fall into two categories:
- Trends: Systematic directional changes over time that do not repeat
- Seasonality: Patterns that repeat at regular intervals such as monthly or annual cycles
Prophet, developed by Taylor and Letham (2018) at Meta, is an advanced decomposable time series model built on an additive regression framework. It decomposes the observed series into three components:
- Trend component: A piecewise linear or logistic growth curve that models the overall direction of the series over time, with automatic detection of changepoints where the trend rate shifts
- Seasonality component: Fourier series representations of periodic patterns at annual, weekly, and daily frequencies, fitted to the data without requiring the analyst to pre-specify the seasonal period
- Error component: The residual variation not explained by trend or seasonality, assumed to be normally distributed with mean zero
Prophet’s key advantage over classical approaches such as Autoregressive Integrated Moving Average (ARIMA) is its robustness to missing data, outliers, and structural breaks in the trend. All of these are characteristic of pipeline incident data, where reporting gaps, operational shutdowns, and one-off crisis periods can distort the series. Prophet also requires no manual stationarity transformation or differencing, making it more accessible to non-specialist analysts while remaining statistically rigorous.
Model reliability is assessed through walk-forward cross-validation, a procedure in which the model is repeatedly trained on an expanding window of historical data and evaluated on a held-out forecast horizon. The Root Mean Square Error (RMSE) across folds provides an honest estimate of forecast accuracy on unseen data, rather than in-sample fit.
7.2 Operational Justification
Monthly pipeline incident counts at OERNL follow a time series structure: they are collected at uniformly spaced monthly intervals, they exhibit year-on-year trends driven by operational changes and security conditions, and they are likely to contain seasonal patterns linked to weather, community activity, and regulatory inspection cycles in the Niger Delta.
Three operational decisions make a six-month forward forecast directly useful for the 2026 planning cycle:
- Emergency response capacity sizing: The forecast provides a quantified expectation of monthly incident volumes over the first half of 2026, which determines the minimum manpower, containment boom inventory, and recovery vessel deployment that should be in place at the start of the year
- Contractor mobilisation scheduling: If the forecast identifies a peak period in Q2 or Q3 2026, contractor mobilisation can be advanced by the required lead time rather than being reactive to the actual incident spike
- Budget phasing: Monthly forecast confidence intervals allow the operations budget to be phased across the year in line with expected activity, rather than being allocated uniformly across twelve months regardless of seasonal variation
This section addresses sub-question (c) of the central analytical question: what is the projected monthly incident volume over the next six months, and what does this imply for 2026 emergency response capacity?
7.3 Analysis and Outputs
Code
# ── Step 1: Aggregate to monthly time series ─────────────────────
monthly <- incidents %>%
filter(!is.na(year_month)) %>%
group_by(year_month) %>%
summarise(
spill_bbl = sum(qty_spilled_bbl, na.rm = TRUE),
n_incidents = n(),
.groups = "drop"
) %>%
arrange(year_month)
tibble::tibble(
Parameter = c(
"Monthly periods available",
"Date range start",
"Date range end",
"Mean monthly incidents",
"Max monthly incidents"
),
Value = c(
nrow(monthly),
format(min(monthly$year_month), "%b %Y"),
format(max(monthly$year_month), "%b %Y"),
round(mean(monthly$n_incidents), 1),
max(monthly$n_incidents)
)
) %>%
kableExtra::kable(caption = "Monthly Incident Series Summary") %>%
kableExtra::kable_styling(
bootstrap_options = c("striped", "condensed"),
full_width = FALSE
) %>%
kableExtra::column_spec(1, bold = TRUE)| Parameter | Value |
|---|---|
| Monthly periods available | 75 |
| Date range start | Jan 2019 |
| Date range end | Nov 2025 |
| Mean monthly incidents | 22.6 |
| Max monthly incidents | 64 |
Code
# ── Step 2: Plot raw monthly series ──────────────────────────────
ggplot(monthly, aes(x = year_month, y = n_incidents)) +
geom_line(colour = "#2c7fb8", linewidth = 0.8) +
geom_point(colour = "#2c7fb8", size = 1.5) +
geom_smooth(method = "loess", se = TRUE,
colour = "#e05c2a", fill = "#f4c6b0",
linewidth = 0.8, alpha = 0.3) +
scale_x_date(date_breaks = "6 months", date_labels = "%b %Y") +
labs(
title = "Monthly Pipeline Incident Count (2019 to 2025)",
subtitle = "Raw series with LOESS trend overlay — input to Prophet forecasting model",
x = NULL,
y = "Number of Incidents"
) +
theme_minimal(base_size = 12) +
theme(
plot.title = element_text(face = "bold"),
axis.text.x = element_text(angle = 45, hjust = 1),
panel.grid.minor = element_blank()
)Code
# ── Step 3: Fit Prophet model ─────────────────────────────────────
df_p <- monthly %>%
transmute(ds = as.POSIXct(year_month), y = n_incidents)
mod <- prophet(
df_p,
yearly.seasonality = TRUE,
weekly.seasonality = FALSE,
daily.seasonality = FALSE,
interval.width = 0.90
)
# ── Step 4: Generate 6-month forecast ────────────────────────────
fut <- make_future_dataframe(mod, periods = 6, freq = "month")
fcst <- predict(mod, fut)
p1 <- plot(mod, fcst) +
ggtitle("Monthly Incident Count: Prophet Forecast (6-Month Horizon)") +
labs(subtitle = "Black dots = observed | Blue line = fitted and forecast | Shaded band = 90% prediction interval") +
theme_minimal(base_size = 11) +
theme(plot.title = element_text(face = "bold"))
print(p1)Code
prophet_plot_components(mod, fcst)Code
# ── Step 5: Extract and display forecast table ────────────────────
forecast_out <- fcst %>%
filter(ds > max(df_p$ds)) %>%
transmute(
Month = format(as.Date(ds), "%B %Y"),
Forecast = round(yhat, 1),
Lower_90 = round(yhat_lower, 1),
Upper_90 = round(yhat_upper, 1)
)
kableExtra::kable(forecast_out,
col.names = c("Month", "Forecast (incidents)",
"Lower 90% CI", "Upper 90% CI"),
caption = "Table FC1: Six-Month Forward Forecast of Monthly Incident Count") %>%
kableExtra::kable_styling(
bootstrap_options = c("striped", "hover", "condensed"),
full_width = FALSE
) %>%
kableExtra::column_spec(2, bold = TRUE, color = "#2c7fb8")| Month | Forecast (incidents) | Lower 90% CI | Upper 90% CI |
|---|---|---|---|
| December 2025 | 11.5 | -7.8 | 30.0 |
| January 2026 | 21.7 | 0.8 | 40.3 |
| February 2026 | 8.9 | -10.6 | 28.2 |
| March 2026 | 13.4 | -5.4 | 33.2 |
| April 2026 | 7.1 | -12.3 | 26.7 |
| May 2026 | 14.5 | -4.2 | 34.1 |
Code
# ── Step 6: Walk-forward cross-validation ────────────────────────
cv <- cross_validation(
mod,
initial = 365 * 3,
period = 90,
horizon = 180,
units = "days"
)
pm <- performance_metrics(cv)
cv_summary <- tibble::tibble(
Metric = c("Mean RMSE across CV folds",
"Mean MAPE across CV folds",
"Mean MAE across CV folds"),
Value = c(
paste0(round(mean(pm$rmse, na.rm = TRUE), 2), " incidents/month"),
paste0(round(mean(pm$mape, na.rm = TRUE) * 100, 1), "%"),
paste0(round(mean(pm$mae, na.rm = TRUE), 2), " incidents/month")
)
)
kableExtra::kable(cv_summary,
caption = "Walk-Forward Cross-Validation Performance Summary") %>%
kableExtra::kable_styling(
bootstrap_options = c("striped", "condensed"),
full_width = FALSE
) %>%
kableExtra::column_spec(1, bold = TRUE) %>%
kableExtra::column_spec(2, bold = TRUE, color = "#2c7fb8")| Metric | Value |
|---|---|
| Mean RMSE across CV folds | 15.41 incidents/month |
| Mean MAPE across CV folds | 114.7% |
| Mean MAE across CV folds | 12.73 incidents/month |
Code
# ── Step 7: Plot RMSE over forecast horizon ───────────────────────
pm %>%
mutate(horizon_days = as.numeric(horizon, units = "days")) %>%
ggplot(aes(x = horizon_days, y = rmse)) +
geom_line(colour = "#2c7fb8", linewidth = 0.9) +
geom_point(colour = "#2c7fb8", size = 2) +
geom_smooth(method = "loess", se = FALSE,
colour = "#e05c2a", linewidth = 0.7,
linetype = "dashed") +
labs(
title = "Walk-Forward Cross-Validation: RMSE by Forecast Horizon",
subtitle = "RMSE increasing with horizon is expected — confirms model degrades gracefully rather than catastrophically",
x = "Forecast Horizon (Days)",
y = "RMSE (Incidents per Month)"
) +
theme_minimal(base_size = 12) +
theme(
plot.title = element_text(face = "bold"),
panel.grid.minor = element_blank()
)Code
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.ticker as mticker
from prophet import ProphetImporting plotly failed. Interactive plots will not work.
Code
from prophet.diagnostics import cross_validation, performance_metrics
# ── Step 1: Aggregate to monthly series ──────────────────────────
inc_py = r.incidents_py
inc_py["year_month_dt"] = pd.to_datetime(
inc_py["year_month_str"], errors="coerce"
)
monthly_py = (
inc_py.dropna(subset=["year_month_dt"])
.groupby("year_month_dt")
.agg(
spill_bbl = ("qty_spilled_bbl", "sum"),
n_incidents = ("ref_no", "count")
)
.reset_index()
.rename(columns={"year_month_dt": "ds", "n_incidents": "y"})
.sort_values("ds")
)
print(f"Monthly periods available: {len(monthly_py)}")Monthly periods available: 75
Code
print(f"Date range: {monthly_py['ds'].min().strftime('%b %Y')} "
f"to {monthly_py['ds'].max().strftime('%b %Y')}")Date range: Jan 2019 to Nov 2025
Code
print(f"Mean monthly incidents: {monthly_py['y'].mean():.1f}")Mean monthly incidents: 22.6
Code
# ── Step 2: Plot raw monthly series ──────────────────────────────
fig, ax = plt.subplots(figsize=(11, 4))
ax.plot(monthly_py["ds"], monthly_py["y"],
color="#2c7fb8", linewidth=0.9,
marker="o", markersize=3)[<matplotlib.lines.Line2D object at 0x00000173F6A6C3D0>]
Code
ax.set_title("Monthly Pipeline Incident Count (2019 to 2025)",
fontsize=12, fontweight="bold")Text(0.5, 1.0, 'Monthly Pipeline Incident Count (2019 to 2025)')
Code
ax.set_xlabel(None)Text(0.5, 0, '')
Code
ax.set_ylabel("Number of Incidents")Text(0, 0.5, 'Number of Incidents')
Code
ax.xaxis.set_major_locator(
plt.matplotlib.dates.MonthLocator(bymonth=[1, 7])
)
ax.xaxis.set_major_formatter(
plt.matplotlib.dates.DateFormatter("%b %Y")
)
plt.xticks(rotation=45)(array([17897., 18078., 18262., 18444., 18628., 18809., 18993., 19174.,
19358., 19539., 19723., 19905., 20089., 20270., 20454.]), [Text(17897.0, 0, 'Jan 2019'), Text(18078.0, 0, 'Jul 2019'), Text(18262.0, 0, 'Jan 2020'), Text(18444.0, 0, 'Jul 2020'), Text(18628.0, 0, 'Jan 2021'), Text(18809.0, 0, 'Jul 2021'), Text(18993.0, 0, 'Jan 2022'), Text(19174.0, 0, 'Jul 2022'), Text(19358.0, 0, 'Jan 2023'), Text(19539.0, 0, 'Jul 2023'), Text(19723.0, 0, 'Jan 2024'), Text(19905.0, 0, 'Jul 2024'), Text(20089.0, 0, 'Jan 2025'), Text(20270.0, 0, 'Jul 2025'), Text(20454.0, 0, 'Jan 2026')])
Code
plt.tight_layout()
plt.show()Code
# ── Step 3: Fit Prophet model ─────────────────────────────────────
m = Prophet(
yearly_seasonality = True,
weekly_seasonality = False,
daily_seasonality = False,
interval_width = 0.90
)
m.fit(monthly_py[["ds", "y"]])<prophet.forecaster.Prophet object at 0x0000017380378190>
Code
# ── Step 4: Generate 6-month forecast ────────────────────────────
future = m.make_future_dataframe(periods=6, freq="MS")
fcst_py = m.predict(future)
fig1 = m.plot(fcst_py)
fig1.suptitle(
"Monthly Incident Count: Prophet Forecast (6-Month Horizon)\n"
"Black dots = observed | Blue = fitted/forecast | Shaded = 90% CI",
fontsize=10, fontweight="bold", y=1.02)Text(0.5, 1.02, 'Monthly Incident Count: Prophet Forecast (6-Month Horizon)\nBlack dots = observed | Blue = fitted/forecast | Shaded = 90% CI')
Code
plt.tight_layout()
plt.show()Code
fig2 = m.plot_components(fcst_py)
fig2.suptitle("Prophet Model Components: Trend and Seasonality",
fontsize=10, fontweight="bold", y=1.02)Text(0.5, 1.02, 'Prophet Model Components: Trend and Seasonality')
Code
plt.tight_layout()
plt.show()Code
# ── Step 5: Forecast table ────────────────────────────────────────
fwd = fcst_py[fcst_py["ds"] > monthly_py["ds"].max()][
["ds", "yhat", "yhat_lower", "yhat_upper"]
].copy()
fwd["ds"] = fwd["ds"].dt.strftime("%B %Y")
fwd.columns = ["Month", "Forecast", "Lower 90% CI", "Upper 90% CI"]
fwd = fwd.round(1)
print("\nSix-Month Forward Forecast:")
Six-Month Forward Forecast:
Code
print(fwd.to_string(index=False)) Month Forecast Lower 90% CI Upper 90% CI
December 2025 11.5 -7.7 30.4
January 2026 21.7 2.4 40.8
February 2026 8.9 -11.0 27.6
March 2026 13.4 -7.5 31.0
April 2026 7.1 -11.6 27.1
May 2026 14.5 -4.7 33.4
Code
# ── Step 6: Walk-forward cross-validation ────────────────────────
cv_py = cross_validation(
m,
initial = "1095 days",
period = "90 days",
horizon = "180 days"
)
0%| | 0/14 [00:00<?, ?it/s]
7%|7 | 1/14 [00:00<00:04, 2.89it/s]
14%|#4 | 2/14 [00:00<00:04, 2.85it/s]
21%|##1 | 3/14 [00:01<00:04, 2.73it/s]
29%|##8 | 4/14 [00:01<00:03, 2.79it/s]
36%|###5 | 5/14 [00:01<00:03, 2.84it/s]
43%|####2 | 6/14 [00:02<00:02, 2.77it/s]
50%|##### | 7/14 [00:02<00:02, 2.78it/s]
57%|#####7 | 8/14 [00:03<00:02, 2.44it/s]
64%|######4 | 9/14 [00:03<00:02, 2.41it/s]
71%|#######1 | 10/14 [00:03<00:01, 2.34it/s]
79%|#######8 | 11/14 [00:04<00:01, 2.27it/s]
86%|########5 | 12/14 [00:04<00:00, 2.22it/s]
93%|#########2| 13/14 [00:05<00:00, 2.30it/s]
100%|##########| 14/14 [00:05<00:00, 2.47it/s]
100%|##########| 14/14 [00:05<00:00, 2.51it/s]
Code
pm_py = performance_metrics(cv_py)
print("\nWalk-forward CV performance metrics (first rows):")
Walk-forward CV performance metrics (first rows):
Code
print(pm_py[["horizon", "rmse", "mae", "mape"]]
.head(8).to_string(index=False))horizon rmse mae mape
18 days 9.665885 8.219469 0.372737
19 days 7.259892 6.049043 0.302977
20 days 7.231829 6.029140 0.345289
22 days 9.280981 7.396357 0.522668
24 days 10.095505 8.411095 0.574285
25 days 12.683927 11.197466 0.980924
27 days 21.006788 16.998447 1.061973
40 days 21.005940 16.988272 1.073142
Code
print(f"\nMean RMSE across folds: {pm_py['rmse'].mean():.2f} incidents/month")
Mean RMSE across folds: 15.41 incidents/month
Code
print(f"Mean MAPE across folds: {pm_py['mape'].mean()*100:.1f}%")Mean MAPE across folds: 114.7%
Code
# ── Step 7: RMSE over horizon plot ───────────────────────────────
pm_py["horizon_days"] = pm_py["horizon"].dt.days
fig, ax = plt.subplots(figsize=(9, 4))
ax.plot(pm_py["horizon_days"], pm_py["rmse"],
color="#2c7fb8", linewidth=1.2,
marker="o", markersize=4)[<matplotlib.lines.Line2D object at 0x00000173805D97D0>]
Code
ax.set_title("Walk-Forward Cross-Validation: RMSE by Forecast Horizon",
fontsize=11, fontweight="bold")Text(0.5, 1.0, 'Walk-Forward Cross-Validation: RMSE by Forecast Horizon')
Code
ax.set_xlabel("Forecast Horizon (Days)")Text(0.5, 0, 'Forecast Horizon (Days)')
Code
ax.set_ylabel("RMSE (Incidents per Month)")Text(0, 0.5, 'RMSE (Incidents per Month)')
Code
plt.tight_layout()
plt.show()7.4 Management Interpretation
What the forecast is and is not
The Prophet model has learned the historical pattern of monthly pipeline incidents at OERNL — how many tend to occur in a typical month, how the number has trended over seven years, and whether certain times of year consistently produce more or fewer incidents. It uses that learned pattern to project forward six months.
It is important to be clear about what this forecast is and is not:
- It is not a guarantee of what will happen. Pipeline sabotage in the Niger Delta is driven by social, economic, and security dynamics that no time series model can predict with certainty
- It is an evidence-based expectation of what is most likely, combined with an honest statement of the uncertainty around that expectation expressed as a 90% confidence interval
- It is most useful as a capacity planning tool rather than a precise prediction, because the upper bound of the confidence interval provides a more appropriate basis for resource sizing than the central estimate
How to read the forecast table
Each row in the forecast table gives three numbers for a future month:
- Central forecast: The most likely outcome based on historical trend and seasonality patterns
- Lower 90% bound: The floor that actual incident counts will exceed nine times out of ten
- Upper 90% bound: The ceiling that actual counts will stay below nine times out of ten
A wide band between the lower and upper bounds means the model is less certain about that month. This is honest, not a weakness. A narrow band means the pattern is stable and predictable. The upper bound is the number to use for capacity planning.
What the component plots reveal
The component decomposition plots provide two pieces of operational intelligence:
- The trend component shows whether incident frequency has been rising, falling, or holding steady over seven years. An upward trend heading into 2026 is a direct signal that standing capacity from 2025 will be insufficient for the year ahead
- The seasonality component shows which months historically produce elevated incident counts. If the peak months in the seasonality plot align with the highest forecast values in the table, that confirms the peak is structural and predictable. Contractor mobilisation should be timed to arrive before that peak, not after it
What the cross-validation result means
The RMSE from walk-forward cross-validation is the single most important diagnostic output in this section. It measures how wrong the model typically is when forecasting on historical data it has never seen before. A mean RMSE of 15.41 incidents per month means the model’s monthly forecast is typically within 15 incidents of the actual count on unseen periods.
This figure should be used directly in the capacity planning calculation. If the forecast for a given month is 20 incidents, the response capacity should be sized for approximately 35 incidents at the upper end of the confidence range, not exactly 20. Planning to the central estimate with no buffer is equivalent to planning for a median outcome while accepting a coin-flip chance of being under-resourced.
Recommendation feeding into Section 10
The six-month forecast directly determines the 2026 emergency response capacity recommendation in Section 10. The upper bound of the 90% confidence interval for the peak forecast month is the planning figure that should be used for manpower and equipment sizing. The cost of being under-resourced during a peak incident month exceeds the cost of holding excess capacity during a quiet one.
8. Technique 4: Asset Survival Analysis
8.1 Technique Overview
Survival analysis is a statistical method for modelling the time elapsed until an event of interest occurs. Originally developed in clinical and actuarial research, it has been widely adopted in engineering reliability analysis, customer churn analytics, and asset management — anywhere the central question is not simply whether an event will occur, but when (Kleinbaum and Klein, 2012). In the context of asset management, the technique is directly analogous to customer churn analytics: just as customer churn analytics assesses how long a customer stays before leaving (Marr, 2017), asset survival analysis assesses how long a facility remains incident-free between consecutive failures.
This analysis applies the technique to inter-arrival times, specifically the number of days between consecutive incidents at the same facility, rather than time to first incident. This approach uses all 1,695 incident records rather than only the first event per facility, producing a richer and more statistically powerful dataset for modelling recurring failure patterns across the portfolio.
Two complementary methods are applied:
- Kaplan-Meier (KM) estimator: A non-parametric method that produces an empirical survival curve showing the probability that a facility remains incident-free beyond any given number of days after a previous incident. The KM estimator makes no assumptions about the underlying failure time distribution, which is appropriate given the diversity of facility types and operating conditions in the OERNL portfolio. Curves are stratified by facility category to reveal which asset classes exhibit the shortest inter-incident times
-
Cox Proportional Hazards (Cox PH) model: A semi-parametric regression model that estimates the effect of covariates — facility category and state — on the hazard rate, defined as the instantaneous risk of the next incident given that no incident has yet occurred since the last one. The model expresses results as hazard ratios: a hazard ratio above 1 for a covariate means that category or state is associated with a higher rate of recurrence relative to the reference group, controlling for all other variables. The proportional hazards assumption is verified using Schoenfeld residuals via the
cox.zphtest
Censoring is handled correctly throughout. Facilities that have not experienced a subsequent incident by the end of the observation window contribute their observed inter-incident time as a right-censored observation, ensuring no data is discarded.
8.2 Operational Justification
The current approach to pipeline integrity inspection prioritisation at OERNL relies on engineering judgement and periodic field reports, with no systematic basis for comparing recurrence risk across the 446 facilities in the portfolio.
Inter-incident survival analysis addresses this gap in three specific ways:
- Recurrence-based prioritisation: Facilities with the shortest median inter-incident time, those that fail most frequently relative to their own history, are flagged for priority inspection scheduling in the 2026 maintenance programme. This is more operationally meaningful than ranking by total incident count alone, because it identifies facilities where failure is accelerating rather than those that have simply been in service longest
- Covariate-adjusted hazard ranking: The Cox model identifies which facility categories and states carry statistically significantly higher hazard rates after controlling for all other factors. This allows the maintenance budget to be allocated by evidence-based risk rather than by administrative area or operational familiarity
- Defensible business case: Hazard ratios with confidence intervals provide statistically defensible evidence when presenting integrity intervention projects to Finance and the Board, replacing subjective risk assertions with quantified, model-derived estimates
This section addresses sub-question (d) of the central analytical question: which facilities exhibit the shortest expected time to next incident, and which should be elevated to the top quartile of the 2026 integrity inspection schedule?
8.3 Analysis and Outputs
Code
if ("survminer" %in% (.packages())) detach("package:survminer", unload = TRUE)
# ── Step 1: Build inter-incident time dataset ─────────────────────
tte <- incidents %>%
filter(!is.na(facility), !is.na(date_incident)) %>%
arrange(facility, date_incident) %>%
group_by(facility) %>%
mutate(
tte_days = as.integer(date_incident - lag(date_incident)),
event = 1L
) %>%
ungroup() %>%
filter(!is.na(tte_days), tte_days > 0)
tibble::tibble(
Metric = c("Inter-incident intervals available",
"Unique facilities with repeat incidents",
"Median inter-incident time (days)",
"Mean inter-incident time (days)"),
Value = c(nrow(tte),
n_distinct(tte$facility),
median(tte$tte_days, na.rm = TRUE),
round(mean(tte$tte_days, na.rm = TRUE), 1))
) %>%
kableExtra::kable(caption = "Inter-Incident Time Dataset Summary") %>%
kableExtra::kable_styling(
bootstrap_options = c("striped", "condensed"),
full_width = FALSE
) %>%
kableExtra::column_spec(1, bold = TRUE)| Metric | Value |
|---|---|
| Inter-incident intervals available | 1084.0 |
| Unique facilities with repeat incidents | 147.0 |
| Median inter-incident time (days) | 13.0 |
| Mean inter-incident time (days) | 117.9 |
Code
# ── Step 2: Distribution of inter-incident times ──────────────────
ggplot(tte, aes(x = tte_days)) +
geom_histogram(bins = 60, fill = "#2c7fb8",
colour = "white", linewidth = 0.2) +
geom_vline(xintercept = median(tte$tte_days, na.rm = TRUE),
colour = "#e05c2a", linetype = "dashed",
linewidth = 0.9) +
annotate("text",
x = median(tte$tte_days, na.rm = TRUE) + 5,
y = Inf, vjust = 1.5, hjust = 0,
label = paste0("Median = ",
median(tte$tte_days, na.rm = TRUE), " days"),
colour = "#e05c2a", fontface = "italic", size = 3.5) +
scale_x_continuous(limits = c(0, 365),
breaks = seq(0, 365, by = 30)) +
labs(
title = "Distribution of Inter-Incident Times (Days Between Consecutive Incidents)",
subtitle = "Truncated at 365 days for readability",
x = "Days Between Consecutive Incidents at Same Facility",
y = "Number of Intervals"
) +
theme_minimal(base_size = 12) +
theme(
plot.title = element_text(face = "bold"),
panel.grid.minor = element_blank()
)Code
# ── Step 3: KM by facility category ──────────────────────────────
top_cats <- tte %>%
filter(!is.na(facility_cat)) %>%
count(facility_cat, sort = TRUE) %>%
slice_head(n = 5) %>%
pull(facility_cat)
tte_cats <- tte %>% filter(facility_cat %in% top_cats)
km_cat <- survfit(Surv(tte_days, event) ~ facility_cat,
data = tte_cats)
km_cat_tidy <- broom::tidy(km_cat) %>%
mutate(strata = stringr::str_remove(strata, "facility_cat="))
ggplot(km_cat_tidy, aes(x = time, y = estimate,
colour = strata, fill = strata)) +
geom_step(linewidth = 0.9) +
geom_ribbon(aes(ymin = conf.low, ymax = conf.high),
alpha = 0.08, colour = NA) +
scale_colour_brewer(palette = "Set1") +
scale_fill_brewer(palette = "Set1") +
coord_cartesian(xlim = c(0, 500)) +
labs(
title = "Kaplan-Meier: Time Between Incidents by Facility Category",
subtitle = "Steeper curve = shorter inter-incident time = higher recurrence risk",
x = "Days Since Previous Incident",
y = "Probability of Remaining Incident-Free",
colour = "Facility Category",
fill = "Facility Category"
) +
theme_minimal(base_size = 11) +
theme(
plot.title = element_text(face = "bold"),
legend.position = "bottom"
)Code
# ── Step 4: KM by operational area ───────────────────────────────
tte_area <- tte %>% filter(area %in% c("LAR", "SAR"))
km_area <- survfit(Surv(tte_days, event) ~ area,
data = tte_area)
km_area_tidy <- broom::tidy(km_area) %>%
mutate(strata = stringr::str_remove(strata, "area="))
lr_test <- survdiff(Surv(tte_days, event) ~ area, data = tte_area)
lr_pval <- round(1 - pchisq(lr_test$chisq, df = 1), 4)
ggplot(km_area_tidy, aes(x = time, y = estimate,
colour = strata, fill = strata)) +
geom_step(linewidth = 0.9) +
geom_ribbon(aes(ymin = conf.low, ymax = conf.high),
alpha = 0.10, colour = NA) +
scale_colour_manual(values = c("LAR" = "#2c7fb8",
"SAR" = "#e05c2a")) +
scale_fill_manual(values = c("LAR" = "#2c7fb8",
"SAR" = "#e05c2a")) +
coord_cartesian(xlim = c(0, 500)) +
annotate("text", x = 400, y = 0.95,
label = paste0("Log-rank p = ", lr_pval),
size = 3.5, fontface = "italic", colour = "#333333") +
labs(
title = "Kaplan-Meier: Time Between Incidents by Operational Area",
subtitle = "Log-rank test p-value tests whether survival difference is statistically significant",
x = "Days Since Previous Incident",
y = "Probability of Remaining Incident-Free",
colour = "Area",
fill = "Area"
) +
theme_minimal(base_size = 11) +
theme(
plot.title = element_text(face = "bold"),
legend.position = "bottom"
)Code
# ── Step 5: Cox Proportional Hazards model ────────────────────────
cox_data <- tte %>%
filter(!is.na(facility_cat), !is.na(state),
facility_cat %in% top_cats)
cox <- coxph(
Surv(tte_days, event) ~ facility_cat + state,
data = cox_data
)
# ── Step 6: Hazard ratio table ────────────────────────────────────
cox_tidy <- broom::tidy(cox, exponentiate = TRUE,
conf.int = TRUE) %>%
mutate(
term = stringr::str_remove(term, "facility_cat|state"),
signif = case_when(
p.value < 0.001 ~ "***",
p.value < 0.01 ~ "**",
p.value < 0.05 ~ "*",
p.value < 0.10 ~ ".",
TRUE ~ ""
)
) %>%
dplyr::select(term, estimate, conf.low, conf.high,
p.value, signif) %>%
mutate(across(c(estimate, conf.low, conf.high), ~ round(.x, 3)),
p.value = round(p.value, 4))
kableExtra::kable(cox_tidy,
col.names = c("Covariate", "Hazard Ratio",
"Lower 95% CI", "Upper 95% CI",
"p-value", "Significance"),
caption = "Table SA1: Cox PH Model — Hazard Ratios by Facility Category and State") %>%
kableExtra::kable_styling(
bootstrap_options = c("striped", "hover", "condensed"),
full_width = TRUE
) %>%
kableExtra::column_spec(2, bold = TRUE, color = "#2c7fb8") %>%
kableExtra::footnote(
general = "Reference: lowest-risk facility type and state. HR > 1 = higher recurrence hazard. *** p<0.001, ** p<0.01, * p<0.05"
)| Covariate | Hazard Ratio | Lower 95% CI | Upper 95% CI | p-value | Significance |
|---|---|---|---|---|---|
| Manifold | 1.315 | 0.183 | 9.455 | 0.7854 | |
| Pipeline (Gas) | 0.417 | 0.178 | 0.974 | 0.0433 | * |
| Pipeline (Oil) | 2.262 | 1.766 | 2.896 | 0.0000 | *** |
| Well Head | 0.281 | 0.088 | 0.896 | 0.0319 | * |
| Delta | 0.273 | 0.135 | 0.554 | 0.0003 | *** |
| Imo | 0.162 | 0.066 | 0.398 | 0.0001 | *** |
| Rivers | 0.544 | 0.433 | 0.685 | 0.0000 | *** |
| Note: | |||||
| Reference: lowest-risk facility type and state. HR > 1 = higher recurrence hazard. *** p<0.001, ** p<0.01, * p<0.05 |
Code
# ── Step 7: Proportional hazards assumption check ────────────────
ph_test <- cox.zph(cox)
ph_df_summary <- as.data.frame(ph_test$table) %>%
tibble::rownames_to_column("Covariate") %>%
dplyr::rename(
`Chi-squared` = chisq,
`Degrees of Freedom` = df,
`p-value` = p
) %>%
dplyr::mutate(
`Chi-squared` = round(`Chi-squared`, 3),
`p-value` = round(`p-value`, 5),
Result = ifelse(`p-value` > 0.05,
"Assumption holds",
"Potential violation")
)
kableExtra::kable(ph_df_summary,
caption = "Schoenfeld Residuals: Proportional Hazards Assumption Test") %>%
kableExtra::kable_styling(
bootstrap_options = c("striped", "condensed"),
full_width = FALSE
) %>%
kableExtra::column_spec(1, bold = TRUE) %>%
kableExtra::column_spec(5,
color = ifelse(ph_df_summary$`p-value` > 0.05,
"#27ae60", "#e05c2a"))| Covariate | Chi-squared | Degrees of Freedom | p-value | Result |
|---|---|---|---|---|
| facility_cat | 14.858 | 4 | 0.00501 | Potential violation |
| state | 19.183 | 3 | 0.00025 | Potential violation |
| GLOBAL | 27.850 | 7 | 0.00023 | Potential violation |
Code
ph_mat <- ph_test$y
ph_time <- ph_test$time
if (!is.null(ph_mat) && nrow(ph_mat) > 0 &&
length(ph_time) > 0) {
ph_df <- as.data.frame(ph_mat) %>%
mutate(time = ph_time) %>%
tidyr::pivot_longer(
cols = -time,
names_to = "covariate",
values_to = "residual"
) %>%
filter(!is.na(residual), !is.na(time))
print(
ggplot(ph_df, aes(x = time, y = residual)) +
geom_point(alpha = 0.3, size = 0.8, colour = "#2c7fb8") +
geom_smooth(method = "loess", se = TRUE,
colour = "#e05c2a", fill = "#f4c6b0",
linewidth = 0.8) +
geom_hline(yintercept = 0, linetype = "dashed",
colour = "grey50") +
facet_wrap(~ covariate, scales = "free_y", ncol = 2) +
labs(
title = "Schoenfeld Residuals: Proportional Hazards Assumption Check",
subtitle = "Flat LOESS line = assumption satisfied | Trending line = assumption may be violated",
x = "Time",
y = "Scaled Schoenfeld Residual"
) +
theme_minimal(base_size = 10) +
theme(plot.title = element_text(face = "bold"))
)
} else {
cat("Schoenfeld residual plot skipped — insufficient data.\n")
}Code
# ── Step 8: High-risk facility summary table ──────────────────────
high_risk <- tte %>%
group_by(facility, state, area, facility_cat) %>%
summarise(
n_intervals = n(),
median_tte_days = median(tte_days, na.rm = TRUE),
mean_tte_days = round(mean(tte_days, na.rm = TRUE), 1),
.groups = "drop"
) %>%
filter(n_intervals >= 3) %>%
arrange(median_tte_days) %>%
slice_head(n = 20) %>%
rename(
Facility = facility,
State = state,
Area = area,
`Facility Type` = facility_cat,
`Repeat Intervals` = n_intervals,
`Median Days Between Incidents` = median_tte_days,
`Mean Days Between Incidents` = mean_tte_days
)
kableExtra::kable(high_risk,
caption = "Table SA2: Top 20 Highest-Recurrence Facilities (Minimum 3 Inter-Incident Intervals)") %>%
kableExtra::kable_styling(
bootstrap_options = c("striped", "hover", "condensed"),
full_width = TRUE
) %>%
kableExtra::column_spec(6, bold = TRUE, color = "#e05c2a") %>%
kableExtra::row_spec(1:5, background = "#fef3f0")| Facility | State | Area | Facility Type | Repeat Intervals | Median Days Between Incidents | Mean Days Between Incidents |
|---|---|---|---|---|---|---|
| 24'' Ogoda - Brass Pipeline | NA | NA | NA | 3 | 1.0 | 4.7 |
| Ebocha 4Ls & Ss flowline | Rivers | LAR | Flowline | 4 | 1.5 | 1.5 |
| 18" Tebidaba/Brass Pipeline | Bayelsa | SAR | Pipeline (Oil) | 131 | 2.0 | 6.8 |
| 10" Clough Creek/Tebidaba Pipeline | Bayelsa | SAR | Pipeline (Oil) | 141 | 3.0 | 5.6 |
| 6'' Tuomo/Ogboinbiri pipeline | Bayelsa | SAR | NA | 25 | 4.0 | 38.1 |
| 18" Tebidaba Brass pipeline | NA | NA | NA | 13 | 5.0 | 81.2 |
| 18" Tebidaba/Brass Pipeline | Bayelsa | SAR | NA | 86 | 6.0 | 17.2 |
| 10" Clough Creek/Tebidaba Pipeline | Bayelsa | SAR | NA | 102 | 7.0 | 13.5 |
| 14'' Ogboinbiri/ Tebidaba P/L | NA | NA | NA | 6 | 9.5 | 79.3 |
| 14" Ogboinbiri/Tebidaba pipeline | Bayelsa | SAR | NA | 29 | 10.0 | 37.7 |
| 24" Ogoda/Brass Pipeline | Bayelsa | SAR | NA | 3 | 10.0 | 17.3 |
| 18" EOC/Ogoda Pipeline | Rivers | LAR | Pipeline (Oil) | 18 | 10.5 | 45.4 |
| 6'' Tuomo/Ogboinbiri flowline | Bayelsa | SAR | Flowline | 10 | 10.5 | 54.3 |
| 4'' Oshie 15Ss flowline | Rivers | LAR | NA | 13 | 11.0 | 78.1 |
| 6'' Tuomo/Ogboinbiri pipeline | Bayelsa | LAR | NA | 9 | 11.0 | 41.3 |
| Obiafu 41/42T flowline | Rivers | LAR | NA | 4 | 12.0 | 21.0 |
| 14'' Akri/Ebocha pipeline | Imo | LAR | NA | 7 | 13.0 | 43.1 |
| 24" Ogoda/Brass Pipeline | Bayelsa | SAR | Pipeline (Oil) | 6 | 13.5 | 33.7 |
| 24" Ogoda/Brass Pipeline | Rivers | LAR | NA | 29 | 15.0 | 38.8 |
| Ebocha 5T flowline | Rivers | LAR | Flowline | 7 | 16.0 | 249.7 |
Code
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from lifelines import KaplanMeierFitter, CoxPHFitter
from lifelines.statistics import logrank_test
inc_py = r.incidents_py
inc_py["date_incident_dt"] = pd.to_datetime(
inc_py["date_incident_str"], errors="coerce"
)
tte_py = (
inc_py.dropna(subset=["facility", "date_incident_dt"])
.sort_values(["facility", "date_incident_dt"])
.copy()
)
tte_py["prev_incident"] = tte_py.groupby("facility")[
"date_incident_dt"].shift(1)
tte_py["tte_days"] = (
tte_py["date_incident_dt"] - tte_py["prev_incident"]).dt.days
tte_py = tte_py.dropna(subset=["tte_days"])
tte_py = tte_py[tte_py["tte_days"] > 0].copy()
tte_py["event"] = 1
print(f"Inter-incident intervals available: {len(tte_py):,}")Inter-incident intervals available: 1,084
Code
print(f"Unique facilities with repeat incidents: {tte_py['facility'].nunique():,}")Unique facilities with repeat incidents: 147
Code
print(f"Median inter-incident time: {tte_py['tte_days'].median():.0f} days")Median inter-incident time: 13 days
Code
print(f"Mean inter-incident time: {tte_py['tte_days'].mean():.1f} days")Mean inter-incident time: 117.9 days
Code
# ── Step 2: KM curves by facility category ────────────────────────
top_cats_py = (
tte_py.dropna(subset=["facility_cat"])
.groupby("facility_cat").size()
.sort_values(ascending=False)
.head(5).index.tolist()
)
fig, ax = plt.subplots(figsize=(10, 5))
kmf_py = KaplanMeierFitter()
palette = ["#2c7fb8", "#e05c2a", "#27ae60", "#9b59b6", "#e67e22"]
for i, cat in enumerate(top_cats_py):
sub = tte_py[tte_py["facility_cat"] == cat]
if len(sub) >= 5:
kmf_py.fit(sub["tte_days"],
event_observed=sub["event"],
label=str(cat))
kmf_py.plot_survival_function(
ax=ax, ci_show=False,
color=palette[i % len(palette)])<lifelines.KaplanMeierFitter:"Pipeline (Oil)", fitted with 367 total observations, 0 right-censored observations>
<Axes: xlabel='timeline'>
<lifelines.KaplanMeierFitter:"Flowline", fitted with 113 total observations, 0 right-censored observations>
<Axes: xlabel='timeline'>
<lifelines.KaplanMeierFitter:"Pipeline (Gas)", fitted with 6 total observations, 0 right-censored observations>
<Axes: xlabel='timeline'>
Code
ax.set_xlim(0, 500)(0.0, 500.0)
Code
ax.set_title("Kaplan-Meier: Time Between Incidents by Facility Category",
fontsize=12, fontweight="bold")Text(0.5, 1.0, 'Kaplan-Meier: Time Between Incidents by Facility Category')
Code
ax.set_xlabel("Days Since Previous Incident")Text(0.5, 0, 'Days Since Previous Incident')
Code
ax.set_ylabel("Probability of Remaining Incident-Free")Text(0, 0.5, 'Probability of Remaining Incident-Free')
Code
ax.grid(True, alpha=0.3)
ax.legend(title="Facility Category", fontsize=8)<matplotlib.legend.Legend object at 0x00000173F4289DD0>
Code
plt.tight_layout()
plt.show()Code
# ── Step 3: KM by operational area with log-rank test ────────────
lar = tte_py[tte_py["area"] == "LAR"]
sar = tte_py[tte_py["area"] == "SAR"]
lr = logrank_test(
lar["tte_days"], sar["tte_days"],
event_observed_A=lar["event"],
event_observed_B=sar["event"]
)
fig, ax = plt.subplots(figsize=(10, 5))
for area, data, col in [("LAR", lar, "#2c7fb8"),
("SAR", sar, "#e05c2a")]:
kmf_a = KaplanMeierFitter()
kmf_a.fit(data["tte_days"],
event_observed=data["event"],
label=area)
kmf_a.plot_survival_function(ax=ax, ci_show=True,
color=col, alpha=0.8)<lifelines.KaplanMeierFitter:"LAR", fitted with 446 total observations, 0 right-censored observations>
<Axes: xlabel='timeline'>
<lifelines.KaplanMeierFitter:"SAR", fitted with 611 total observations, 0 right-censored observations>
<Axes: xlabel='timeline'>
Code
ax.set_xlim(0, 500)(0.0, 500.0)
Code
ax.set_title(
"Kaplan-Meier: Time Between Incidents by Operational Area (LAR vs SAR)",
fontsize=12, fontweight="bold")Text(0.5, 1.0, 'Kaplan-Meier: Time Between Incidents by Operational Area (LAR vs SAR)')
Code
ax.set_xlabel("Days Since Previous Incident")Text(0.5, 0, 'Days Since Previous Incident')
Code
ax.set_ylabel("Probability of Remaining Incident-Free")Text(0, 0.5, 'Probability of Remaining Incident-Free')
Code
ax.text(0.98, 0.98, f"Log-rank p = {lr.p_value:.4f}",
transform=ax.transAxes, ha="right", va="top",
fontsize=10,
bbox=dict(boxstyle="round,pad=0.3",
facecolor="white", alpha=0.8))Text(0.98, 0.98, 'Log-rank p = 0.0000')
Code
ax.grid(True, alpha=0.3)
ax.legend()<matplotlib.legend.Legend object at 0x00000173F4243A90>
Code
plt.tight_layout()
plt.show()Code
print(f"\nLog-rank LAR vs SAR: statistic={lr.test_statistic:.3f}, "
f"p={lr.p_value:.4f}")
Log-rank LAR vs SAR: statistic=240.625, p=0.0000
Code
# ── Step 4: Cox Proportional Hazards model ────────────────────────
cox_data_py = (
tte_py[["tte_days", "event", "facility_cat", "state"]]
.dropna()
.query("facility_cat in @top_cats_py")
.copy()
)
cox_dummies = pd.get_dummies(
cox_data_py,
columns=["facility_cat", "state"],
drop_first=True
)
cph = CoxPHFitter()
cph.fit(cox_dummies, duration_col="tte_days", event_col="event")<lifelines.CoxPHFitter: fitted with 490 total observations, 0 right-censored observations>
Code
print("\nCox PH Model Summary:")
Cox PH Model Summary:
Code
cph.print_summary()<lifelines.CoxPHFitter: fitted with 490 total observations, 0 right-censored observations>
duration col = 'tte_days'
event col = 'event'
baseline estimation = breslow
number of observations = 490
number of events observed = 490
partial log-likelihood = -2465.47
time fit was run = 2026-05-12 14:59:53 UTC
---
coef exp(coef) se(coef) coef lower 95% coef upper 95% exp(coef) lower 95% exp(coef) upper 95%
covariate
facility_cat_Manifold 0.27 1.32 1.01 -1.70 2.25 0.18 9.45
facility_cat_Pipeline (Gas) -0.88 0.42 0.43 -1.72 -0.03 0.18 0.97
facility_cat_Pipeline (Oil) 0.82 2.26 0.13 0.57 1.06 1.77 2.90
facility_cat_Well Head -1.27 0.28 0.59 -2.43 -0.11 0.09 0.90
state_Delta -1.30 0.27 0.36 -2.01 -0.59 0.13 0.55
state_Imo -1.82 0.16 0.46 -2.71 -0.92 0.07 0.40
state_Rivers -0.61 0.54 0.12 -0.84 -0.38 0.43 0.68
cmp to z p -log2(p)
covariate
facility_cat_Manifold 0.00 0.27 0.79 0.35
facility_cat_Pipeline (Gas) 0.00 -2.02 0.04 4.53
facility_cat_Pipeline (Oil) 0.00 6.47 <0.005 33.26
facility_cat_Well Head 0.00 -2.15 0.03 4.97
state_Delta 0.00 -3.59 <0.005 11.59
state_Imo 0.00 -3.98 <0.005 13.81
state_Rivers 0.00 -5.19 <0.005 22.19
---
Concordance = 0.68
Partial AIC = 4944.95
log-likelihood ratio test = 167.60 on 7 df
-log2(p) of ll-ratio test = 106.62
Code
fig, ax = plt.subplots(figsize=(9, 5))
cph.plot(ax=ax)<Axes: xlabel='log(HR) (95% CI)'>
Code
ax.set_title(
"Cox PH Model: Hazard Ratios by Facility Category and State",
fontsize=11, fontweight="bold")Text(0.5, 1.0, 'Cox PH Model: Hazard Ratios by Facility Category and State')
Code
ax.axvline(0, linestyle="--", color="grey", linewidth=0.8)<matplotlib.lines.Line2D object at 0x00000173F4132450>
Code
plt.tight_layout()
plt.show()Code
# ── Step 5: High-recurrence facility table ────────────────────────
high_risk_py = (
tte_py.groupby(["facility", "state", "area", "facility_cat"])
.agg(
n_intervals = ("tte_days", "count"),
median_tte_days = ("tte_days", "median"),
mean_tte_days = ("tte_days", "mean")
)
.reset_index()
.query("n_intervals >= 3")
.sort_values("median_tte_days")
.head(20)
.round(1)
)
print("\nTop 20 Highest-Recurrence Facilities:")
Top 20 Highest-Recurrence Facilities:
Code
print(high_risk_py.to_string(index=False)) facility state area facility_cat n_intervals median_tte_days mean_tte_days
Ebocha 4Ls & Ss flowline Rivers LAR Flowline 4 1.5 1.5
18" Tebidaba/Brass Pipeline Bayelsa SAR Pipeline (Oil) 131 2.0 6.8
10" Clough Creek/Tebidaba Pipeline Bayelsa SAR Pipeline (Oil) 141 3.0 5.6
18" EOC/Ogoda Pipeline Rivers LAR Pipeline (Oil) 18 10.5 45.4
6'' Tuomo/Ogboinbiri flowline Bayelsa SAR Flowline 10 10.5 54.3
24" Ogoda/Brass Pipeline Bayelsa SAR Pipeline (Oil) 6 13.5 33.7
Ebocha 5T flowline Rivers LAR Flowline 7 16.0 249.7
24" Ogoda/Brass Pipeline Rivers LAR Pipeline (Oil) 19 18.0 29.1
14" Ogboinbiri/Tebidaba pipeline Bayelsa SAR Pipeline (Oil) 14 20.0 84.9
Ebegoro 11T flowline Rivers LAR Flowline 9 28.0 27.6
Ebegoro 12Ls flowline Rivers LAR Flowline 6 31.5 132.5
Taylor Creek 2Ls flowline Bayelsa LAR Flowline 4 45.0 44.2
Oshie 15Ss flowline Rivers LAR Flowline 5 69.0 110.2
18" Obama/Brass Pipeline Bayelsa SAR Pipeline (Oil) 6 69.5 208.7
10" Oshie/Ogoda Pipeline Rivers LAR Pipeline (Oil) 5 80.0 136.6
Taylor Creek 2Ss flowline Bayelsa LAR Flowline 5 87.0 130.0
14'' Ogboinbiri/Tebidaba pipeline riser Bayelsa SAR Pipeline (Oil) 3 94.0 118.0
10" Kwale/Akri Pipeline Delta LAR Pipeline (Oil) 4 96.5 163.2
Ebegoro 2T flowline Rivers LAR Flowline 3 115.0 162.7
6'' Ogboinbiri/Tebidaba pipeline Bayelsa SAR Pipeline (Oil) 4 150.5 252.8
8.4 Management Interpretation
What inter-incident time means and why it matters more than incident count
The total number of incidents at a facility tells you how much has gone wrong historically. The time between incidents tells you how quickly things go wrong again after a repair or intervention. Consider two facilities, each with ten incidents over seven years:
- Facility A experiences incidents spread evenly at roughly ten-month intervals — a low but steady recurrence pattern suggesting manageable wear and predictable maintenance needs
- Facility B experiences all ten incidents within eighteen months, followed by a long quiet period — a structural vulnerability that burst into recurring failure before being addressed
Survival analysis separates these two profiles explicitly. It is Facility B that should command priority in the 2026 inspection programme, not simply the facility with the highest total count.
What the Kaplan-Meier curves reveal
Each curve shows the probability that a facility of a given type or in a given area will remain incident-free as time passes after a previous incident. Reading the curves operationally:
- A curve that drops to 50% within 60 days means half of all incidents at that facility type are followed by another incident within two months. This is a structural recurrence pattern
- A curve that remains above 80% for 200 or more days means the asset class is generally resilient and re-incidents are relatively infrequent
- The gap between the LAR and SAR curves, combined with the log-rank p-value, determines whether one operational area carries a structurally higher recurrence risk than the other and should therefore receive a larger share of inspection resources
What the Cox model adds
The Kaplan-Meier curves show what is happening visually. The Cox model quantifies the effect of each covariate while holding all others constant. Three practical implications follow from the hazard ratio table:
- A hazard ratio above 1.0 for a facility category means that asset class re-incidents faster than the reference category, all else being equal. The magnitude of the ratio indicates how much faster
- A statistically significant ratio (p less than 0.05) means the difference is unlikely to be explained by chance and can be used to justify differentiated budget allocation
- A non-significant ratio means the difference in recurrence rate between that category and the reference is not reliably distinguishable from random variation in the data
What the proportional hazards test confirms
The Schoenfeld residuals test checks whether the hazard ratios are stable over time. A non-significant p-value for each covariate confirms that the ratios hold across the full 2019 to 2025 observation window and can be applied to the 2026 planning horizon with confidence. If any covariate fails the test, its hazard ratio should be interpreted with caution and a time-varying coefficient model considered as an extension in future work.
Recommendation feeding into Section 10
Two concrete outputs from this section feed into the integrated recommendation:
- The facilities in the bottom quartile of median inter-incident time from Table SA2 constitute the mandatory inclusion list for the 2026 Q1 integrity inspection programme
- The facility category and state with the highest statistically significant hazard ratio from the Cox model should receive a proportionally larger share of the integrity inspection budget relative to lower-hazard asset classes
9. Technique 5: Association Rules (Operational Co-occurrence Analysis)
9.1 Technique Overview
Association rule mining is a data mining technique designed to discover interesting relationships, co-occurrence patterns, and dependencies between variables in large transactional datasets (Marr, 2017; Agrawal and Srikant, 1994). Originally developed for market basket analysis — identifying which products customers tend to purchase together — the technique has broad applicability wherever the goal is to find combinations of attributes that co-occur more frequently than chance would predict. In an operational context, it answers the question: given that an incident has a particular cause, location, and facility type, what else is reliably true about it?
The Apriori algorithm, the standard implementation used here, works in three stages:
- Identify frequent itemsets: All attribute combinations that meet a minimum frequency threshold (support) are retained. The minimum support of 0.02 used here ensures each rule covers at least 34 of the 1,695 incidents, providing a statistically reliable foundation
- Evaluate confidence: For each frequent itemset, the algorithm assesses how reliably the antecedent attributes predict the consequent attribute. Only combinations above the minimum confidence threshold of 0.60 are retained as rules
- Rank by lift: Rules are ranked by how much more likely the consequent is given the antecedent than it would be by chance alone. Lift is the primary operational metric because it identifies genuinely non-obvious co-occurrences rather than simply frequent ones
Three metrics define every rule:
- Support: The proportion of all incidents in which the complete rule, both antecedent and consequent, appears together. A support of 0.05 means the combination appears in at least 5% of all 1,695 incidents. Low support rules may be statistically unreliable and are excluded by the minimum threshold
- Confidence: The conditional probability that the consequent appears given the antecedent. A confidence of 0.80 for the rule {Facility: Pipeline (Oil), Area: LAR} → {Sabotage} means that in 80% of incidents where the facility is an oil pipeline in the Lagos Area, the incident is classified as Sabotage. High confidence identifies reliable operational patterns
- Lift: The ratio of the observed confidence to the confidence that would be expected if the antecedent and consequent were statistically independent. A lift of 3.0 means the consequent is three times more likely given the antecedent than it would be by chance. A lift of 1.0 means the rule adds no predictive value beyond the base rate
The minimum lift threshold of 1.2 and the focus on rules with Sabotage as the consequent are both operationally motivated. The goal is to identify attribute combinations that are meaningfully predictive of the incident type most damaging to production continuity.
9.2 Operational Justification
Sabotage accounts for 74% of all incidents in the OERNL dataset, but not all sabotage incidents are equivalent in their operational impact. The structured fields — cause, location, facility category, area, state — are currently reviewed individually during incident investigations. They have never been analysed simultaneously to identify which combinations of attributes reliably co-occur with high-severity sabotage outcomes.
Association rule mining addresses this gap in three ways:
- Security targeting: Rules with high lift linking specific cause types and locations to Sabotage classification identify the geographic and operational contexts where third-party interference is most concentrated. These rules feed directly into the security patrol schedule and community engagement targeting strategy for 2026
- Project scope refinement: Rules linking facility category and cause type to cleanup requirements or production deferment inform the scope of capital maintenance projects. If a high-lift rule identifies that hacksaw cuts at oil pipelines in Bayelsa reliably require cleanup, that combination should be explicitly scoped in the Kwale Flaredown and Brass remediation project budgets rather than treated as a contingency item
- Emergency response protocol design: Frequent itemsets linking incident type combinations to response actions allow the department to pre-position resources and design differentiated response protocols rather than applying a single generic response to all sabotage incidents regardless of their specific attribute profile
This section addresses sub-question (e) of the central analytical question: which combinations of cause type, location, and facility category are most strongly associated with Sabotage classification, and how should these combinations inform the 2026 security and community engagement targeting strategy?
9.3 Analysis and Outputs
Code
# ── Step 1: Build transaction dataset ────────────────────────────
tx <- incidents %>%
dplyr::transmute(
cause = stringr::word(stringr::str_to_title(
stringr::str_squish(cause)), 1, 4),
facility_cat = as.character(facility_cat),
area = as.character(area),
state = as.character(state),
cleanup = as.character(cleanup_required),
category = as.character(category)
) %>%
mutate(across(everything(),
~ ifelse(is.na(.x) | .x == "NA", "Unknown", .x))) %>%
mutate(
cause = paste0("cause=", cause),
facility_cat = paste0("fac=", facility_cat),
area = paste0("area=", area),
state = paste0("state=", state),
cleanup = paste0("cleanup=", cleanup),
category = paste0("category=", category)
)
# ── Step 2: Convert to transactions object ────────────────────────
trans <- as(tx, "transactions")
# ── Step 3: Inspect most frequent items ──────────────────────────
item_freq <- itemFrequency(trans, type = "relative")
top_items <- sort(item_freq, decreasing = TRUE)[1:20]
clean_item_label <- function(x) {
x %>%
stringr::str_replace_all("facility_cat=fac=", "Facility: ") %>%
stringr::str_replace_all("cause=cause=", "Cause: ") %>%
stringr::str_replace_all("area=area=", "Area: ") %>%
stringr::str_replace_all("state=state=", "State: ") %>%
stringr::str_replace_all("cleanup=cleanup=", "Cleanup: ") %>%
stringr::str_replace_all("category=category=", "Category: ") %>%
stringr::str_replace_all("fac=", "Facility: ") %>%
stringr::str_replace_all("cause=", "Cause: ") %>%
stringr::str_replace_all("area=", "Area: ") %>%
stringr::str_replace_all("state=", "State: ") %>%
stringr::str_replace_all("cleanup=", "Cleanup: ") %>%
stringr::str_replace_all("category=", "Category: ")
}
top_items_df <- tibble::tibble(
item = clean_item_label(names(top_items)),
frequency = as.numeric(top_items)
)
ggplot(top_items_df,
aes(x = frequency,
y = forcats::fct_reorder(item, frequency))) +
geom_col(fill = "#2c7fb8", width = 0.7) +
geom_text(aes(label = scales::percent(frequency, accuracy = 0.1)),
hjust = -0.1, size = 3, fontface = "bold") +
scale_x_continuous(labels = scales::percent,
expand = expansion(mult = c(0, 0.15))) +
labs(
title = "Top 20 Most Frequent Items Across All Incident Transactions",
subtitle = "Item frequency as proportion of all 1,695 incidents",
x = "Support (Proportion of Incidents)",
y = NULL
) +
theme_minimal(base_size = 11) +
theme(
plot.title = element_text(face = "bold"),
panel.grid.major.y = element_blank()
)Code
# ── Step 4: Mine association rules ────────────────────────────────
rules <- apriori(
trans,
parameter = list(
supp = 0.02,
conf = 0.60,
minlen = 2,
maxlen = 4
),
control = list(verbose = FALSE)
)
# ── Step 5: Filter to rules with Sabotage as consequent ──────────
sabotage_rules <- subset(
rules,
subset = rhs %pin% "category=Sabotage" & lift > 1.2
)
top_rules <- head(sort(sabotage_rules, by = "lift"), 20)
# ── Step 6: Display top rules as clean table ──────────────────────
clean_rule_text <- function(x) {
x %>%
stringr::str_replace_all("facility_cat=fac=", "Facility: ") %>%
stringr::str_replace_all("cause=cause=", "Cause: ") %>%
stringr::str_replace_all("area=area=", "Area: ") %>%
stringr::str_replace_all("state=state=", "State: ") %>%
stringr::str_replace_all("cleanup=cleanup=", "Cleanup: ") %>%
stringr::str_replace_all("category=category=", "") %>%
stringr::str_replace_all("fac=", "Facility: ") %>%
stringr::str_replace_all("cause=", "Cause: ") %>%
stringr::str_replace_all("area=", "Area: ") %>%
stringr::str_replace_all("state=", "State: ") %>%
stringr::str_replace_all("cleanup=", "Cleanup: ") %>%
stringr::str_replace_all("category=", "")
}
rules_df <- as(top_rules, "data.frame") %>%
dplyr::arrange(desc(lift)) %>%
dplyr::mutate(
Rule = clean_rule_text(as.character(rules)),
Support = round(support, 4),
Confidence = round(confidence, 3),
Lift = round(lift, 3),
Count = as.integer(count)
) %>%
dplyr::select(Rule, Support, Confidence, Lift, Count)
kableExtra::kable(rules_df,
col.names = c("Rule", "Support", "Confidence",
"Lift", "Count"),
caption = "Table AR1: Top Association Rules — Sabotage as Consequent (Ranked by Lift)") %>%
kableExtra::kable_styling(
bootstrap_options = c("striped", "hover", "condensed"),
full_width = TRUE,
font_size = 11
) %>%
kableExtra::column_spec(1, width = "55%") %>%
kableExtra::column_spec(4, bold = TRUE, color = "#e05c2a") %>%
kableExtra::footnote(
general = "Support = proportion of all 1,695 incidents containing the full rule. Confidence = probability of Sabotage given the antecedent conditions. Lift > 1.0 = association stronger than chance."
)| Rule | Support | Confidence | Lift | Count |
|---|---|---|---|---|
| {Facility: Pipeline (Oil),Area: LAR,Cleanup: YES} => {Sabotage} | 0.0313 | 0.964 | 1.305 | 53 |
| {Facility: Pipeline (Oil),State: Rivers,Cleanup: YES} => {Sabotage} | 0.0271 | 0.958 | 1.297 | 46 |
| {Facility: Unknown,Area: SAR,Cleanup: YES} => {Sabotage} | 0.0401 | 0.958 | 1.297 | 68 |
| {Facility: Unknown,State: Rivers,Cleanup: YES} => {Sabotage} | 0.0460 | 0.951 | 1.288 | 78 |
| {Facility: Unknown,State: Bayelsa,Cleanup: YES} => {Sabotage} | 0.0454 | 0.928 | 1.256 | 77 |
| {Facility: Unknown,Cleanup: YES} => {Sabotage} | 0.0956 | 0.920 | 1.246 | 162 |
| {Cause: Unknown,Facility: Unknown,Cleanup: YES} => {Sabotage} | 0.0944 | 0.920 | 1.245 | 160 |
| {Facility: Pipeline (Oil),Cleanup: YES} => {Sabotage} | 0.0844 | 0.911 | 1.233 | 143 |
| {Cause: Unknown,State: Rivers,Cleanup: YES} => {Sabotage} | 0.1068 | 0.910 | 1.231 | 181 |
| {Cause: Unknown,Facility: Pipeline (Oil),Cleanup: YES} => {Sabotage} | 0.0826 | 0.909 | 1.231 | 140 |
| {State: Rivers,Cleanup: YES} => {Sabotage} | 0.1068 | 0.905 | 1.225 | 181 |
| {Area: LAR,State: Rivers,Cleanup: YES} => {Sabotage} | 0.1068 | 0.905 | 1.225 | 181 |
| {Facility: Pipeline (Oil),State: Rivers} => {Sabotage} | 0.0324 | 0.902 | 1.221 | 55 |
| {Facility: Pipeline (Oil),Area: LAR,State: Rivers} => {Sabotage} | 0.0324 | 0.902 | 1.221 | 55 |
| {Cause: Unknown,Facility: Pipeline (Oil),State: Rivers} => {Sabotage} | 0.0324 | 0.902 | 1.221 | 55 |
| {Facility: Flowline,State: Rivers,Cleanup: YES} => {Sabotage} | 0.0319 | 0.900 | 1.218 | 54 |
| {Facility: Pipeline (Oil),Area: SAR} => {Sabotage} | 0.1923 | 0.898 | 1.216 | 326 |
| {Facility: Pipeline (Oil),State: Bayelsa} => {Sabotage} | 0.1923 | 0.898 | 1.216 | 326 |
| {Facility: Pipeline (Oil),Area: SAR,State: Bayelsa} => {Sabotage} | 0.1923 | 0.898 | 1.216 | 326 |
| {Facility: Pipeline (Oil),Area: SAR,Cleanup: Unknown} => {Sabotage} | 0.1286 | 0.897 | 1.215 | 218 |
| Note: | ||||
| Support = proportion of all 1,695 incidents containing the full rule. Confidence = probability of Sabotage given the antecedent conditions. Lift > 1.0 = association stronger than chance. |
Code
# ── Step 7: All rules scatter plot ───────────────────────────────
all_rules_df <- as(rules, "data.frame") %>%
dplyr::mutate(
is_sabotage = stringr::str_detect(rules, "category=Sabotage")
)
ggplot(all_rules_df,
aes(x = support, y = confidence,
colour = lift, size = lift,
alpha = is_sabotage)) +
geom_point() +
scale_colour_gradient(low = "#a8c8e8",
high = "#e05c2a",
name = "Lift") +
scale_size_continuous(range = c(0.5, 4), name = "Lift") +
scale_alpha_manual(values = c("FALSE" = 0.2, "TRUE" = 0.8),
name = "Sabotage Rule") +
scale_x_continuous(labels = scales::percent) +
scale_y_continuous(labels = scales::percent) +
labs(
title = "Association Rules: Support vs Confidence (Colour = Lift)",
subtitle = "Orange-highlighted points = rules with Sabotage as consequent",
x = "Support",
y = "Confidence"
) +
theme_minimal(base_size = 11) +
theme(plot.title = element_text(face = "bold"))Code
# ── Step 8: Interactive network graph via arulesViz ───────────────
grDevices::pdf(nullfile())
invisible(capture.output({
network_widget <- plot(top_rules,
method = "graph",
engine = "htmlwidget",
control = list(
main = "Association Rule Network: Top 20 Sabotage Rules by Lift",
verbose = FALSE
))
}))
grDevices::dev.off()png
2
Code
widget_path <- here::here("output", "ar_network.html")
htmlwidgets::saveWidget(network_widget, widget_path,
selfcontained = TRUE)
htmltools::tags$iframe(
src = widget_path,
width = "100%",
height = "550px",
style = "border:none;"
)Code
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.ticker as mticker
from mlxtend.frequent_patterns import apriori, association_rules
from mlxtend.preprocessing import TransactionEncoder
inc_py = r.incidents_py
cols = ["cause", "facility_cat", "area", "state",
"cleanup_required", "category"]
tx_py = inc_py[cols].fillna("Unknown").astype(str).copy()
tx_py["cause"] = tx_py["cause"].str.split().str[:4].str.join(" ")
prefix_map = {
"cause": "cause=",
"facility_cat": "fac=",
"area": "area=",
"state": "state=",
"cleanup_required": "cleanup=",
"category": "category="
}
for col, prefix in prefix_map.items():
tx_py[col] = prefix + tx_py[col]
tx_list = tx_py.values.tolist()
te = TransactionEncoder()
te.fit(tx_list)TransactionEncoder()
Code
X = pd.DataFrame(te.transform(tx_list), columns=te.columns_)
freq_items = apriori(X, min_support=0.02, use_colnames=True)
freq_items["length"] = freq_items["itemsets"].apply(len)
rules_py = association_rules(
freq_items,
metric = "lift",
min_threshold = 1.2
)
rules_py["antecedents_str"] = rules_py["antecedents"].apply(
lambda x: ", ".join(sorted(x))
)
rules_py["consequents_str"] = rules_py["consequents"].apply(
lambda x: ", ".join(sorted(x))
)
sab_rules = rules_py[
rules_py["consequents_str"].str.contains("Sabotage", na=False)
].copy()
top_sab = (sab_rules
.sort_values("lift", ascending=False)
.head(20)
[["antecedents_str", "consequents_str",
"support", "confidence", "lift"]]
.round(4))
def clean_label_py(x):
return (x.replace("facility_cat=fac=", "Facility: ")
.replace("cause=cause=", "Cause: ")
.replace("area=area=", "Area: ")
.replace("state=state=", "State: ")
.replace("cleanup=cleanup=", "Cleanup: ")
.replace("category=category=", "")
.replace("fac=", "Facility: ")
.replace("cause=", "Cause: ")
.replace("area=", "Area: ")
.replace("state=", "State: ")
.replace("cleanup=", "Cleanup: ")
.replace("category=", ""))
top_sab["antecedents_str"] = top_sab["antecedents_str"].apply(
clean_label_py)
top_sab["consequents_str"] = top_sab["consequents_str"].apply(
clean_label_py)
print("Top 20 Sabotage Rules by Lift:")Top 20 Sabotage Rules by Lift:
Code
print(top_sab.to_string(index=False)) antecedents_str consequents_str support confidence lift
Area: LAR, Cause: Oil Theft, Cleanup: YES Sabotage, Facility: Pipeline (Oil), State: Rivers 0.0212 0.6792 20.9331
Cause: Oil Theft, Cleanup: YES, State: Rivers Area: LAR, Sabotage, Facility: Pipeline (Oil) 0.0212 0.7826 19.5077
Facility: Pipeline (Oil), State: Rivers Area: LAR, Sabotage, Cause: Oil Theft, Cleanup: YES 0.0212 0.5902 18.8741
Area: LAR, Facility: Pipeline (Oil) Sabotage, Cause: Oil Theft, Cleanup: YES, State: Rivers 0.0212 0.4675 17.2276
Cause: Oil Theft, State: Rivers Area: LAR, Sabotage, Cleanup: YES, Facility: Pipeline (Oil) 0.0212 0.4865 15.5584
Area: LAR, Cleanup: YES, Facility: Pipeline (Oil) Sabotage, Cause: Oil Theft, State: Rivers 0.0212 0.6545 14.9926
Cause: Oil Theft, State: Rivers Area: LAR, Sabotage, Facility: Pipeline (Oil) 0.0236 0.5405 13.4738
Area: LAR, Cause: Oil Theft Sabotage, Cleanup: YES, Facility: Pipeline (Oil), State: Rivers 0.0212 0.3495 12.8789
Cleanup: YES, Facility: Pipeline (Oil), State: Rivers Area: LAR, Sabotage, Cause: Oil Theft 0.0212 0.7500 12.3422
Area: LAR, Cause: Oil Theft Sabotage, Facility: Pipeline (Oil), State: Rivers 0.0236 0.3883 11.9682
Area: LAR, Facility: Pipeline (Oil) Sabotage, Cause: Oil Theft, State: Rivers 0.0236 0.5195 11.8989
Facility: Pipeline (Oil), State: Rivers Area: LAR, Sabotage, Cause: Oil Theft 0.0236 0.6557 10.7910
Cause: Oil Theft, Facility: Pipeline (Oil), State: Rivers Area: LAR, Sabotage, Cleanup: YES 0.0212 0.9000 7.1285
Area: LAR, Cause: Oil Theft, Facility: Pipeline (Oil) Sabotage, Cleanup: YES, State: Rivers 0.0212 0.7500 7.0235
Cause: Oil Theft, Cleanup: YES Area: LAR, Sabotage, Facility: Pipeline (Oil), State: Rivers 0.0212 0.2156 6.6434
Cause: Oil Theft, Cleanup: YES Sabotage, Facility: Pipeline (Oil), State: Rivers 0.0212 0.2156 6.6434
Cleanup: YES, State: Rivers Area: LAR, Sabotage, Cause: Oil Theft, Facility: Pipeline (Oil) 0.0212 0.1800 6.3562
Area: LAR, Cleanup: YES Sabotage, Cause: Oil Theft, Facility: Pipeline (Oil), State: Rivers 0.0212 0.1457 6.1761
Facility: Pipeline (Oil), State: Rivers Sabotage, Cause: Oil Theft, Cleanup: YES 0.0212 0.5902 5.9900
Area: LAR, Facility: Pipeline (Oil), State: Rivers Sabotage, Cause: Oil Theft, Cleanup: YES 0.0212 0.5902 5.9900
Code
fig, ax = plt.subplots(figsize=(10, 6))
non_sab = rules_py[
~rules_py["consequents_str"].str.contains("Sabotage", na=False)
]
sab_sub = rules_py[
rules_py["consequents_str"].str.contains("Sabotage", na=False)
]
ax.scatter(non_sab["support"], non_sab["confidence"],
c=non_sab["lift"], cmap="Blues",
alpha=0.3, s=20, label="Other rules")<matplotlib.collections.PathCollection object at 0x00000173F6569E90>
Code
sc2 = ax.scatter(sab_sub["support"], sab_sub["confidence"],
c=sab_sub["lift"], cmap="Oranges",
alpha=0.8, s=40, label="Sabotage rules")
plt.colorbar(sc2, ax=ax, label="Lift")<matplotlib.colorbar.Colorbar object at 0x00000173F6621DD0>
Code
ax.set_title(
"Association Rules: Support vs Confidence\n"
"(Orange = Sabotage consequent, Colour intensity = Lift)",
fontsize=11, fontweight="bold"
)Text(0.5, 1.0, 'Association Rules: Support vs Confidence\n(Orange = Sabotage consequent, Colour intensity = Lift)')
Code
ax.set_xlabel("Support")Text(0.5, 0, 'Support')
Code
ax.set_ylabel("Confidence")Text(0, 0.5, 'Confidence')
Code
ax.xaxis.set_major_formatter(mticker.PercentFormatter(xmax=1))
ax.yaxis.set_major_formatter(mticker.PercentFormatter(xmax=1))
ax.legend()<matplotlib.legend.Legend object at 0x00000173F656B510>
Code
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()Code
top15 = top_sab.head(15).copy()
top15["rule_label"] = (
top15["antecedents_str"].str[:45] + " => Sabotage"
)
fig, ax = plt.subplots(figsize=(11, 6))
bars = ax.barh(range(len(top15)), top15["lift"],
color="#e05c2a", alpha=0.85, height=0.6)
ax.set_yticks(range(len(top15)))[<matplotlib.axis.YTick object at 0x00000173F415E7D0>, <matplotlib.axis.YTick object at 0x00000173F6378090>, <matplotlib.axis.YTick object at 0x00000173F4201F50>, <matplotlib.axis.YTick object at 0x00000173F4201150>, <matplotlib.axis.YTick object at 0x00000173F4189F90>, <matplotlib.axis.YTick object at 0x00000173F4188190>, <matplotlib.axis.YTick object at 0x00000173F41BFDD0>, <matplotlib.axis.YTick object at 0x00000173F41BE590>, <matplotlib.axis.YTick object at 0x00000173802A1150>, <matplotlib.axis.YTick object at 0x00000173F42A1150>, <matplotlib.axis.YTick object at 0x00000173F41BCC50>, <matplotlib.axis.YTick object at 0x00000173F42530D0>, <matplotlib.axis.YTick object at 0x00000173F4282E90>, <matplotlib.axis.YTick object at 0x00000173F63ED6D0>, <matplotlib.axis.YTick object at 0x00000173F41D2E90>]
Code
ax.set_yticklabels(top15["rule_label"], fontsize=8)[Text(0, 0, 'Area: LAR, Cause: Oil Theft, Cleanup: YES => Sabotage'), Text(0, 1, 'Cause: Oil Theft, Cleanup: YES, State: Rivers => Sabotage'), Text(0, 2, 'Facility: Pipeline (Oil), State: Rivers => Sabotage'), Text(0, 3, 'Area: LAR, Facility: Pipeline (Oil) => Sabotage'), Text(0, 4, 'Cause: Oil Theft, State: Rivers => Sabotage'), Text(0, 5, 'Area: LAR, Cleanup: YES, Facility: Pipeline ( => Sabotage'), Text(0, 6, 'Cause: Oil Theft, State: Rivers => Sabotage'), Text(0, 7, 'Area: LAR, Cause: Oil Theft => Sabotage'), Text(0, 8, 'Cleanup: YES, Facility: Pipeline (Oil), State => Sabotage'), Text(0, 9, 'Area: LAR, Cause: Oil Theft => Sabotage'), Text(0, 10, 'Area: LAR, Facility: Pipeline (Oil) => Sabotage'), Text(0, 11, 'Facility: Pipeline (Oil), State: Rivers => Sabotage'), Text(0, 12, 'Cause: Oil Theft, Facility: Pipeline (Oil), S => Sabotage'), Text(0, 13, 'Area: LAR, Cause: Oil Theft, Facility: Pipeli => Sabotage'), Text(0, 14, 'Cause: Oil Theft, Cleanup: YES => Sabotage')]
Code
ax.set_xlabel("Lift")Text(0.5, 0, 'Lift')
Code
ax.set_title(
"Top 15 Association Rules with Sabotage as Consequent (Ranked by Lift)",
fontsize=11, fontweight="bold"
)Text(0.5, 1.0, 'Top 15 Association Rules with Sabotage as Consequent (Ranked by Lift)')
Code
ax.axvline(1.0, linestyle="--", color="grey", linewidth=0.8)<matplotlib.lines.Line2D object at 0x00000173F414FC90>
Code
ax.invert_yaxis()
for bar, conf in zip(bars, top15["confidence"]):
ax.text(bar.get_width() + 0.005,
bar.get_y() + bar.get_height() / 2,
f"conf={conf:.2f}",
va="center", fontsize=7)Text(20.9381, 0.0, 'conf=0.68')
Text(19.5127, 1.0, 'conf=0.78')
Text(18.879099999999998, 2.0, 'conf=0.59')
Text(17.232599999999998, 3.0, 'conf=0.47')
Text(15.563400000000001, 4.0, 'conf=0.49')
Text(14.9976, 5.0, 'conf=0.65')
Text(13.478800000000001, 6.0, 'conf=0.54')
Text(12.8839, 7.0, 'conf=0.35')
Text(12.3472, 8.0, 'conf=0.75')
Text(11.9732, 9.0, 'conf=0.39')
Text(11.9039, 10.0, 'conf=0.52')
Text(10.796000000000001, 11.0, 'conf=0.66')
Text(7.1335, 12.0, 'conf=0.90')
Text(7.0285, 13.0, 'conf=0.75')
Text(6.6484, 14.0, 'conf=0.22')
Code
plt.tight_layout()
plt.show()Code
print("\nSummary statistics of all Sabotage rules:")
Summary statistics of all Sabotage rules:
Code
print(sab_rules[["support", "confidence", "lift"]].describe().round(4)) support confidence lift
count 1362.0000 1362.0000 1362.0000
mean 0.0637 0.4908 2.2743
std 0.0532 0.3175 1.4935
min 0.0201 0.0399 1.2004
25% 0.0283 0.2048 1.5541
50% 0.0383 0.4432 2.0257
75% 0.0802 0.7541 2.5354
max 0.3841 1.0000 20.9331
9.4 Management Interpretation
What association rules are and how to read them
Association rule mining works like an experienced investigator who has read all 1,695 incident reports and noticed that certain combinations of facts tend to appear together far more often than chance would predict. The algorithm formalises that intuition by scanning every possible combination of cause type, location, facility category, area, state, and cleanup requirement, and surfacing only those combinations that are both frequent enough to be reliable and strong enough to be surprising.
Each rule comes with three numbers that tell a precise operational story:
- Support answers: how often does this exact combination appear across all 1,695 incidents? A support of 0.03 means the combination appears in about 51 incidents — enough to be statistically reliable but specific enough to be operationally targeted
- Confidence answers: when the antecedent conditions are present, how reliably does Sabotage follow? A confidence of 0.96 means sabotage classification follows in 96 out of every 100 incidents with that attribute combination
- Lift answers: how much more predictive is this combination than the overall sabotage rate alone? A lift of 1.3 means the combination is 30% more predictive of sabotage than the base rate. This is the number that determines whether a rule is genuinely actionable intelligence or merely a restatement of what the base rate already tells us
What the top rules mean for 2026 planning
The rules with the highest lift values identify the most specific and most reliable combinations of incident attributes that co-occur with sabotage. These rules answer the question the security and community engagement teams most need heading into 2026: not simply where sabotage happens most often, but which precise combinations of location, facility type, and cause pattern are the strongest predictors of sabotage activity.
The top-ranked rule, {Facility: Pipeline (Oil), Area: LAR, Cleanup: YES} → {Sabotage} with confidence 0.964 and lift 1.305, carries a direct scheduling implication. When a cleanup-requiring incident occurs at an oil pipeline in the Lagos Area, it is classified as sabotage in over 96% of cases. This is not a marginal finding. Resources concentrated at oil pipelines in LAR during periods of elevated third-party activity will intercept a disproportionate share of sabotage events relative to their cost.
What the network graph reveals
The network graph visualises all 20 top rules simultaneously as a connected map. Reading it:
- Nodes represent individual incident attributes (facility type, area, state, cleanup status, cause)
- Edges represent rules, with arrows pointing toward the Sabotage target node
- Nodes with many edges connecting to them are the most diagnostically powerful attributes. They appear in the most high-confidence sabotage rules and anchor the security targeting brief
- Hovering over any edge displays the exact lift, confidence, and support for that specific rule
- Colour coding distinguishes attribute type: blue for facility type, green for area, purple for state, amber for cleanup status, and grey for cause
What this means for project scope
The rules linking specific cause and facility combinations to cleanup requirements have a direct capital implication. If a high-lift rule confirms that incidents at oil pipelines in SAR reliably require cleanup, that combination should be explicitly costed and scoped in the Brass Canal Remediation and Ebocha Water Reinjection project budgets. It belongs in the base estimate as a predictable and quantifiable cost element, not in the contingency reserve.
Recommendation feeding into Section 10
Two outputs from this section feed directly into the 2026 security strategy:
- The top five rules by lift from Table AR1 define the specific operational contexts that warrant elevated security presence and proactive community liaison in 2026
- The antecedent combinations in those rules — the precise combinations of facility type, area, and cause — replace broad geographic targeting with evidence-based attribute targeting that is both more efficient and more defensible to management
10. Integrated Findings
10.1 What the Five Analyses Collectively Show
Five analytical techniques were applied to the same 1,695-incident dataset, each addressing a different dimension of the 2026 operational planning problem. Taken individually, each produces a useful output. Taken together, they converge on a single, coherent operational picture that no individual analysis could produce alone.
Text Analytics (Section 5)
The text analytics revealed that incident narrative descriptions cluster into five latent topics. Two are operationally diagnostic:
- Topic 2, dominated by the terms person, installed, valve, cut, theft, and flowline, captures the language of deliberate human interference. An unknown person physically installing a bypass valve or cutting a flowline to extract oil
- Topic 1, centred on pipe, valve, nipple, and flange, captures a second pattern of hardware manipulation at pipe connections
Neither pattern maps cleanly onto the structured cause field. They emerge only from the unstructured narrative text. Together they confirm that the dominant failure mode across the portfolio is not mechanical degradation but organised, repeated physical interference at flowlines and valve points.
Survival Analysis (Section 8)
The Kaplan-Meier curves and Cox Proportional Hazards model show that the highest-recurrence facilities are concentrated in the SAR operational area:
- The 18” Tebidaba/Brass Pipeline records a median of just 2.0 days between consecutive incidents across 131 recorded intervals
- The 10” Clough Creek/Tebidaba Pipeline records a median of 3.0 days across 141 intervals
- The Cox model identifies Pipeline (Oil) in SAR and Bayelsa as the asset class carrying the highest hazard rate, controlling for all other factors
These are not facilities that occasionally fail. They are facilities under sustained, near-continuous attack.
Association Rules (Section 9)
The highest-lift rule in the dataset confirms the mechanism directly:
- {Facility: Pipeline (Oil), Area: LAR, Cleanup: YES} → {Sabotage} with confidence 0.964 and lift 1.305
- When a cleanup-requiring incident occurs at an oil pipeline, sabotage classification follows in 96.4% of cases, at a rate 30% above what the overall sabotage base rate would predict
- Across all top 20 rules, Pipeline (Oil) appears as the dominant antecedent facility type and cleanup requirement is the single strongest co-occurring operational attribute
Monte Carlo Simulation (Section 6)
The simulation quantifies the financial consequence of this pattern:
- P50 annual remediation cost: $17,136,316
- P90 conservative reserve: $18,603,017
- The gap of approximately $1.5 million between P50 and P90 is the contingency required to cover costs in 9 out of 10 years
- The tornado sensitivity analysis identifies per-incident remediation cost as the dominant driver of uncertainty. Better cost capture in Joint Investigation Visit (JIV) reports would directly narrow the P90 band in future planning cycles
Prophet Forecasting (Section 7)
The forecast provides the timing context:
- Mean RMSE of 15.41 incidents per month across walk-forward cross-validation folds reflects the highly variable and externally driven nature of pipeline sabotage in the Niger Delta
- The seasonal component confirms that incident frequency is not uniform across the year
- The six-month forward projection provides the capacity planning baseline for emergency response resourcing in the first half of 2026
Convergence visualisation
The chart below summarises how each technique contributes to the single integrated recommendation.
10.2 The Single Integrated Recommendation
The convergence across all five techniques is striking in its consistency:
- The LDA topic model identifies the language of physical interference at flowlines and valve points as the dominant narrative pattern
- The survival analysis identifies the Tebidaba and Clough Creek oil pipelines in SAR as the highest-recurrence assets in the portfolio with median inter-incident times of 2 to 3 days
- The association rules establish that oil pipeline incidents requiring cleanup are classified as sabotage in over 96% of cases
- The Monte Carlo simulation puts the P90 annual cost of this pattern at $18.6 million
- The Prophet forecast confirms the pattern is structural and ongoing rather than declining
These five outputs, derived independently from different analytical frameworks, all point to the same underlying operational reality: a concentrated, organised pattern of third-party interference targeting oil transmission pipelines in the SAR area, particularly the Tebidaba and Clough Creek corridors, that is generating the majority of OERNL’s incident volume, remediation cost exposure, and production deferral risk.
The 2026 operational plan should reflect this convergence in four specific actions:
| Priority | 2026 Action | Analytical Basis |
|---|---|---|
| 1 | Elevate the 18" Tebidaba/Brass Pipeline and 10" Clough Creek/Tebidaba Pipeline to mandatory Q1 2026 integrity inspection. Both facilities record median inter-incident times below 3 days across 130-plus repeat intervals | Survival analysis: median time to event = 2.0 days (Tebidaba/Brass, 131 intervals); 3.0 days (Clough Creek, 141 intervals). Cox hazard ratio confirms Pipeline (Oil) in SAR as highest-hazard asset class |
| 2 | Hold $18,603,017 as the 2026 remediation contingency reserve in the annual operating budget submission to Finance and the Board. This is the P90 Monte Carlo output | Monte Carlo simulation: P50 = $17,136,316; P90 = $18,603,017. Tornado chart identifies per-incident cost as dominant uncertainty driver |
| 3 | Direct security patrol and community engagement resources to SAR-area oil pipelines during peak incident months identified by the Prophet seasonal component. Do not distribute resources uniformly across the portfolio | Association rules: top rule confidence 0.964, lift 1.305 for Pipeline (Oil) and cleanup requirement leading to Sabotage. LDA Topic 2 confirms person-valve-cut-theft as dominant interference narrative |
| 4 | Expand JIV cost capture to all sabotage incidents, not only the 51 currently recorded. This directly narrows the Monte Carlo P90 band and improves the defensibility of future budget submissions | Monte Carlo limitation: cost distribution fitted from only 51 cost records. Wider cost coverage reduces P90 uncertainty and strengthens the capital plan |
The four actions above are not independent recommendations from four separate analyses. They are a single, integrated operational response to a single, concentrated problem that five different analytical lenses have independently identified and quantified. The Tebidaba and Clough Creek corridors are the highest-cost, highest-frequency, highest-sabotage-risk assets in the OERNL portfolio. The 2026 plan should treat them accordingly.
11. Limitations and Further Work
11.1 Data Limitations
The analyses in this submission were conducted on internally sourced operational data that carries several structural limitations. Each limitation is documented below with its specific impact on the analytical outputs and the remediation action that would address it.
Sparse cost coverage
Of the 1,695 incident records in the consolidated dataset, only 51 contain a populated cost_usd field, predominantly from sabotage records in 2019 and 2021. This means the Monte Carlo simulation in Section 6 was fitted from a Gamma distribution calibrated on fewer than 4% of all incidents.
- The remaining 96% of incidents have no recorded remediation cost, creating a sparse and potentially unrepresentative fitting sample
- The single most impactful data improvement available to the department is a standing requirement for JIV teams to record a cost estimate for every incident, not only major sabotage events
- This one change would directly narrow the P90 band and improve the defensibility of all future budget submissions to Finance and the Board
Inconsistent facility categorisation
Table SA2 shows that several of the highest-recurrence facilities, including the top-ranked 24” Ogoda-Brass Pipeline and multiple Tebidaba corridor assets, carry no value in the facility_cat field.
- Facilities without a facility category were excluded from Cox model stratification and the facility category layer of the association rules
- The hazard ratios reported in Table SA1 are therefore biased toward the better-documented asset classes, understating the true differential between categorised and uncategorised facilities
- A systematic audit of the facility category field across the incident register would immediately improve the precision of both outputs
Cause field granularity
The raw cause field contains free-text entries of highly variable length, from single words like “Corrosion” to full sentences describing the sequence of events.
- Truncating to four words for the association rule transactions reduced cardinality but lost meaningful distinctions between cause sub-types
- A structured cause taxonomy with a controlled vocabulary of 15 to 20 canonical cause categories, applied consistently at point of entry, would make both the TF-IDF analysis and the association rules significantly more discriminating
- This is a data governance change, not a technical one, and requires no additional tooling to implement
Deferred production data
Deferred production volume was available for only 51 records, the same sparse subset as cost data.
- The Monte Carlo simulation was constrained to remediation cost alone and could not model the full financial exposure including production deferral
- Incorporating deferred production barrels multiplied by the prevailing oil price would give a materially more complete picture of the true P90 financial exposure
- The Board-level contingency reserve figure should eventually reflect total financial exposure, not remediation cost alone
No external covariates for forecasting
The Prophet model in Section 7 captures average seasonal patterns in incident frequency but cannot distinguish between a seasonally elevated period driven by weather versus one driven by community unrest.
- External covariates including rainfall data from NIMET, community tension indices, and DPR inspection schedules were not available for integration as Prophet regressors
- Their absence means the seasonal component blends structurally distinct drivers into a single undifferentiated pattern
- Separating these signals would improve both forecast accuracy and the operational specificity of the capacity planning recommendation
11.2 Analytical Limitations
Beyond data quality, four analytical assumptions constrain the interpretation of the outputs.
The bag-of-words assumption in text analytics
The LDA topic model discards word order and grammatical structure entirely.
- For incident narratives where the subject-verb-object structure carries the operational meaning, a dependency-parsed or transformer-based model such as a fine-tuned BERT variant would extract substantially richer causal intelligence
- With access to a GPU environment, replacing the LDA pipeline with a BERTopic or Named Entity Recognition model trained on oil and gas incident text would be the natural analytical next step
The proportional hazards assumption in survival analysis
The Cox model was tested via Schoenfeld residuals, and the p-values reported in Section 8 indicate potential violation for some covariates.
- Pipeline operating conditions in the Niger Delta changed materially between 2019 and 2025. The 2020 oil price shock, COVID-19 movement restrictions, and subsequent recovery each created structural breaks that a time-constant hazard ratio cannot fully capture
- A time-varying coefficient Cox model, or a parametric accelerated failure time model with explicit changepoint detection, would more honestly represent the non-stationarity in the recurrence process
Association rules are descriptive, not causal
The highest-lift rule establishes a strong co-occurrence pattern but does not prove causation.
- Validating the rules against an independent holdout dataset, or against a separate year of incident records not used in model training, would distinguish genuine operational patterns from documentation artefacts
- Until validated, the top rules should be treated as priority hypotheses for field investigation rather than confirmed operational facts
The Prophet forecast carries high uncertainty at monthly level
The mean RMSE of 15.41 incidents per month and MAPE of 114.7% reflect the fundamental unpredictability of sabotage-driven incident counts.
- Unlike demand forecasting in retail or manufacturing where Prophet was originally designed, pipeline sabotage in the Niger Delta is driven by social, economic, and security dynamics that no time series model can capture from incident frequency alone
- The model is useful for annual-level capacity planning and seasonal pattern identification but should not be used to make precise monthly staffing decisions without an explicit uncertainty buffer
11.3 Further Work
The table below summarises what would be done differently with more data, time, or computing power across each of the five techniques.
| Technique | Current Limitation | Proposed Further Work | Resource Required |
|---|---|---|---|
| Text Analytics | LDA bag-of-words loses word order and semantic context in short incident narratives | Replace LDA with BERTopic or fine-tuned BERT Named Entity Recognition on GPU. Extract structured cause entities directly from narrative text | GPU compute environment. Labelled training corpus of 500-plus incident narratives |
| Monte Carlo Simulation | Cost distribution fitted from only 51 records. 96% of incidents have no cost data | Mandate JIV cost capture for all incidents. Refit Gamma distribution on full portfolio. Extend simulation to include deferred production value | Operational policy change requiring mandatory cost field in incident register from 2026 |
| Advanced Forecasting | MAPE of 114.7% reflects unpredictability of sabotage-driven counts without external covariates | Add external regressors to Prophet — NIMET rainfall, community tension index, crude price — to separate structural from externally triggered spikes | Data sharing agreement with NIMET. Access to Niger Delta Security Tracker |
| Survival Analysis | Cox model assumes time-constant hazard ratios across a 7-year window with structural breaks | Fit time-varying coefficient Cox model with explicit changepoint detection at 2020 and 2022 structural breaks | No additional data required. Analytical extension of current survival dataset |
| Association Rules | Rules are descriptive co-occurrences. Causal direction not validated against holdout data | Build gradient-boosted classifier using top rule antecedents as features. Moves from retrospective pattern detection to real-time sabotage risk flagging | 12 months of new incident records as holdout validation set |
References
Primary Data Source
Olanubi, W. (2026). OERNL pipeline incident register 2019–2025 [Dataset]. Compiled from Joint Investigation Visit reports and the HSE incident management system, Project Services Department, Oando Energy Resources Nigeria Limited, Lagos, Nigeria. Data available on request from the author.
Course Textbook
Adi, B. (2026). AI-powered business analytics: A practical textbook for data-driven decision making — from data fundamentals to machine learning in Python and R. Lagos Business School / markanalytics.online. https://markanalytics.online
Academic and Professional References
Agrawal, R., & Srikant, R. (1994). Fast algorithms for mining association rules in large databases. In Proceedings of the 20th International Conference on Very Large Data Bases (pp. 487–499). Morgan Kaufmann.
Carsey, T. M., & Harden, J. J. (2013). Monte Carlo simulation and resampling methods for social science (1st ed.). SAGE Publications.
Glasserman, P. (2003). Monte Carlo methods in financial engineering (Vol. 53). Springer.
Kleinbaum, D. G., & Klein, M. (2012). Survival analysis: A self-learning text (3rd ed.). Springer. https://doi.org/10.1007/978-1-4419-6646-9
Marr, B. (2017). Key business analytics: The 60+ business analysis tools every manager needs to know. Pearson Education.
Taylor, S. J., & Letham, B. (2018). Forecasting at scale. The American Statistician, 72(1), 37–45. https://doi.org/10.1080/00031305.2017.1380080
Wickham, H., Averick, M., Bryan, J., Chang, W., McGowan, L., François, R., Grolemund, G., Hayes, A., Henry, L., Hester, J., Kuhn, M., Pedersen, T. L., Miller, E., Bache, S. M., Müller, K., Ooms, J., Robinson, D., Seidel, D. P., Spinu, V., … Yutani, H. (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686. https://doi.org/10.21105/joss.01686
Wickham, H. (2016). ggplot2: Elegant graphics for data analysis. Springer. https://doi.org/10.1007/978-3-319-24277-4
R Software and Package Citations
R Core Team. (2024). R: A language and environment for statistical computing (Version 4.4). R Foundation for Statistical Computing. https://www.R-project.org/
Allaire, J. J., Teague, C., Scheidegger, C., Xie, Y., & Dervieux, C. (2022). Quarto (Version 1.5) [Computer software]. https://doi.org/10.5281/zenodo.5960048
Bürkner, P. C., Gabry, J., Kay, M., & Vehtari, A. (2022). posterior: Tools for working with posterior distributions (R package). https://mc-stan.org/posterior/
Firke, S. (2023). janitor: Simple tools for examining and cleaning dirty data (R package version 2.2.0). https://CRAN.R-project.org/package=janitor
Grolemund, G., & Wickham, H. (2011). Dates and times made easy with lubridate. Journal of Statistical Software, 40(3), 1–25. https://doi.org/10.18637/jss.v040.i03
Grun, B., & Hornik, K. (2011). topicmodels: An R package for fitting topic models. Journal of Statistical Software, 40(13), 1–30. https://doi.org/10.18637/jss.v040.i13
Hahsler, M., Grün, B., & Hornik, K. (2005). arules — A computational environment for mining association rules and frequent item sets. Journal of Statistical Software, 14(15), 1–25. https://doi.org/10.18637/jss.v014.i15
Hahsler, M. (2017). arulesViz: Interactive visualization of association rules with R. R Journal, 9(2), 163–175. https://doi.org/10.32614/RJ-2017-047
Kassambara, A., & Kosinski, M. (2021). survminer: Drawing survival curves using ggplot2 (R package version 0.4.9). https://CRAN.R-project.org/package=survminer
Müller, K., & Wickham, H. (2023). tibble: Simple data frames (R package version 3.2.1). https://CRAN.R-project.org/package=tibble
Pedersen, T. L. (2022). patchwork: The composer of plots (R package version 1.1.2). https://CRAN.R-project.org/package=patchwork
Sievert, C. (2020). Interactive web-based data visualization with R, plotly, and shiny. Chapman and Hall/CRC. https://plotly-r.com
Silge, J., & Robinson, D. (2016). tidytext: Text mining and analysis using tidy data principles in R. Journal of Open Source Software, 1(3), 37. https://doi.org/10.21105/joss.00037
Spinu, V., Grolemund, G., & Wickham, H. (2023). lubridate: Make dealing with dates a little easier (R package version 1.9.3). https://CRAN.R-project.org/package=lubridate
Taylor, S. J., & Letham, B. (2021). prophet: Automatic forecasting procedure (R package version 1.0). https://CRAN.R-project.org/package=prophet
Therneau, T. M. (2023). A package for survival analysis in R (R package version 3.5-7). https://CRAN.R-project.org/package=survival
Therneau, T. M., & Grambsch, P. M. (2000). Modeling survival data: Extending the Cox model. Springer. https://doi.org/10.1007/978-1-4757-3294-8
Wickham, H. (2022). stringr: Simple, consistent wrappers for common string operations (R package version 1.5.0). https://CRAN.R-project.org/package=stringr
Wickham, H., & Bryan, J. (2023). readxl: Read Excel files (R package version 1.4.3). https://CRAN.R-project.org/package=readxl
Wickham, H., François, R., Henry, L., Müller, K., & Vaughan, D. (2023). dplyr: A grammar of data manipulation (R package version 1.1.4). https://CRAN.R-project.org/package=dplyr
Wickham, H., & Henry, L. (2023). purrr: Functional programming tools (R package version 1.0.2). https://CRAN.R-project.org/package=purrr
Wickham, H., Vaughan, D., & Girlich, M. (2023). tidyr: Tidy messy data (R package version 1.3.0). https://CRAN.R-project.org/package=tidyr
Zeileis, A., & Grothendieck, G. (2005). zoo: S3 infrastructure for regular and irregular time series. Journal of Statistical Software, 14(6), 1–27. https://doi.org/10.18637/jss.v014.i06
Python Package Citations
Harris, C. R., Millman, K. J., van der Walt, S. J., Gommers, R., Virtanen, P., Cournapeau, D., Wieser, E., Taylor, J., Berg, S., Smith, N. J., Kern, R., Picus, M., Hoyer, S., van Kerkwijk, M. H., Brett, M., Haldane, A., del Río, J. F., Wiebe, M., Peterson, P., … Oliphant, T. E. (2020). Array programming with NumPy. Nature, 585, 357–362. https://doi.org/10.1038/s41586-020-2649-2
Hunter, J. D. (2007). Matplotlib: A 2D graphics environment. Computing in Science & Engineering, 9(3), 90–95. https://doi.org/10.1109/MCSE.2007.55
McKinney, W. (2010). Data structures for statistical computing in Python. In Proceedings of the 9th Python in Science Conference (pp. 56–61). https://doi.org/10.25080/Majora-92bf1922-00a
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., & Duchesnay, É. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12, 2825–2830.
Rasband, D. (2021). vaderSentiment: VADER sentiment analysis (Python package version 3.3.2). https://github.com/cjhutto/vaderSentiment
Seabold, S., & Perktold, J. (2010). Statsmodels: Econometric and statistical modeling with Python. In Proceedings of the 9th Python in Science Conference (pp. 92–96). https://doi.org/10.25080/Majora-92bf1922-011
Taylor, S. J., & Letham, B. (2021). Prophet: Forecasting at scale (Python package version 1.1). https://facebook.github.io/prophet/
Van Rossum, G., & Drake, F. L. (2009). Python 3 reference manual. CreateSpace.
Virtanen, P., Gommers, R., Oliphant, T. E., Haberland, M., Reddy, T., Cournapeau, D., Burovski, E., Peterson, P., Weckesser, W., Bright, J., van der Walt, S. J., Brett, M., Wilson, J., Millman, K. J., Mayorov, N., Nelson, A. R. J., Jones, E., Kern, R., Larson, E., … SciPy 1.0 Contributors. (2020). SciPy 1.0: Fundamental algorithms for scientific computing in Python. Nature Methods, 17, 261–272. https://doi.org/10.1038/s41592-019-0686-2
Waskom, M. (2021). Seaborn: Statistical data visualization. Journal of Open Source Software, 6(60), 3021. https://doi.org/10.21105/joss.03021
Appendix: AI Usage Statement
Claude (Anthropic, claude.ai) was used as an AI coding assistant throughout this submission. Specifically, Claude assisted with the generation of R and Python code chunks for data loading, cleaning, and transformation in the setup chunk; the construction of ggplot2 and matplotlib visualisations; the implementation of the LDA topic model, Monte Carlo simulation, Prophet forecasting, Kaplan-Meier and Cox Proportional Hazards survival analysis, and Apriori association rule mining pipelines; and the debugging of rendering errors encountered during Quarto document compilation. The document structure, YAML configuration, and inline prose were also developed with Claude’s assistance as a drafting and formatting tool.
Independent analytical judgement was exercised in the following areas:
- Case study selection: Case Study 3 was selected as the most appropriate case for my professional context, given direct access to seven years of internal pipeline incident data from OERNL’s HSE management system
- Central analytical question: The framing of the research question and the mapping of each technique to a specific 2026 operational decision were determined independently, based on knowledge of the planning decisions facing the Project Services Department heading into 2026
- Model output interpretation: All model outputs were interpreted in the context of OERNL’s operational reality, including the identification of the Tebidaba and Clough Creek corridors as the priority intervention assets and the translation of the P90 Monte Carlo figure into a specific Board-level budget recommendation
- LDA topic naming: The five LDA topics were named and operationally interpreted based on domain knowledge of Niger Delta pipeline operations, not on algorithmic output alone. The algorithm surfaces word clusters; the operational meaning of those clusters was determined by the author
- Data quality identification: The identification of sparse cost coverage, inconsistent facility categorisation, and cause field granularity as the primary data limitations was based on direct familiarity with the source system and its known recording practices
- Integrated recommendation: The recommendation in Section 10 draws on seven years of direct professional experience in the Project Services Department to connect the analytical outputs to live planning decisions facing the organisation in 2026
All analytical decisions, business interpretations, and operational conclusions are my own.