Pipeline Incident Analytics: Operational Risk, Cost Exposure and Predictive Patterns at a Nigerian Upstream Operator (2019-2025)

Author

Olawale Unimke Olanubi

Published

May 12, 2026

1. Executive Summary

Sabotage accounts for 74% of pipeline incidents at Oando Energy Resources Nigeria Limited, yet the financial and operational consequences have never been systematically quantified. This submission applies five analytical techniques to 1,695 incidents recorded across 446 facilities in Nigeria’s Niger Delta between 2019 and 2025, drawn from OERNL’s internal HSE incident register and Joint Investigation Visit reports, to answer one question: what should the Project Services Department prioritise in 2026?

Key findings across the five techniques:

Text analytics identified deliberate physical interference at flowlines and valve connections as the dominant latent narrative theme, invisible in the structured cause field
Monte Carlo simulation quantified the P90 annual remediation exposure at $18,603,017
Forecasting projected monthly incident volumes with a cross-validation RMSE of 15.41 incidents per month, providing the 2026 emergency response capacity baseline
Survival analysis identified the Tebidaba/Brass and Clough Creek/Tebidaba pipelines as highest-recurrence assets, with median inter-incident times of 2 and 3 days across 130-plus intervals each
Association rules confirmed oil pipeline incidents requiring cleanup predict sabotage with 96.4% confidence and lift of 1.305

Important

Recommendation: Concentrate 2026 integrity inspection, surveillance, and community engagement on SAR-area oil pipelines, with priority on the Tebidaba and Clough Creek corridors, and hold $18.6 million as the Board-level remediation contingency reserve.

2. Professional Disclosure

2.1 My Role and Organisation

I am a Project Services Manager in the Project Services Department (PSD), which sits within the Project Development and Infrastructure (PDI) division of Oando Energy Resources Nigeria Limited (OERNL), a Nigerian upstream oil and gas company operating across exploration, development, and production assets in the Niger Delta.

Organisational structure of the Project Development and Infrastructure division

My role is cross-functional, sitting at the intersection of project delivery and operational risk management. My core responsibilities span four areas:

Project controls and scheduling across OERNL’s active capital and infrastructure projects
Cost management and budget reporting, including inputs to Board-level capital expenditure (CAPEX) reviews
Operational performance reporting, covering project progress, milestone tracking, and variance analysis
Pipeline incident data analysis, used to inform project prioritisation, maintenance scheduling, risk budgeting, and regulatory compliance reporting

The dataset underpinning this submission was compiled directly from OERNL’s internal pipeline incident registers, covering incidents recorded across our operational fields from 2019 to 2025. Analysing this data is not an academic exercise. For me it is a routine input into the decisions my department makes about which infrastructure projects to prioritise and how to size contingency budgets.

2.2 Why These Five Techniques Are Operationally Relevant

Technique 1: Text Analytics and Sentiment Analysis

Each pipeline incident record contains a free-text description field in which field engineers and operations personnel narrate the cause, sequence of events, and immediate corrective action taken. These narrative fields are currently not systematically analysed in our reporting practice. They are read selectively during incident reviews but never mined at scale.

Applying text analytics to this corpus serves three operational purposes:

Pattern extraction: Term Frequency-Inverse Document Frequency (TF-IDF) and bag-of-words models surface the most distinctive terms associated with high-cost or high-frequency incidents
Causal clustering: Latent Dirichlet Allocation (LDA) topic modelling groups incidents by underlying cause type, such as corrosion, third-party interference, or mechanical fatigue, in a way that manual categorisation misses
Scope intelligence: Understanding which failure causes are trending upward changes both what capital projects we initiate and how we rank them in the project pipeline

Note

Operational link: The text analytics output feeds directly into root-cause analysis and capital project scope definition, surfacing failure themes that no structured field in the incident register currently captures.

Technique 2: Monte Carlo Simulation

Capital project cost estimation at OERNL is inherently uncertain. Procurement lead times vary, contractor mobilisation costs fluctuate with foreign exchange movements, and incident remediation costs depend on the severity and duration of the event.

In the Project Services Department, we are required to present 50th percentile (P50) and 80th percentile (P80) cost estimates to management for budgeting and Board reporting purposes. Monte Carlo simulation addresses this in three ways:

Distribution fitting: Historical incident cost data is used to fit probability distributions to each uncertain input, replacing assumptions of normality with evidence-based parameters
Probabilistic output: Running 10,000 simulations produces a full distribution of possible total cost outcomes, from which 10th percentile (P10), P50, and P80 values are extracted
Contingency justification: The gap between P50 and P80 becomes the defensible contingency budget figure presented to the Board, grounded in data rather than rule-of-thumb percentages

Note

Operational link: The simulation output replaces single-point cost assumptions with a probability-weighted reserve figure. The P90 value becomes the Board-level contingency recommendation in the 2026 capital plan.

Technique 3: Advanced Forecasting with Prophet

Production uptime planning and maintenance scheduling at OERNL require forward visibility on incident frequency and associated downtime. With seven years of monthly incident data covering 2019 to 2025, the dataset contains sufficient temporal depth to support a robust forecasting model.

Prophet is the appropriate technique here for three reasons:

Trend and seasonality decomposition: Pipeline incidents in the Niger Delta exhibit seasonal patterns driven by weather, community activity cycles, and regulatory inspection windows, which Prophet captures natively
Actionable horizon: A three to six month forward projection of incident counts or cumulative downtime maps directly onto the lead time required for contractor mobilisation and materials procurement
Communicable uncertainty: Walk-forward cross-validation produces a Root Mean Square Error (RMSE) that I can present to PDI leadership as evidence of forecast reliability, replacing intuition-based planning with quantified confidence intervals

Note

Operational link: The six-month forecast directly determines the 2026 emergency response capacity plan. Manpower levels, containment boom inventory, and recovery vessel deployment are sized against the upper bound of the forecast confidence interval.

Technique 4: Asset Survival Analysis

Not all pipeline segments fail at the same rate or at the same age. Current prioritisation in our department relies heavily on engineering judgement and field reports, which introduces subjectivity into decisions that carry significant capital implications.

Kaplan-Meier survival analysis treats each pipeline segment as a subject and its first recorded failure as the event, producing empirical survival curves that quantify failure probability over time. This technique serves the following operational functions:

Replacement queue prioritisation: Segments with survival probability below a defined threshold within the next 12 months are flagged as candidates for the capital replacement programme
Business case support: Survival curves provide statistically defensible evidence for integrity intervention projects when presenting to Finance and the Board
Segmentation by asset class: Stratifying curves by location, pipeline type, or operating condition reveals which asset categories carry disproportionate risk, enabling targeted rather than blanket maintenance spend

Note

Operational link: The survival output produces a ranked list of facilities by recurrence hazard. The bottom quartile by median inter-incident time becomes the mandatory inclusion list for the 2026 Q1 integrity inspection programme.

Technique 5: Association Rules (Operational Co-occurrence Analysis)

Incident records contain multiple categorical fields including location, cause category, incident type, affected asset class, and response action taken. These fields are currently reviewed in isolation. Association rule mining using the Apriori algorithm examines all combinations simultaneously, revealing which attributes co-occur more frequently than chance would predict.

The operational value of this technique is threefold:

Co-occurrence intelligence: Rules such as “corrosion at Field X and pipeline age above 15 years implies production deferment exceeding 48 hours” with high confidence and lift provide actionable scheduling inputs that no single-variable analysis would surface
Project scope refinement: If a high-lift rule links a specific cause type to a particular asset class, that finding directly informs the scope of the corresponding capital maintenance project
Protocol design: Frequent itemsets linking incident types to response actions allow the department to design more targeted emergency response protocols, reducing average resolution time and cost per incident

Note

Operational link: The top association rules by lift define the specific cause, location, and facility type combinations that warrant elevated security presence and proactive community liaison in the 2026 security and community engagement targeting strategy.

3. Data Collection & Sampling

3.1 Data Source

The dataset for this submission was drawn from two complementary internal records maintained by the Production Health, Safety and Environment (HSE) and Field Operations department at Oando Energy Resources Nigeria Limited (OERNL):

The operator’s HSE incident register: A structured internal log updated by field personnel following every reportable spill, leak, or pipeline breach across all operational assets. Each entry captures incident identity, location, cause, category, spill volumes, deferred production, and remediation cost where applicable
Joint Investigation Visit (JIV) reports: Formal post-incident investigation documents completed by a joint team of operator and regulatory representatives following significant incidents. JIV reports provide the narrative descriptions used in the text analytics technique and supplement quantitative fields in the register with verified cause classifications and cost estimates

Both sources feed into a single consolidated incident management system maintained at the asset level and archived annually in structured Excel workbooks, one per calendar year.

Note

Primary Source: OERNL Production HSE and Field Operations Department. Both the HSE incident register and Joint Investigation Visit reports were accessed directly by the author in the capacity of Project Services Manager.

3.2 Collection Method

The data was personally extracted and compiled by the author directly from OERNL’s internal incident management system in the capacity of Project Services Manager within the Project Development and Infrastructure (PDI) division. The extraction process involved the following steps:

File retrieval: Seven annual Excel workbooks covering 2019 to 2025 were retrieved from the organisational records system, each containing one or more sheets corresponding to reporting periods or asset groupings within that year
Consolidation: All sheets across all seven files were programmatically merged in R using the readxl and purrr packages, with a custom column harmonisation routine applied to reconcile the 28 variables across years where field names had changed between reporting cycles
Standardisation: Date fields were parsed using a multi-format date parser to handle inconsistencies across years. Categorical fields including cause, category, area, and closeby facility were standardised using regular expression matching to collapse variant spellings into canonical values
Validation: The consolidated dataset was cross-checked against known annual totals and spot-checked against source files to confirm row counts and key field integrity

Data collection and consolidation workflow

Note

Collection approach: Direct extraction from the internal HSE system by the author. All seven annual workbooks were personally retrieved, merged, standardised, and validated before analysis.

3.3 Sampling Frame

This study uses a census rather than a sample. The sampling frame comprises every pipeline spill and non-leak incident reported under OERNL’s HSE management system across all operational assets from January 2019 to the most recent completed reporting period in 2025. No probabilistic sampling was applied because the objective is to analyse the full population of operational incidents rather than generalise from a subset.

Key parameters of the sampling frame are summarised below:

Table S1: Sampling Frame Parameters
Parameter	Detail
Population	All pipeline incidents reported under the OERNL HSE management system
Sampling approach	Census — full enumeration of all reportable incidents within scope
Inclusion criteria	All spill and non-leak incidents with a recorded incident date from 2019 to 2025
Exclusion criteria	Incidents with no recorded date, duplicate reference numbers, and test or placeholder rows removed during cleaning
Geographic scope	All OERNL operational assets across Bayelsa, Rivers, Delta, Imo, and Edo States
Temporal scope	January 2019 to December 2025 (seven complete calendar years)
Unit of observation	Individual pipeline incident record

Note

Sampling approach: Census — full enumeration of all reportable incidents within scope. No probabilistic sampling was required because the analytical objective is to characterise the full incident population, not to estimate population parameters from a subset.

3.4 Sample Size and Statistical Rationale

The initial extraction yielded approximately 1,750 records across the seven annual workbooks. Following removal of rows with no incident date, duplicate reference numbers, and structurally empty rows introduced by the source formatting, the cleaned dataset comprises 1,695 incidents across 446 unique facilities.

This sample size comfortably satisfies the minimum requirements for all five analytical techniques applied in this submission:

Table S2: Minimum Data Requirements vs Available Data
Technique	Minimum Required	Available in Dataset	Status
Text Analytics and Sentiment Analysis	50 text documents	1,695 incident description text fields	Met
Monte Carlo Simulation	3 or more uncertain inputs with historical data	1,695 cost and spill volume records across 7 years	Met
Advanced Forecasting with Prophet	36 monthly time periods	84 monthly periods (January 2019 to December 2025)	Met
Asset Survival Analysis	100 or more observations with event variable	1,695 facility-level incident records	Met
Association Rules	Sufficient transactions for rule support thresholds	1,695 multi-attribute incident records	Met

Note

Final sample: 1,695 incidents after cleaning, drawn from an initial extract of approximately 1,750 records across seven annual workbooks. All five technique minimum data requirements are met with significant margin.

3.5 Time Period Covered

The dataset spans seven complete calendar years from January 2019 to December 2025. This temporal range was selected for three reasons:

Analytical sufficiency: 84 monthly periods exceed the minimum 36 periods required by Prophet for reliable trend and seasonality decomposition, and provide sufficient historical depth for walk-forward cross-validation
Operational relevance: The period captures the full post-2019 operating environment at OERNL, including the production disruption effects of the 2020 oil price shock and the subsequent recovery phase, both of which are visible in the incident frequency trend
Regulatory alignment: OERNL’s HSE reporting cycle is structured annually, and the 2019 start point corresponds to the first year for which all seven reporting fields used in this analysis were consistently populated across all assets

Annual incident count 2019 to 2025 — temporal coverage of the dataset

Note

Coverage: January 2019 to December 2025, spanning seven complete calendar years and 84 monthly periods. This temporal depth supports all five analytical techniques including the Prophet forecasting model which requires a minimum of 36 monthly observations.

4. Data Description

The consolidated dataset contains 1,695 pipeline incidents recorded across seven years (2019 to 2025), drawn from 446 unique facilities across five states in Nigeria’s Niger Delta region. The dataset comprises 28 variables capturing incident identity, geography, timing, cause, category, spill volumes, and remediation cost. Key numeric variables show significant right skew: spill volumes range from 0 to 5,054 barrels with a median of just 1 barrel, reflecting the dominance of small-scale theft taps alongside rare catastrophic spills. Cost data is available for only 51 incidents, predominantly from sabotage records in 2019 and 2021, a limitation acknowledged in Section 11.

Geographically, incidents are concentrated in Bayelsa and Rivers States, which together account for over 87% of all records. Sabotage is the dominant incident category at 74% of all cases, with Oil Theft and Hacksaw Cut the two most frequent causes, consistent with the chronic third-party interference challenges facing upstream operators in the Niger Delta. The tables below summarise the key dimensions of the dataset.

Code

state_data <- incidents |>
  dplyr::filter(!is.na(state)) |>
  dplyr::count(state, name = "incidents") |>
  dplyr::mutate(
    state     = forcats::fct_reorder(state, incidents),
    highlight = ifelse(state %in% c("Bayelsa", "Rivers"), "Top 2", "Other")
  )

ggplot(state_data, aes(x = incidents, y = state, fill = highlight)) +
  geom_col(width = 0.6) +
  geom_text(aes(label = scales::comma(incidents)),
            hjust = -0.2, size = 3.5, fontface = "bold") +
  scale_fill_manual(values = c("Top 2" = "#2c7fb8", "Other" = "#a8c8e8")) +
  scale_x_continuous(expand = expansion(mult = c(0, 0.15))) +
  labs(
    title    = "Pipeline Incidents by State (2019 to 2025)",
    subtitle = "Bayelsa and Rivers States account for over 87% of all recorded incidents",
    x        = "Number of Incidents",
    y        = NULL,
    fill     = NULL
  ) +
  theme_minimal(base_size = 12) +
  theme(
    legend.position    = "bottom",
    plot.title         = element_text(face = "bold"),
    panel.grid.major.y = element_blank()
  )

Incident count by state — Bayelsa and Rivers together account for over 87% of all records

Code

category_data <- incidents |>
  dplyr::filter(!is.na(category)) |>
  dplyr::count(category, name = "count") |>
  dplyr::mutate(
    pct       = count / sum(count),
    category  = forcats::fct_reorder(category, count),
    highlight = ifelse(category == "Sabotage", "Sabotage", "Other")
  )

ggplot(category_data, aes(x = count, y = category, fill = highlight)) +
  geom_col(width = 0.6) +
  geom_text(aes(label = paste0(scales::comma(count),
                               "  (", scales::percent(pct, accuracy = 1), ")")),
            hjust = -0.1, size = 3.2, fontface = "bold") +
  scale_fill_manual(values = c("Sabotage" = "#e05c2a", "Other" = "#a8c8e8")) +
  scale_x_continuous(expand = expansion(mult = c(0, 0.25))) +
  labs(
    title    = "Incident Category Distribution (2019 to 2025)",
    subtitle = "Sabotage dominates, comprising the majority of all 1,695 recorded incidents",
    x        = "Number of Incidents",
    y        = NULL,
    fill     = NULL
  ) +
  theme_minimal(base_size = 12) +
  theme(
    legend.position    = "bottom",
    plot.title         = element_text(face = "bold"),
    panel.grid.major.y = element_blank()
  )

Incident category breakdown — Sabotage accounts for 74% of all cases

Code

ggplot(incidents |> dplyr::filter(!is.na(qty_spilled_bbl) & qty_spilled_bbl > 0),
       aes(x = qty_spilled_bbl)) +
  geom_histogram(bins = 60, fill = "#2c7fb8", colour = "white", linewidth = 0.3) +
  geom_vline(xintercept = 1, colour = "#e05c2a", linetype = "dashed", linewidth = 0.8) +
  annotate("text", x = 80, y = Inf, vjust = 1.5,
           label = "Median = 1 barrel", colour = "#e05c2a",
           fontface = "italic", size = 3.5) +
  scale_x_continuous(
    labels = scales::comma,
    trans  = "log1p",
    breaks = c(0, 1, 10, 100, 1000, 5000)
  ) +
  labs(
    title    = "Distribution of Spill Volumes (Non-Zero Incidents)",
    subtitle = "Log scale reveals extreme right skew — median of 1 barrel alongside spills exceeding 5,000 barrels",
    x        = "Spill Volume (Barrels, log scale)",
    y        = "Number of Incidents"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    plot.title       = element_text(face = "bold"),
    panel.grid.minor = element_blank()
  )

Spill volume distribution — heavy right skew driven by rare catastrophic events

Table 1: Incidents by Year (2019 to 2025)
Year	Incidents
2019	480
2020	174
2021	287
2022	232
2023	159
2024	238
2025	125

Table 2: Incidents by Category
Category	Incidents
Sabotage	1252
Operational Control	121
Other	51

Table 3: Top 15 Incidents by Category and Cause
Category	Cause	Incidents
Sabotage	Oil Theft	561
Sabotage	Hacksaw Cut	421
Sabotage	Vandalization	89
Operational Control	Equipment Failure	77
Sabotage	Drilled Hole	70
Sabotage	Use Of Explosive	48
Operational Control	Corrosion	38
Sabotage	Oiltheft	19
Other	False Alarm	16
Other	No Spill	16
Sabotage	Sabotage	14
Other	Others	7
Sabotage	Hacksaw Cut & Explosive	6
Other	Facility Submerged	5
Sabotage	Others	4

Table 4: Top States by Incident Count
State	Incidents
Bayelsa	905
Rivers	644
Delta	63
Imo	32

Table 5: Top 8 Facility Types
Facility Type	Incidents
Pipeline (Oil)	440
Flowline	265
Well Head	24
Pipeline (Gas)	12
Manifold	5
Tank	4
Flowstation	3
Sbm	2

Table 6: Incidents by Area (SAR and LAR)
Area	Incidents
LAR	826
SAR	818

Table 7: Top 15 Incidents by Area and Closeby Facility
Area	Closeby Facility	Incidents
SAR	TEBIDABA	424
LAR	OB/OB	389
LAR	OSHIE	203
SAR	CLOUGH CREEK	135
SAR	BRASS TERMINAL	86
LAR	KWALE/OKPAI	49
SAR	OGBOINBIRI	47
SAR	OBAMA	39
LAR	IDU	29
LAR	EOC/EBOCHA	25
LAR	AKRI	18
LAR	IRRI	7
SAR	BENIBOYE	2
LAR	OZOCHI RISER	1
LAR	PHC BASE	1

4.1 Geographic Distribution of Incidents

The geographic concentration of incidents is one of the most operationally significant features of the dataset. With 88.1% of records carrying facility-level coordinates, the dataset supports direct spatial visualisation of where incidents occur, how they cluster by category, and whether the geographic pattern has shifted between 2019 and 2025. The interactive map below allows each incident to be examined individually. The faceted static map provides a year-by-year comparison of the spatial distribution across the full seven-year period.

Code

parse_dms <- function(x) {
  x       <- as.character(x)
  x       <- stringr::str_trim(x)
  hemi    <- stringr::str_extract(x, "^[NSEW]")
  degrees <- as.numeric(stringr::str_extract(x, "(?<=[NSEW]\\s)\\d+(?=°)"))
  minutes <- as.numeric(stringr::str_extract(x, "(?<=°\\s)\\d+(?=')"))
  seconds <- as.numeric(stringr::str_extract(x, "(?<='\\s)\\d+\\.?\\d*(?='')"))
  decimal <- degrees + minutes / 60 + seconds / 3600
  decimal <- ifelse(hemi %in% c("S", "W"), -decimal, decimal)
  decimal
}

incidents_mapped <- incidents %>%
  filter(!is.na(lat_raw), !is.na(long_raw)) %>%
  mutate(
    lat  = parse_dms(lat_raw),
    long = parse_dms(long_raw)
  ) %>%
  filter(
    !is.na(lat), !is.na(long),
    lat  > 3 & lat  < 7,
    long > 4 & long < 9
  )

Code

library(leaflet)
library(leaflet.extras)

cat_levels <- incidents_mapped %>%
  count(category, sort = TRUE) %>%
  pull(category)

pal <- colorFactor(
  palette = c("#e05c2a", "#2c7fb8", "#27ae60", "#9b59b6",
              "#e67e22", "#1abc9c", "#e74c3c", "#95a5a6"),
  domain  = cat_levels
)

incidents_mapped <- incidents_mapped %>%
  mutate(
    radius = case_when(
      is.na(qty_spilled_bbl) | qty_spilled_bbl == 0 ~ 4,
      qty_spilled_bbl < 10   ~ 5,
      qty_spilled_bbl < 100  ~ 7,
      qty_spilled_bbl < 500  ~ 10,
      TRUE                   ~ 14
    ),
    popup_text = glue::glue(
      "<b>Facility:</b> {facility}<br>",
      "<b>Date:</b> {format(date_incident, '%d %b %Y')}<br>",
      "<b>Category:</b> {category}<br>",
      "<b>Cause:</b> {cause}<br>",
      "<b>Spill Volume:</b> {ifelse(is.na(qty_spilled_bbl), 'Not recorded',
        paste0(round(qty_spilled_bbl, 1), ' bbls'))}<br>",
      "<b>State:</b> {state}<br>",
      "<b>Area:</b> {area}"
    )
  )

leaflet(incidents_mapped) %>%
  addProviderTiles(providers$CartoDB.Positron) %>%
  addCircleMarkers(
    lng            = ~long,
    lat            = ~lat,
    radius         = ~radius,
    color          = ~pal(category),
    fillColor      = ~pal(category),
    fillOpacity    = 0.7,
    stroke         = TRUE,
    weight         = 0.8,
    opacity        = 0.9,
    popup          = ~popup_text,
    clusterOptions = markerClusterOptions(
      showCoverageOnHover = FALSE,
      zoomToBoundsOnClick = TRUE,
      maxClusterRadius    = 40
    )
  ) %>%
  addLegend(
    position = "bottomright",
    pal      = pal,
    values   = ~category,
    title    = "Incident Category",
    opacity  = 0.9
  ) %>%
  addScaleBar(position = "bottomleft") %>%
  setView(lng = 6.2, lat = 4.9, zoom = 8)

Geographic distribution of pipeline incidents (2019 to 2025) — click any point for full incident details

Code

incidents_mapped %>%
  filter(!is.na(year)) %>%
  ggplot(aes(x = long, y = lat, colour = category)) +
  geom_point(alpha = 0.5, size = 0.9) +
  facet_wrap(~ year, ncol = 4) +
  scale_colour_manual(
    values = c(
      "Sabotage"            = "#e05c2a",
      "Operational Control" = "#2c7fb8",
      "TBD"                 = "#95a5a6"
    ),
    na.value = "#bdc3c7"
  ) +
  coord_fixed(ratio = 1) +
  labs(
    title    = "Geographic Distribution of Pipeline Incidents by Year (2019 to 2025)",
    subtitle = "Each point represents one incident — colour indicates category",
    x        = "Longitude",
    y        = "Latitude",
    colour   = "Category"
  ) +
  theme_minimal(base_size = 10) +
  theme(
    plot.title       = element_text(face = "bold"),
    legend.position  = "bottom",
    panel.grid.minor = element_blank(),
    axis.text        = element_text(size = 7)
  )

Annual incident distribution by geography (2019 to 2025) — year-by-year comparison of spatial concentration

4.2 Analytical Framework and Research Question

4.2.1 The Central Analytical Question

The OERNL pipeline incident dataset represents seven years of operational history across the Niger Delta, covering 1,695 recorded incidents at 446 unique facilities. In its raw form, this data is already used for regulatory reporting and periodic HSE reviews. What it has not been used for is systematic, evidence-based planning that moves beyond counting incidents and begins to answer the question of what the organisation should do differently in the year ahead.

As Project Services Manager, the decisions I face heading into the 2026 annual planning cycle span budget sizing, maintenance prioritisation, emergency response capacity, and security targeting. Each of these decisions is currently made on the basis of experience, engineering judgement, and partial data. This submission tests whether a structured analytical approach to the same dataset can produce more defensible, more precise, and more actionable answers.

The central question driving this analysis is:

Important

“What operational and risk management decisions should OERNL’s Project Services Department prioritise in 2026, based on a systematic analysis of the company’s 2019 to 2025 pipeline incident portfolio?”

Specifically:

(a) Root-cause intelligence: What latent failure themes are embedded in incident narrative descriptions that the structured cause field does not capture, and how should these themes reprioritise the 2026 operational risk register?
(b) Financial exposure: What cleanup and remediation budget reserve should be recommended at P90 confidence level, accounting for the full uncertainty in historical cost and production deferral data?
(c) Incident volume outlook: What is the projected monthly incident volume over the next six months, and what does this imply for 2026 emergency response capacity in terms of manpower, containment equipment, and recovery vessel deployment?
(d) Facility risk prioritisation: Which facilities exhibit the shortest expected time to next incident, and which should therefore be elevated to the top quartile of the 2026 integrity inspection schedule?
(e) Sabotage pattern detection: Which combinations of cause type, location, and facility category are most strongly associated with sabotage classification, and how should these combinations inform the 2026 security and community engagement targeting strategy?

This question is not resolvable by a single technique. It requires five distinct analytical lenses, each addressing a specific dimension of the 2026 planning problem. The answers are not independent. They are designed to converge in Section 10 into a single, integrated operational recommendation.

4.2.2 Technique-to-Decision Mapping

The five sub-questions that structure this analysis each correspond to a concrete planning decision facing the Project Services Department. The table below makes this mapping explicit and defines the standard against which all outputs in Sections 5 to 9 should be read.

Table F1: Analytical Framework — Technique to Decision Mapping
Section	Technique	Sub-Question	2026 Operational Decision
Section 5	Text Analytics and Sentiment Analysis	(a)	Surface latent root-cause themes in incident narrative fields that are missed by the structured cause field. Outputs feed directly into the operational risk register prioritisation for 2026
Section 6	Monte Carlo Simulation	(b)	Recommend the cleanup and remediation budget reserve at the 90th percentile (P90) confidence level, including production deferral exposure, for inclusion in the 2026 capital plan
Section 7	Advanced Forecasting with Prophet	(c)	Generate a six-month forward forecast of monthly incident volume to size the 2026 emergency response capacity — manpower, containment boom inventory, and recovery vessel deployment
Section 8	Asset Survival Analysis	(d)	Identify the top quartile of facilities by hazard rate (shortest expected time to next incident) for prioritised integrity inspection scheduling in the 2026 maintenance programme
Section 9	Association Rules	(e)	Find cause, location, and facility category rule combinations with high lift on Sabotage classification. Outputs feed the security and community engagement targeting strategy

4.2.3 Structure of the Analytical Sections

The five technique sections that follow are written as a professional analytical report, not as a sequence of independent exercises. Each section is organised into four parts to support comparability across techniques and to ensure that every output is connected to a business decision:

Technique overview — a concise explanation of the method, its statistical foundations, and the conditions under which it is appropriately applied. This part establishes that the technique is being applied correctly and with an understanding of its assumptions and limitations.

Operational justification — an explanation of why this technique is the right tool for the specific business question it addresses, given the structure and characteristics of the OERNL incident dataset. This part demonstrates that the choice of technique was deliberate and data-driven, not arbitrary.

Analysis and outputs — fully reproducible R and Python implementations presented in parallel tabsets, with all charts, model outputs, and diagnostic results rendered inline. Code is visible and structured for reproducibility and all outputs are captioned and interpreted.

Management interpretation — a plain-language reading of the key findings, written for a non-technical operations manager or Board member. Every result is restated in terms of what it means for a specific 2026 planning decision, without reference to statistical jargon. This is the part that connects analytical output to organisational action.

The integrated findings in Section 10 draw all five outputs together into a single prioritised recommendation for the 2026 operating plan, addressing the central analytical question in full.

5. Technique 1: Text Analytics and Sentiment Analysis

5.1 Technique Overview

Text analytics, as described by Marr (2017), is the process of extracting value from large quantities of unstructured text data, transforming narrative content that does not fit neatly into rows and columns into structured, queryable intelligence. The technique encompasses five core tasks: text categorisation, text clustering, concept extraction, sentiment analysis, and document summarisation. In this analysis, three of these tasks are applied:

Term Frequency-Inverse Document Frequency (TF-IDF) for concept extraction
Latent Dirichlet Allocation (LDA) for topic clustering
AFINN and VADER (Valence Aware Dictionary and Sentiment Reasoner) lexicon scoring for sentiment analysis

The statistical foundations of these methods rest on two principles. TF-IDF weights a term by how frequently it appears in a document relative to how rarely it appears across all documents — terms that are distinctive to a subset of records score highly, while common words score near zero. LDA is a generative probabilistic model that assumes each document is a mixture of latent topics, and each topic is a distribution over words; the algorithm infers both distributions simultaneously from the observed word counts. Sentiment scoring assigns polarity values to words or phrases using pre-built lexicons. AFINN assigns integer scores from negative five to positive five to over 2,400 English words, while VADER is specifically calibrated for short, informal text and returns a compound score between negative one and positive one.

The key assumption shared by all three methods is the bag-of-words representation: word order is discarded and only word frequency is retained. This is a simplification that loses grammatical structure but is computationally tractable and well-suited to the short, formulaic language typical of incident reports.

5.2 Operational Justification

The OERNL incident dataset contains a free-text description field populated by field engineers and operations personnel at the time of each incident. This field is not currently analysed systematically. It is reviewed selectively during post-incident investigations but never mined across the full 1,695-record corpus.

Three specific gaps in the existing structured fields make text analytics necessary here:

The cause field is a post-hoc classification, assigned after investigation and subject to categorisation bias. The description field captures the raw narrative before any classification decision is made, and may therefore contain causal signals that the structured field obscures or aggregates away
LDA can surface latent themes that do not correspond to any existing category in the cause or category fields. Operational failures that are described differently across years but share an underlying mechanism are only visible at the corpus level
Sentiment trends in incident language provide a proxy for the severity and urgency of incidents over time that no numeric field in the dataset captures directly. A shift toward more negative, blame-heavy language may signal deteriorating operational conditions before they appear in cost or volume statistics

The output of this section feeds directly into sub-question (a) of the central analytical question: what latent root-cause themes should reprioritise the 2026 operational risk register?

5.3 Analysis and Outputs

Code

# ── Step 1: Build tidy token dataset ─────────────────────────────
incident_tokens <- incidents %>%
  filter(!is.na(description), nchar(description) > 15) %>%
  mutate(doc_id = row_number()) %>%
  dplyr::select(doc_id, year, category, description) %>%
  unnest_tokens(word, description) %>%
  anti_join(stop_words, by = "word") %>%
  filter(!str_detect(word, "^\\d+$"))

cat("Total tokens after cleaning:", nrow(incident_tokens), "\n")

Total tokens after cleaning: 5436

Code

cat("Unique documents with descriptions:", n_distinct(incident_tokens$doc_id), "\n")

Unique documents with descriptions: 861

Code

# ── Step 2: TF-IDF — top distinctive terms by category ───────────
tfidf <- incident_tokens %>%
  count(category, word, sort = TRUE) %>%
  bind_tf_idf(word, category, n) %>%
  arrange(desc(tf_idf))

top_tfidf <- tfidf %>%
  filter(!category %in% c("TBD", "Unknown")) %>%
  group_by(category) %>%
  slice_max(tf_idf, n = 8, with_ties = FALSE) %>%
  ungroup()

top_tfidf %>%
  mutate(word = reorder_within(word, tf_idf, category)) %>%
  ggplot(aes(tf_idf, word, fill = category)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ category, scales = "free", ncol = 2) +
  scale_y_reordered() +
  scale_fill_brewer(palette = "Set2") +
  labs(
    title    = "Most Distinctive Terms in Incident Descriptions by Category",
    subtitle = "TF-IDF weighting surfaces terms unique to each incident category",
    x        = "TF-IDF Score",
    y        = NULL
  ) +
  theme_minimal(base_size = 11) +
  theme(plot.title = element_text(face = "bold"))

Code

# ── Step 3: Document-term matrix for LDA ─────────────────────────
doc_term <- incident_tokens %>%
  count(doc_id, word) %>%
  cast_dtm(doc_id, word, n)

cat("DTM dimensions:", dim(doc_term), "\n")

DTM dimensions: 861 282

Code

# ── Step 4: Coherence-based K selection ──────────────────────────
k_search <- FindTopicsNumber(
  doc_term,
  topics  = 2:8,
  metrics = c("CaoJuan2009", "Deveaud2014"),
  control = list(seed = 42),
  verbose = FALSE
)

FindTopicsNumber_plot(k_search)

Code

# ── Step 5: Fit LDA at optimal K ─────────────────────────────────
best_k <- 5
lda_model <- LDA(doc_term, k = best_k, control = list(seed = 42))

topic_terms <- tidy(lda_model, matrix = "beta") %>%
  group_by(topic) %>%
  slice_max(beta, n = 10, with_ties = FALSE) %>%
  ungroup()

topic_labels <- c(
  "1" = "Topic 1: Pipeline Breach\n& Physical Damage",
  "2" = "Topic 2: Production\nDeferral Impact",
  "3" = "Topic 3: Third-Party\nInterference",
  "4" = "Topic 4: Spill Response\n& Remediation",
  "5" = "Topic 5: Corrosion\n& Integrity Failure"
)

topic_terms %>%
  mutate(
    topic = factor(paste0("Topic ", topic), levels = paste0("Topic ", 1:best_k)),
    term  = reorder_within(term, beta, topic)
  ) %>%
  ggplot(aes(beta, term, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free", ncol = 3) +
  scale_y_reordered() +
  scale_fill_brewer(palette = "Set2") +
  labs(
    title    = "LDA Topic Model: Top Terms per Topic",
    subtitle = glue::glue("K = {best_k} topics selected via CaoJuan2009 and Deveaud2014 coherence metrics"),
    x        = "Per-Topic Word Probability (Beta)",
    y        = NULL
  ) +
  theme_minimal(base_size = 10) +
  theme(plot.title = element_text(face = "bold"))

Code

# ── Step 6: Sentiment trend via AFINN ────────────────────────────
sentiment_by_year <- incidents %>%
  filter(!is.na(description), nchar(description) > 15) %>%
  mutate(sentiment = syuzhet::get_sentiment(description, method = "afinn")) %>%
  group_by(year) %>%
  summarise(
    mean_sent   = mean(sentiment, na.rm = TRUE),
    median_sent = median(sentiment, na.rm = TRUE),
    n           = n(),
    .groups     = "drop"
  )

ggplot(sentiment_by_year, aes(x = year)) +
  geom_hline(yintercept = 0, linetype = "dashed", colour = "grey60") +
  geom_line(aes(y = mean_sent), linewidth = 1.1, colour = "#2c7fb8") +
  geom_point(aes(y = mean_sent, size = n), colour = "#2c7fb8") +
  geom_text(aes(y = mean_sent, label = round(mean_sent, 2)),
            vjust = -1, size = 3.2, colour = "#2c7fb8", fontface = "bold") +
  scale_x_continuous(breaks = 2019:2025) +
  scale_size_continuous(name = "Incident count", range = c(2, 6)) +
  labs(
    title    = "Annual Mean Sentiment Tone of Incident Descriptions (AFINN)",
    subtitle = "Negative scores indicate stronger blame, urgency, or severity language in field narratives",
    y        = "Mean AFINN Sentiment Score",
    x        = NULL
  ) +
  theme_minimal(base_size = 12) +
  theme(
    plot.title       = element_text(face = "bold"),
    panel.grid.minor = element_blank()
  )

Code

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.ticker as mticker
import seaborn as sns
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

incidents_py = r.incidents_py

docs = (incidents_py
        .dropna(subset=["description"])
        .query("description.str.len() > 15", engine="python")
        .copy())

print(f"Documents available for text analysis: {len(docs):,}")

Documents available for text analysis: 861

Code

# ── Step 1: TF-IDF vectorisation ─────────────────────────────────
vec = TfidfVectorizer(
    stop_words  = "english",
    ngram_range = (1, 2),
    max_df      = 0.90,
    min_df      = 3
)
X            = vec.fit_transform(docs["description"].astype(str))
feature_names = vec.get_feature_names_out()

mean_tfidf = pd.Series(X.mean(axis=0).A1, index=feature_names)
top20      = mean_tfidf.sort_values(ascending=False).head(20).reset_index()
top20.columns = ["term", "mean_tfidf"]

fig, ax = plt.subplots(figsize=(9, 6))
sns.barplot(data=top20, x="mean_tfidf", y="term",
            palette="Blues_r", ax=ax)
ax.set_title("Top 20 Distinctive Terms Across All Incident Descriptions\n(TF-IDF, unigrams and bigrams)",
             fontsize=12, fontweight="bold")
ax.set_xlabel("Mean TF-IDF Score")
ax.set_ylabel(None)
plt.tight_layout()
plt.show()

Code

# ── Step 2: LDA topic model ───────────────────────────────────────
lda_py = LatentDirichletAllocation(
    n_components = 5,
    random_state = 42,
    max_iter     = 20
)
lda_py.fit(X)

LatentDirichletAllocation(max_iter=20, n_components=5, random_state=42)

In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

Code

fig, axes = plt.subplots(2, 3, figsize=(14, 8))
axes = axes.flatten()

topic_names = [
    "Pipeline Breach & Physical Damage",
    "Production Deferral Impact",
    "Third-Party Interference",
    "Spill Response & Remediation",
    "Corrosion & Integrity Failure"
]

for i, (comp, ax) in enumerate(zip(lda_py.components_, axes[:5])):
    top_idx   = comp.argsort()[-10:][::-1]
    top_terms = [feature_names[j] for j in top_idx]
    top_vals  = comp[top_idx]
    sns.barplot(x=top_vals, y=top_terms, palette="Set2", ax=ax)
    ax.set_title(f"Topic {i+1}: {topic_names[i]}",
                 fontsize=9, fontweight="bold")
    ax.set_xlabel("Component Weight")
    ax.set_ylabel(None)

axes[5].set_visible(False)
plt.suptitle("LDA Topic Model: Top Terms per Topic (Python, sklearn)",
             fontsize=12, fontweight="bold", y=1.01)
plt.tight_layout()
plt.show()

Code

# ── Step 3: VADER sentiment trend ────────────────────────────────
sia  = SentimentIntensityAnalyzer()
docs = docs.assign(
    compound = docs["description"].astype(str).map(
        lambda s: sia.polarity_scores(s)["compound"]
    )
)

trend = (docs.groupby("year")["compound"]
         .agg(["mean", "count"])
         .reset_index()
         .rename(columns={"mean": "mean_compound", "count": "n"}))

fig, ax = plt.subplots(figsize=(9, 5))
ax.axhline(0, linestyle="--", color="grey", linewidth=0.8)
ax.plot(trend["year"], trend["mean_compound"],
        marker="o", linewidth=1.8, color="#2c7fb8")
for _, row in trend.iterrows():
    ax.annotate(f"{row['mean_compound']:.2f}",
                (row["year"], row["mean_compound"]),
                textcoords="offset points", xytext=(0, 8),
                fontsize=9, color="#2c7fb8", fontweight="bold")

Text(0, 8, '-0.13')
Text(0, 8, '-0.09')
Text(0, 8, '-0.15')
Text(0, 8, '-0.19')
Text(0, 8, '-0.17')
Text(0, 8, '-0.13')
Text(0, 8, '-0.05')

Code

ax.set_title("Annual Mean VADER Sentiment Score of Incident Descriptions",
             fontsize=12, fontweight="bold")

Text(0.5, 1.0, 'Annual Mean VADER Sentiment Score of Incident Descriptions')

Code

ax.set_xlabel(None)

Text(0.5, 0, '')

Code

ax.set_ylabel("Mean VADER Compound Score")

Text(0, 0.5, 'Mean VADER Compound Score')

Code

ax.xaxis.set_major_locator(mticker.MultipleLocator(1))
plt.tight_layout()
plt.show()

Code

print("\nSentiment trend by year:")


Sentiment trend by year:

Code

print(trend.to_string(index=False))

 year  mean_compound   n
 2019      -0.131294 305
 2020      -0.091726  58
 2021      -0.148253 156
 2022      -0.189747 119
 2023      -0.174168  66
 2024      -0.125047 124
 2025      -0.053721  33

5.4 Management Interpretation

What the TF-IDF analysis reveals

The TF-IDF analysis surfaces the words that most distinctively characterise each incident category. For the Sabotage category, terms relating to physical breach, cutting, and tapping dominate. This distinction matters for response planning:

Language associated with mechanically sophisticated interventions implies organised activity requiring security engagement
Language associated with opportunistic theft taps implies community-level economic desperation requiring social intervention

The structured cause field collapses both patterns into the same category. The text field separates them, providing the basis for a more targeted response strategy in 2026.

What the topic model reveals

The five topics extracted by the LDA model do not map neatly onto the five categories in the structured cause field. This is the key finding of this section. Three observations are worth highlighting:

The model identifies a topic characterised by spill response and remediation language that cuts across multiple cause categories. Incidents of very different types trigger the same operational response vocabulary, suggesting that response protocols are cause-agnostic in practice even when the underlying causes differ significantly
The third-party interference topic clusters language around physical installation and cutting activity at valve points and flowlines — a pattern consistent with the dominant sabotage mechanism identified in Section 4 but now confirmed from unstructured text independently of the structured cause field
For the 2026 risk register, response capacity planning should be sized by expected response burden per topic cluster, not by incident count alone

What the sentiment trend reveals

The AFINN and VADER sentiment scores measure the emotional tone of the language field engineers use when describing incidents. A consistently negative score across all years is expected given that these are incident reports. What matters operationally is the direction of the trend over time:

A worsening trend toward more negative scores signals that field personnel are describing incidents in increasingly severe or urgent terms
This may be an early indicator of deteriorating operational conditions that has not yet surfaced in the quantitative fields such as cost or volume
The direction and magnitude of the trend should inform whether the 2026 emergency response budget should be held flat or increased relative to 2025

Recommendation feeding into Section 10

The text analysis supports three specific inputs to the 2026 operational risk register:

The LDA topic most associated with third-party interference should be mapped to the facilities with the highest topic prevalence scores, and those facilities elevated in the security engagement programme
The response-and-remediation topic should be used to size boom and vessel inventory independent of cause category
Any year-on-year worsening in sentiment score should trigger a formal review of emergency response protocols before the annual operating plan is finalised

6. Technique 2: Monte Carlo Simulation

6.1 Technique Overview

Monte Carlo simulation is a mathematical risk assessment technique that approximates the probability of uncertain outcomes by running thousands of computerised simulations of random variables (Marr, 2017). Rather than producing a single-point estimate, which implies a false precision, the technique substitutes probability distributions for each uncertain input and repeats the calculation thousands of times, each run drawing randomly from those distributions. The result is not one answer but a full distribution of possible outcomes, from which percentile values can be extracted to communicate risk in probabilistic terms.

The technique is especially well-suited to the oil and gas sector, where cost estimation is inherently uncertain and the consequences of under-budgeting are severe (Marr, 2017). Three probability distributions are applied in this analysis, each chosen to match the statistical properties of the underlying data:

Poisson distribution for annual incident count: The Poisson distribution models the number of events occurring in a fixed time interval when events are independent and arrive at a constant average rate. Annual pipeline incident counts satisfy these conditions: incidents across different facilities are broadly independent, and the historical mean count provides a stable rate parameter (lambda)
Lognormal distribution for per-incident spill volume: Spill volumes are strictly positive and exhibit strong right skew, with a median of 1 barrel but a maximum exceeding 5,000 barrels. The lognormal distribution, obtained by fitting a normal distribution to the log-transformed volumes, accommodates this skew naturally and prevents the simulation from generating negative or implausibly large values
Gamma distribution for per-incident remediation cost: Remediation costs are also strictly positive and right-skewed, but with a heavier tail than spill volumes. The Gamma distribution, parameterised by shape and rate, is the standard choice for modelling positive continuous quantities with this profile and is widely used in insurance and project risk modelling

The simulation runs 10,000 iterations. In each iteration:

An annual incident count is drawn from the Poisson distribution
For each simulated incident, a spill volume and a remediation cost are independently drawn from their respective fitted distributions
The annual totals are summed

The resulting 10,000 annual cost estimates form an empirical distribution from which 10th percentile (P10), 50th percentile (P50), and 90th percentile (P90) values are extracted.

6.2 Operational Justification

The Project Services Department is required to submit risk-adjusted cost estimates to management for inclusion in the annual operating plan and Board reporting pack. The current practice relies on historical averages and engineering judgement to size the remediation budget reserve — a single-point estimate that provides no information about the probability of that estimate being correct or the magnitude of potential overrun.

Monte Carlo simulation addresses this gap directly in three ways:

Evidence-based distributions: Rather than assuming costs follow a normal distribution, the simulation fits distributions to the actual historical data. A lognormal is used for spill volumes and a Gamma for remediation costs, ensuring the model reflects the true shape of OERNL’s cost exposure including the heavy right tail driven by rare catastrophic spills
Defensible budget reserve: The P90 output answers the specific question management needs before the plan is submitted: what single budget figure, if held in reserve, would be sufficient in 90 out of every 100 years? This is a more honest and more useful number than a point estimate
Tornado sensitivity analysis: By re-running the simulation with each input held fixed at its mean while the others vary, a tornado chart identifies which uncertain input drives the most variance in the outcome. This tells the department where tighter data collection or risk controls would have the greatest impact on forecast accuracy

This section addresses sub-question (b) of the central analytical question: what cleanup and remediation budget reserve should be recommended at P90 confidence level for the 2026 capital plan?

6.3 Analysis and Outputs

Code

# ── Step 1: Extract fitted input data ────────────────────────────
qty <- incidents$qty_spilled_bbl[
  !is.na(incidents$qty_spilled_bbl) & incidents$qty_spilled_bbl > 0
]

cost <- incidents$cost_usd[
  !is.na(incidents$cost_usd) & incidents$cost_usd > 0
]

incidents_per_year <- incidents %>%
  count(year) %>%
  pull(n)

# ── Step 2: Fit distributions ─────────────────────────────────────
fit_qty  <- fitdist(log(qty), "norm")
fit_cost <- fitdist(cost, "gamma", lower = c(0, 0))
lambda   <- mean(incidents_per_year)

# ── Step 2b: Distribution fitting summary table ───────────────────
tibble::tibble(
  Parameter = c(
    "Non-zero spill volumes for fitting",
    "Non-zero cost records for fitting",
    "Poisson lambda (mean annual incidents)",
    "Lognormal mean (log scale)",
    "Lognormal SD (log scale)",
    "Gamma shape",
    "Gamma rate"
  ),
  Value = c(
    length(qty),
    length(cost),
    round(lambda, 1),
    round(fit_qty$estimate["mean"], 4),
    round(fit_qty$estimate["sd"], 4),
    round(fit_cost$estimate["shape"], 4),
    round(fit_cost$estimate["rate"], 6)
  )
) %>%
  kableExtra::kable(caption = "Distribution Fitting Summary — Monte Carlo Inputs") %>%
  kableExtra::kable_styling(
    bootstrap_options = c("striped", "condensed"),
    full_width = FALSE
  ) %>%
  kableExtra::column_spec(1, bold = TRUE)

Distribution Fitting Summary — Monte Carlo Inputs
Parameter	Value
Non-zero spill volumes for fitting	996.000000
Non-zero cost records for fitting	51.000000
Poisson lambda (mean annual incidents)	242.100000
Lognormal mean (log scale)	0.468300
Lognormal SD (log scale)	2.653100
Gamma shape	28.802300
Gamma rate	0.000407

Code

# ── Step 3: Plot fitted distributions against empirical data ──────
par(mfrow = c(1, 2))
denscomp(fit_qty,  main = "Spill Volume: Lognormal Fit (log scale)")
denscomp(fit_cost, main = "Remediation Cost: Gamma Fit")

Code

par(mfrow = c(1, 1))

# ── Step 4: Run Monte Carlo simulation — 10,000 iterations ───────
set.seed(42)
N <- 10000

sim <- replicate(N, {
  k <- rpois(1, lambda = lambda)
  if (k == 0) return(c(spill_bbl = 0, cost_usd = 0))
  spills <- exp(rnorm(k,
                      mean = fit_qty$estimate["mean"],
                      sd   = fit_qty$estimate["sd"]))
  costs  <- rgamma(k,
                   shape = fit_cost$estimate["shape"],
                   rate  = fit_cost$estimate["rate"])
  c(spill_bbl = sum(spills), cost_usd = sum(costs))
}) %>%
  t() %>%
  as_tibble()

# ── Step 5: P10 / P50 / P90 summary table ────────────────────────
q <- quantile(sim$cost_usd, c(0.10, 0.50, 0.90))

percentile_table <- tibble(
  Percentile    = c("P10 (Optimistic)", "P50 (Most Likely)", "P90 (Conservative Reserve)"),
  Annual_Cost   = scales::dollar(q),
  Interpretation = c(
    "In 1-in-10 low-incident years the total bill stays below this figure",
    "Half of all simulated years produce costs below this — the central estimate",
    "Hold this amount in reserve: costs exceed this in only 1 in 10 years"
  )
)

kableExtra::kable(percentile_table,
                  col.names = c("Percentile", "Annual Cost (USD)", "Plain-Language Meaning"),
                  caption   = "Table MC1: Monte Carlo Simulation — P10 / P50 / P90 Cost Outcomes") %>%
  kableExtra::kable_styling(
    bootstrap_options = c("striped", "hover", "condensed"),
    full_width        = TRUE
  ) %>%
  kableExtra::column_spec(3, italic = TRUE)

Table MC1: Monte Carlo Simulation — P10 / P50 / P90 Cost Outcomes
Percentile	Annual Cost (USD)	Plain-Language Meaning
P10 (Optimistic)	$15,710,962	In 1-in-10 low-incident years the total bill stays below this figure
P50 (Most Likely)	$17,136,316	Half of all simulated years produce costs below this — the central estimate
P90 (Conservative Reserve)	$18,603,017	Hold this amount in reserve: costs exceed this in only 1 in 10 years

Code

# ── Step 6: Histogram of simulated annual costs ───────────────────
ggplot(sim, aes(x = cost_usd)) +
  geom_histogram(bins = 80, fill = "#2c7fb8", colour = "white",
                 linewidth = 0.2) +
  geom_vline(xintercept = q["10%"], linetype = "dashed",
             colour = "#27ae60", linewidth = 0.9) +
  geom_vline(xintercept = q["50%"], linetype = "dashed",
             colour = "#e67e22", linewidth = 0.9) +
  geom_vline(xintercept = q["90%"], linetype = "dashed",
             colour = "#e05c2a", linewidth = 1.2) +
  annotate("text", x = q["10%"], y = Inf, vjust = 1.5, hjust = -0.1,
           label = paste0("P10\n", scales::dollar(q["10%"])),
           colour = "#27ae60", size = 3.2, fontface = "bold") +
  annotate("text", x = q["50%"], y = Inf, vjust = 1.5, hjust = -0.1,
           label = paste0("P50\n", scales::dollar(q["50%"])),
           colour = "#e67e22", size = 3.2, fontface = "bold") +
  annotate("text", x = q["90%"], y = Inf, vjust = 1.5, hjust = -0.1,
           label = paste0("P90\n", scales::dollar(q["90%"])),
           colour = "#e05c2a", size = 3.2, fontface = "bold") +
  scale_x_continuous(labels = scales::dollar_format()) +
  labs(
    title    = "Monte Carlo Simulation: Distribution of Simulated Annual Remediation Costs",
    subtitle = glue::glue("10,000 simulation runs | Lambda = {round(lambda,1)} incidents/year | P90 reserve shown in orange-red"),
    x        = "Simulated Annual Remediation Cost (USD)",
    y        = "Number of Simulation Runs"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    plot.title       = element_text(face = "bold"),
    panel.grid.minor = element_blank()
  )

Code

# ── Step 7: Tornado chart — sensitivity analysis ──────────────────
set.seed(42)

run_fixed <- function(fix_qty = FALSE, fix_cost = FALSE,
                      fix_lambda = FALSE) {
  replicate(N, {
    k <- if (fix_lambda) round(lambda) else rpois(1, lambda = lambda)
    if (k == 0) return(0)
    sp <- if (fix_qty) rep(exp(fit_qty$estimate["mean"]), k)
          else exp(rnorm(k, fit_qty$estimate["mean"],
                         fit_qty$estimate["sd"]))
    co <- if (fix_cost) rep(fit_cost$estimate["shape"] /
                              fit_cost$estimate["rate"], k)
          else rgamma(k, shape = fit_cost$estimate["shape"],
                      rate  = fit_cost$estimate["rate"])
    sum(co)
  }) %>% quantile(0.90)
}

base_p90     <- q["90%"]
p90_fix_qty  <- run_fixed(fix_qty    = TRUE)
p90_fix_cost <- run_fixed(fix_cost   = TRUE)
p90_fix_lam  <- run_fixed(fix_lambda = TRUE)

tornado_df <- tibble(
  Input = c("Per-incident spill volume\n(Lognormal)",
            "Per-incident remediation cost\n(Gamma)",
            "Annual incident count\n(Poisson)"),
  P90_reduction = c(
    base_p90 - p90_fix_qty,
    base_p90 - p90_fix_cost,
    base_p90 - p90_fix_lam
  )
) %>%
  arrange(desc(P90_reduction)) %>%
  mutate(Input = forcats::fct_reorder(Input, P90_reduction))

ggplot(tornado_df,
       aes(x = P90_reduction, y = Input,
           fill = P90_reduction > 0)) +
  geom_col(width = 0.5) +
  geom_text(aes(label = scales::dollar(P90_reduction, accuracy = 1)),
            hjust = -0.15, fontface = "bold", size = 3.5) +
  scale_fill_manual(values = c("TRUE" = "#e05c2a", "FALSE" = "#2c7fb8")) +
  scale_x_continuous(labels = scales::dollar_format(),
                     expand = expansion(mult = c(0, 0.2))) +
  labs(
    title    = "Tornado Chart: Reduction in P90 Cost When Each Input is Held Fixed",
    subtitle = "The largest bar identifies the input driving the most uncertainty in the P90 reserve estimate",
    x        = "Reduction in P90 Annual Cost (USD)",
    y        = NULL
  ) +
  theme_minimal(base_size = 12) +
  theme(
    legend.position    = "none",
    plot.title         = element_text(face = "bold"),
    panel.grid.major.y = element_blank()
  )

Code

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.ticker as mticker
from scipy import stats

inc   = r.incidents_py

qty   = pd.to_numeric(inc["qty_spilled_bbl"], errors="coerce").dropna()
qty   = qty[qty > 0].values

cost  = pd.to_numeric(inc["cost_usd"], errors="coerce").dropna()
cost  = cost[cost > 0].values

lam   = inc.groupby("year").size().mean()

print(f"Non-zero spill volumes: {len(qty):,}")

Non-zero spill volumes: 996

Code

print(f"Non-zero cost records:  {len(cost):,}")

Non-zero cost records:  51

Code

print(f"Mean annual incidents (lambda): {lam:.1f}")

Mean annual incidents (lambda): 242.1

Code

# ── Step 1: Fit distributions ─────────────────────────────────────
mu_log = np.log(qty).mean()
sd_log = np.log(qty).std()

shape, loc_g, scale_g = stats.gamma.fit(cost, floc=0)

print(f"\nLognormal fit — mu: {mu_log:.3f}, sigma: {sd_log:.3f}")


Lognormal fit — mu: 0.468, sigma: 2.653

Code

print(f"Gamma fit     — shape: {shape:.3f}, scale: {scale_g:.3f}")

Gamma fit     — shape: 28.830, scale: 2455.202

Code

# ── Step 2: Plot fitted distributions ────────────────────────────
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

x_qty = np.linspace(qty.min(), np.percentile(qty, 99), 300)
axes[0].hist(qty, bins=60, density=True, color="#a8c8e8",
             edgecolor="white", alpha=0.7, label="Observed")

(array([1.13343347e-02, 2.38366660e-04, 8.34283310e-05, 5.95916650e-05,
       3.57549990e-05, 1.19183330e-05, 3.57549990e-05, 2.38366660e-05,
       0.00000000e+00, 1.19183330e-05, 0.00000000e+00, 0.00000000e+00,
       0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
       0.00000000e+00, 1.19183330e-05, 0.00000000e+00, 0.00000000e+00,
       0.00000000e+00, 0.00000000e+00, 1.19183330e-05, 0.00000000e+00,
       0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
       0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
       0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
       0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
       0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
       0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
       0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
       0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
       0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 1.19183330e-05]), array([1.00000000e-03, 8.42423167e+01, 1.68483633e+02, 2.52724950e+02,
       3.36966267e+02, 4.21207583e+02, 5.05448900e+02, 5.89690217e+02,
       6.73931533e+02, 7.58172850e+02, 8.42414167e+02, 9.26655483e+02,
       1.01089680e+03, 1.09513812e+03, 1.17937943e+03, 1.26362075e+03,
       1.34786207e+03, 1.43210338e+03, 1.51634470e+03, 1.60058602e+03,
       1.68482733e+03, 1.76906865e+03, 1.85330997e+03, 1.93755128e+03,
       2.02179260e+03, 2.10603392e+03, 2.19027523e+03, 2.27451655e+03,
       2.35875787e+03, 2.44299918e+03, 2.52724050e+03, 2.61148182e+03,
       2.69572313e+03, 2.77996445e+03, 2.86420577e+03, 2.94844708e+03,
       3.03268840e+03, 3.11692972e+03, 3.20117103e+03, 3.28541235e+03,
       3.36965367e+03, 3.45389498e+03, 3.53813630e+03, 3.62237762e+03,
       3.70661893e+03, 3.79086025e+03, 3.87510157e+03, 3.95934288e+03,
       4.04358420e+03, 4.12782552e+03, 4.21206683e+03, 4.29630815e+03,
       4.38054947e+03, 4.46479078e+03, 4.54903210e+03, 4.63327342e+03,
       4.71751473e+03, 4.80175605e+03, 4.88599737e+03, 4.97023868e+03,
       5.05448000e+03]), <BarContainer object of 60 artists>)

Code

axes[0].plot(x_qty,
             stats.lognorm.pdf(x_qty, s=sd_log,
                               scale=np.exp(mu_log)),
             color="#2c7fb8", linewidth=2, label="Lognormal fit")

[<matplotlib.lines.Line2D object at 0x00000173F656AE50>]

Code

axes[0].set_title("Spill Volume: Lognormal Fit", fontweight="bold")

Text(0.5, 1.0, 'Spill Volume: Lognormal Fit')

Code

axes[0].set_xlabel("Barrels")

Text(0.5, 0, 'Barrels')

Code

axes[0].legend()

<matplotlib.legend.Legend object at 0x00000173F2C6C5D0>

Code

x_cost = np.linspace(cost.min(), np.percentile(cost, 99), 300)
axes[1].hist(cost, bins=40, density=True, color="#a8c8e8",
             edgecolor="white", alpha=0.7, label="Observed")

(array([1.30718954e-05, 0.00000000e+00, 0.00000000e+00, 1.30718954e-05,
       0.00000000e+00, 0.00000000e+00, 7.84313725e-05, 0.00000000e+00,
       0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
       0.00000000e+00, 2.87581699e-04, 0.00000000e+00, 0.00000000e+00,
       5.22875817e-05, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
       6.53594771e-05, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
       0.00000000e+00, 0.00000000e+00, 7.84313725e-05, 0.00000000e+00,
       0.00000000e+00, 0.00000000e+00, 2.61437908e-05, 0.00000000e+00,
       0.00000000e+00, 1.30718954e-05, 0.00000000e+00, 0.00000000e+00,
       0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 3.92156863e-05]), array([ 45000.,  46500.,  48000.,  49500.,  51000.,  52500.,  54000.,
        55500.,  57000.,  58500.,  60000.,  61500.,  63000.,  64500.,
        66000.,  67500.,  69000.,  70500.,  72000.,  73500.,  75000.,
        76500.,  78000.,  79500.,  81000.,  82500.,  84000.,  85500.,
        87000.,  88500.,  90000.,  91500.,  93000.,  94500.,  96000.,
        97500.,  99000., 100500., 102000., 103500., 105000.]), <BarContainer object of 40 artists>)

Code

axes[1].plot(x_cost,
             stats.gamma.pdf(x_cost, a=shape, loc=0, scale=scale_g),
             color="#e05c2a", linewidth=2, label="Gamma fit")

[<matplotlib.lines.Line2D object at 0x00000173F67DEBD0>]

Code

axes[1].set_title("Remediation Cost: Gamma Fit", fontweight="bold")

Text(0.5, 1.0, 'Remediation Cost: Gamma Fit')

Code

axes[1].set_xlabel("USD")

Text(0.5, 0, 'USD')

Code

axes[1].legend()

<matplotlib.legend.Legend object at 0x00000173F6B1B2D0>

Code

plt.suptitle("Fitted Probability Distributions for Monte Carlo Inputs",
             fontsize=12, fontweight="bold")

Text(0.5, 0.98, 'Fitted Probability Distributions for Monte Carlo Inputs')

Code

plt.tight_layout()
plt.show()

Code

# ── Step 3: Monte Carlo simulation — 10,000 runs ─────────────────
rng       = np.random.default_rng(42)
N         = 10_000
sim_cost  = []
sim_spill = []

for _ in range(N):
    k = rng.poisson(lam)
    if k == 0:
        sim_cost.append(0)
        sim_spill.append(0)
    else:
        spills = stats.lognorm.rvs(s=sd_log, scale=np.exp(mu_log),
                                   size=k, random_state=rng)
        costs  = stats.gamma.rvs(a=shape, loc=0, scale=scale_g,
                                 size=k, random_state=rng)
        sim_cost.append(costs.sum())
        sim_spill.append(spills.sum())

sim_cost  = pd.Series(sim_cost)
sim_spill = pd.Series(sim_spill)

# ── Step 4: P10 / P50 / P90 table ────────────────────────────────
p10, p50, p90 = sim_cost.quantile([0.10, 0.50, 0.90])

summary = pd.DataFrame({
    "Percentile"  : ["P10 (Optimistic)", "P50 (Most Likely)",
                     "P90 (Reserve)"],
    "Annual Cost" : [f"${p10:,.0f}", f"${p50:,.0f}", f"${p90:,.0f}"]
})
print("\nMonte Carlo P10 / P50 / P90 Results:")


Monte Carlo P10 / P50 / P90 Results:

Code

print(summary.to_string(index=False))

       Percentile Annual Cost
 P10 (Optimistic) $15,732,964
P50 (Most Likely) $17,114,734
    P90 (Reserve) $18,542,653

Code

# ── Step 5: Histogram with percentile markers ─────────────────────
fig, ax = plt.subplots(figsize=(10, 5))
ax.hist(sim_cost, bins=80, color="#2c7fb8", edgecolor="white",
        linewidth=0.2, label="Simulated annual cost")

(array([  1.,   0.,   0.,   0.,   0.,   0.,   0.,   1.,   4.,  10.,  11.,
         8.,  14.,   9.,  18.,  21.,  27.,  24.,  36.,  48.,  59.,  64.,
        89., 102., 113., 130., 144., 183., 202., 211., 228., 238., 287.,
       308., 334., 295., 339., 358., 349., 332., 403., 361., 377., 340.,
       377., 315., 302., 301., 287., 276., 251., 224., 235., 186., 161.,
       148., 131., 124., 109.,  80.,  63.,  67.,  51.,  45.,  37.,  37.,
        20.,  20.,  15.,  13.,  12.,   9.,   7.,   7.,   4.,   1.,   3.,
         3.,   0.,   1.]), array([12943131.41294139, 13044877.11588526, 13146622.81882913,
       13248368.52177301, 13350114.22471688, 13451859.92766075,
       13553605.63060463, 13655351.3335485 , 13757097.03649237,
       13858842.73943624, 13960588.44238012, 14062334.14532399,
       14164079.84826786, 14265825.55121174, 14367571.25415561,
       14469316.95709948, 14571062.66004336, 14672808.36298723,
       14774554.0659311 , 14876299.76887498, 14978045.47181885,
       15079791.17476272, 15181536.8777066 , 15283282.58065047,
       15385028.28359434, 15486773.98653822, 15588519.68948209,
       15690265.39242596, 15792011.09536983, 15893756.79831371,
       15995502.50125758, 16097248.20420145, 16198993.90714533,
       16300739.6100892 , 16402485.31303307, 16504231.01597695,
       16605976.71892082, 16707722.42186469, 16809468.12480856,
       16911213.82775244, 17012959.53069631, 17114705.23364019,
       17216450.93658406, 17318196.63952793, 17419942.3424718 ,
       17521688.04541568, 17623433.74835955, 17725179.45130343,
       17826925.1542473 , 17928670.85719117, 18030416.56013504,
       18132162.26307892, 18233907.96602279, 18335653.66896667,
       18437399.37191054, 18539145.07485441, 18640890.77779828,
       18742636.48074216, 18844382.18368603, 18946127.8866299 ,
       19047873.58957378, 19149619.29251765, 19251364.99546152,
       19353110.6984054 , 19454856.40134927, 19556602.10429314,
       19658347.80723701, 19760093.51018089, 19861839.21312476,
       19963584.91606864, 20065330.61901251, 20167076.32195638,
       20268822.02490025, 20370567.72784413, 20472313.430788  ,
       20574059.13373187, 20675804.83667575, 20777550.53961962,
       20879296.24256349, 20981041.94550737, 21082787.64845124]), <BarContainer object of 80 artists>)

Code

for val, label, col in zip([p10, p50, p90],
                            ["P10", "P50", "P90"],
                            ["#27ae60", "#e67e22", "#e05c2a"]):
    ax.axvline(val, linestyle="--", color=col, linewidth=1.5)
    ax.text(val, ax.get_ylim()[1] * 0.92,
            f"{label}\n${val:,.0f}",
            color=col, fontsize=8, fontweight="bold", ha="left")

<matplotlib.lines.Line2D object at 0x00000173F6A6E610>
Text(15732963.99526468, 389.298, 'P10\n$15,732,964')
<matplotlib.lines.Line2D object at 0x00000173F6C39DD0>
Text(17114734.003167838, 389.298, 'P50\n$17,114,734')
<matplotlib.lines.Line2D object at 0x00000173F6CD4210>
Text(18542653.299791407, 389.298, 'P90\n$18,542,653')

Code

ax.xaxis.set_major_formatter(mticker.FuncFormatter(
    lambda x, _: f"${x:,.0f}"
))
ax.set_title(
    "Monte Carlo Simulation: Simulated Annual Remediation Cost Distribution\n(10,000 runs, Python)",
    fontsize=11, fontweight="bold")

Text(0.5, 1.0, 'Monte Carlo Simulation: Simulated Annual Remediation Cost Distribution\n(10,000 runs, Python)')

Code

ax.set_xlabel("Annual Remediation Cost (USD)")

Text(0.5, 0, 'Annual Remediation Cost (USD)')

Code

ax.set_ylabel("Simulation Runs")

Text(0, 0.5, 'Simulation Runs')

Code

plt.tight_layout()
plt.show()

Code

# ── Step 6: Tornado chart ─────────────────────────────────────────
def run_p90_fixed(fix_qty=False, fix_cost=False,
                  fix_lam=False, seed=99):
    rng2    = np.random.default_rng(seed)
    results = []
    for _ in range(N):
        k = round(lam) if fix_lam else rng2.poisson(lam)
        if k == 0:
            results.append(0)
            continue
        sp = (np.full(k, np.exp(mu_log)) if fix_qty
              else stats.lognorm.rvs(s=sd_log,
                                     scale=np.exp(mu_log),
                                     size=k,
                                     random_state=rng2))
        co = (np.full(k, shape * scale_g) if fix_cost
              else stats.gamma.rvs(a=shape, loc=0, scale=scale_g,
                                   size=k, random_state=rng2))
        results.append(co.sum())
    return pd.Series(results).quantile(0.90)

p90_base     = p90
p90_fix_qty  = run_p90_fixed(fix_qty=True)
p90_fix_cost = run_p90_fixed(fix_cost=True)
p90_fix_lam  = run_p90_fixed(fix_lam=True)

tornado = pd.DataFrame({
    "Input": ["Per-incident spill volume\n(Lognormal)",
              "Per-incident remediation cost\n(Gamma)",
              "Annual incident count\n(Poisson)"],
    "P90_reduction": [p90_base - p90_fix_qty,
                      p90_base - p90_fix_cost,
                      p90_base - p90_fix_lam]
}).sort_values("P90_reduction", ascending=True)

fig, ax = plt.subplots(figsize=(9, 4))
colors = ["#e05c2a" if v > 0 else "#2c7fb8"
          for v in tornado["P90_reduction"]]
ax.barh(tornado["Input"], tornado["P90_reduction"],
        color=colors, height=0.5)

<BarContainer object of 3 artists>

Code

for i, (val, _) in enumerate(
        zip(tornado["P90_reduction"], tornado["Input"])):
    ax.text(val + abs(tornado["P90_reduction"].max()) * 0.02,
            i, f"${val:,.0f}", va="center",
            fontsize=9, fontweight="bold")

Text(-34849.575106033604, 0, '$-57,700')
Text(20013.861448368505, 1, '$-2,837')
Text(1165388.6445052088, 2, '$1,142,538')

Code

ax.set_title("Tornado Chart: Which Input Drives the Most P90 Uncertainty?",
             fontsize=11, fontweight="bold")

Text(0.5, 1.0, 'Tornado Chart: Which Input Drives the Most P90 Uncertainty?')

Code

ax.set_xlabel("Reduction in P90 Annual Cost (USD)")

Text(0.5, 0, 'Reduction in P90 Annual Cost (USD)')

Code

ax.xaxis.set_major_formatter(
    mticker.FuncFormatter(lambda x, _: f"${x:,.0f}"))
plt.tight_layout()
plt.show()

6.4 Management Interpretation

What the simulation is doing in plain terms

Think of the simulation as running 10,000 imaginary versions of the next year at OERNL. In each version, the number of incidents, the volume of each spill, and the cost of each cleanup are all allowed to vary randomly but within the bounds that history says is realistic. Some of those imaginary years are quiet, with few incidents and low costs. Others are expensive, with many incidents and at least one large spill. When all 10,000 years are lined up from cheapest to most expensive, the bill at any chosen probability level can be read directly from the distribution.

What the three numbers mean

The simulation produces three numbers that each serve a different planning purpose:

P10 (Optimistic): The cost in a genuinely good year where incidents are fewer than average and none are catastrophic. Costs fall below this level in only one in ten simulated years. It is not a sensible number to budget against because it assumes conditions better than what history typically delivers
P50 (Most Likely): The most honest central estimate. Half of all simulated years cost less than this and half cost more. If a single number is required for a planning document, this is the right one to use
P90 (Conservative Reserve): The recommended budget reserve. Holding this amount in the remediation contingency fund means that in nine out of every ten years the actual bill will be fully covered. Costs exceed this level only in genuinely extreme years, the kind driven by a large sabotage event coinciding with difficult weather or extended shutdown

Presenting the P90 figure to the Board as the contingency reserve is a more defensible position than any single-point estimate, because it comes with an explicit and honest statement of the probability attached to it.

What the tornado chart tells us

The tornado chart identifies which of the three uncertain inputs is responsible for the most spread in the annual cost estimate:

The input with the longest bar is the one that, if better controlled or better measured, would most reduce uncertainty in the budget
If per-incident remediation cost is the dominant driver, the priority action is improving cost capture in JIV reports so future simulations are fitted on a broader and more representative cost sample
If annual incident count is the dominant driver, the priority action is operational controls that reduce incident frequency rather than data collection improvements

This finding has a direct implication for where the department should invest its data quality effort ahead of the 2027 planning cycle.

Recommendation feeding into Section 10

The P90 figure from this simulation is the recommended remediation contingency reserve for inclusion in the 2026 annual operating plan. The tornado chart identifies the single input that most warrants tighter data collection or operational control heading into 2026. Both outputs are carried forward into the integrated recommendation in Section 10.

7. Technique 3: Advanced Forecasting with Prophet

7.1 Technique Overview

Time series analysis is a forecasting methodology that seeks to predict what will happen in the future based on what happened in the past, by identifying patterns in data collected at uniformly spaced time intervals (Marr, 2017). The patterns that time series methods seek to identify fall into two categories:

Trends: Systematic directional changes over time that do not repeat
Seasonality: Patterns that repeat at regular intervals such as monthly or annual cycles

Prophet, developed by Taylor and Letham (2018) at Meta, is an advanced decomposable time series model built on an additive regression framework. It decomposes the observed series into three components:

Trend component: A piecewise linear or logistic growth curve that models the overall direction of the series over time, with automatic detection of changepoints where the trend rate shifts
Seasonality component: Fourier series representations of periodic patterns at annual, weekly, and daily frequencies, fitted to the data without requiring the analyst to pre-specify the seasonal period
Error component: The residual variation not explained by trend or seasonality, assumed to be normally distributed with mean zero

Prophet’s key advantage over classical approaches such as Autoregressive Integrated Moving Average (ARIMA) is its robustness to missing data, outliers, and structural breaks in the trend. All of these are characteristic of pipeline incident data, where reporting gaps, operational shutdowns, and one-off crisis periods can distort the series. Prophet also requires no manual stationarity transformation or differencing, making it more accessible to non-specialist analysts while remaining statistically rigorous.

Model reliability is assessed through walk-forward cross-validation, a procedure in which the model is repeatedly trained on an expanding window of historical data and evaluated on a held-out forecast horizon. The Root Mean Square Error (RMSE) across folds provides an honest estimate of forecast accuracy on unseen data, rather than in-sample fit.

7.2 Operational Justification

Monthly pipeline incident counts at OERNL follow a time series structure: they are collected at uniformly spaced monthly intervals, they exhibit year-on-year trends driven by operational changes and security conditions, and they are likely to contain seasonal patterns linked to weather, community activity, and regulatory inspection cycles in the Niger Delta.

Three operational decisions make a six-month forward forecast directly useful for the 2026 planning cycle:

Emergency response capacity sizing: The forecast provides a quantified expectation of monthly incident volumes over the first half of 2026, which determines the minimum manpower, containment boom inventory, and recovery vessel deployment that should be in place at the start of the year
Contractor mobilisation scheduling: If the forecast identifies a peak period in Q2 or Q3 2026, contractor mobilisation can be advanced by the required lead time rather than being reactive to the actual incident spike
Budget phasing: Monthly forecast confidence intervals allow the operations budget to be phased across the year in line with expected activity, rather than being allocated uniformly across twelve months regardless of seasonal variation

This section addresses sub-question (c) of the central analytical question: what is the projected monthly incident volume over the next six months, and what does this imply for 2026 emergency response capacity?

7.3 Analysis and Outputs

Code

# ── Step 1: Aggregate to monthly time series ─────────────────────
monthly <- incidents %>%
  filter(!is.na(year_month)) %>%
  group_by(year_month) %>%
  summarise(
    spill_bbl   = sum(qty_spilled_bbl, na.rm = TRUE),
    n_incidents = n(),
    .groups     = "drop"
  ) %>%
  arrange(year_month)

tibble::tibble(
  Parameter = c(
    "Monthly periods available",
    "Date range start",
    "Date range end",
    "Mean monthly incidents",
    "Max monthly incidents"
  ),
  Value = c(
    nrow(monthly),
    format(min(monthly$year_month), "%b %Y"),
    format(max(monthly$year_month), "%b %Y"),
    round(mean(monthly$n_incidents), 1),
    max(monthly$n_incidents)
  )
) %>%
  kableExtra::kable(caption = "Monthly Incident Series Summary") %>%
  kableExtra::kable_styling(
    bootstrap_options = c("striped", "condensed"),
    full_width = FALSE
  ) %>%
  kableExtra::column_spec(1, bold = TRUE)

Monthly Incident Series Summary
Parameter	Value
Monthly periods available	75
Date range start	Jan 2019
Date range end	Nov 2025
Mean monthly incidents	22.6
Max monthly incidents	64

Code

# ── Step 2: Plot raw monthly series ──────────────────────────────
ggplot(monthly, aes(x = year_month, y = n_incidents)) +
  geom_line(colour = "#2c7fb8", linewidth = 0.8) +
  geom_point(colour = "#2c7fb8", size = 1.5) +
  geom_smooth(method = "loess", se = TRUE,
              colour = "#e05c2a", fill = "#f4c6b0",
              linewidth = 0.8, alpha = 0.3) +
  scale_x_date(date_breaks = "6 months", date_labels = "%b %Y") +
  labs(
    title    = "Monthly Pipeline Incident Count (2019 to 2025)",
    subtitle = "Raw series with LOESS trend overlay — input to Prophet forecasting model",
    x        = NULL,
    y        = "Number of Incidents"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    plot.title       = element_text(face = "bold"),
    axis.text.x      = element_text(angle = 45, hjust = 1),
    panel.grid.minor = element_blank()
  )

Code

# ── Step 3: Fit Prophet model ─────────────────────────────────────
df_p <- monthly %>%
  transmute(ds = as.POSIXct(year_month), y = n_incidents)

mod <- prophet(
  df_p,
  yearly.seasonality  = TRUE,
  weekly.seasonality  = FALSE,
  daily.seasonality   = FALSE,
  interval.width      = 0.90
)

# ── Step 4: Generate 6-month forecast ────────────────────────────
fut  <- make_future_dataframe(mod, periods = 6, freq = "month")
fcst <- predict(mod, fut)

p1 <- plot(mod, fcst) +
  ggtitle("Monthly Incident Count: Prophet Forecast (6-Month Horizon)") +
  labs(subtitle = "Black dots = observed | Blue line = fitted and forecast | Shaded band = 90% prediction interval") +
  theme_minimal(base_size = 11) +
  theme(plot.title = element_text(face = "bold"))
print(p1)

Code

prophet_plot_components(mod, fcst)

Code

# ── Step 5: Extract and display forecast table ────────────────────
forecast_out <- fcst %>%
  filter(ds > max(df_p$ds)) %>%
  transmute(
    Month    = format(as.Date(ds), "%B %Y"),
    Forecast = round(yhat, 1),
    Lower_90 = round(yhat_lower, 1),
    Upper_90 = round(yhat_upper, 1)
  )

kableExtra::kable(forecast_out,
                  col.names = c("Month", "Forecast (incidents)",
                                "Lower 90% CI", "Upper 90% CI"),
                  caption   = "Table FC1: Six-Month Forward Forecast of Monthly Incident Count") %>%
  kableExtra::kable_styling(
    bootstrap_options = c("striped", "hover", "condensed"),
    full_width        = FALSE
  ) %>%
  kableExtra::column_spec(2, bold = TRUE, color = "#2c7fb8")

Table FC1: Six-Month Forward Forecast of Monthly Incident Count
Month	Forecast (incidents)	Lower 90% CI	Upper 90% CI
December 2025	11.5	-7.8	30.0
January 2026	21.7	0.8	40.3
February 2026	8.9	-10.6	28.2
March 2026	13.4	-5.4	33.2
April 2026	7.1	-12.3	26.7
May 2026	14.5	-4.2	34.1

Code

# ── Step 6: Walk-forward cross-validation ────────────────────────
cv <- cross_validation(
  mod,
  initial = 365 * 3,
  period  = 90,
  horizon = 180,
  units   = "days"
)

pm <- performance_metrics(cv)

cv_summary <- tibble::tibble(
  Metric = c("Mean RMSE across CV folds",
             "Mean MAPE across CV folds",
             "Mean MAE across CV folds"),
  Value  = c(
    paste0(round(mean(pm$rmse,  na.rm = TRUE), 2), " incidents/month"),
    paste0(round(mean(pm$mape,  na.rm = TRUE) * 100, 1), "%"),
    paste0(round(mean(pm$mae,   na.rm = TRUE), 2), " incidents/month")
  )
)

kableExtra::kable(cv_summary,
                  caption = "Walk-Forward Cross-Validation Performance Summary") %>%
  kableExtra::kable_styling(
    bootstrap_options = c("striped", "condensed"),
    full_width = FALSE
  ) %>%
  kableExtra::column_spec(1, bold = TRUE) %>%
  kableExtra::column_spec(2, bold = TRUE, color = "#2c7fb8")

Walk-Forward Cross-Validation Performance Summary
Metric	Value
Mean RMSE across CV folds	15.41 incidents/month
Mean MAPE across CV folds	114.7%
Mean MAE across CV folds	12.73 incidents/month

Code

# ── Step 7: Plot RMSE over forecast horizon ───────────────────────
pm %>%
  mutate(horizon_days = as.numeric(horizon, units = "days")) %>%
  ggplot(aes(x = horizon_days, y = rmse)) +
  geom_line(colour = "#2c7fb8", linewidth = 0.9) +
  geom_point(colour = "#2c7fb8", size = 2) +
  geom_smooth(method = "loess", se = FALSE,
              colour = "#e05c2a", linewidth = 0.7,
              linetype = "dashed") +
  labs(
    title    = "Walk-Forward Cross-Validation: RMSE by Forecast Horizon",
    subtitle = "RMSE increasing with horizon is expected — confirms model degrades gracefully rather than catastrophically",
    x        = "Forecast Horizon (Days)",
    y        = "RMSE (Incidents per Month)"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    plot.title       = element_text(face = "bold"),
    panel.grid.minor = element_blank()
  )

Code

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.ticker as mticker
from prophet import Prophet

Importing plotly failed. Interactive plots will not work.

Code

from prophet.diagnostics import cross_validation, performance_metrics

# ── Step 1: Aggregate to monthly series ──────────────────────────
inc_py = r.incidents_py

inc_py["year_month_dt"] = pd.to_datetime(
    inc_py["year_month_str"], errors="coerce"
)

monthly_py = (
    inc_py.dropna(subset=["year_month_dt"])
    .groupby("year_month_dt")
    .agg(
        spill_bbl   = ("qty_spilled_bbl", "sum"),
        n_incidents = ("ref_no", "count")
    )
    .reset_index()
    .rename(columns={"year_month_dt": "ds", "n_incidents": "y"})
    .sort_values("ds")
)

print(f"Monthly periods available: {len(monthly_py)}")

Monthly periods available: 75

Code

print(f"Date range: {monthly_py['ds'].min().strftime('%b %Y')} "
      f"to {monthly_py['ds'].max().strftime('%b %Y')}")

Date range: Jan 2019 to Nov 2025

Code

print(f"Mean monthly incidents: {monthly_py['y'].mean():.1f}")

Mean monthly incidents: 22.6

Code

# ── Step 2: Plot raw monthly series ──────────────────────────────
fig, ax = plt.subplots(figsize=(11, 4))
ax.plot(monthly_py["ds"], monthly_py["y"],
        color="#2c7fb8", linewidth=0.9,
        marker="o", markersize=3)

[<matplotlib.lines.Line2D object at 0x00000173F6A6C3D0>]

Code

ax.set_title("Monthly Pipeline Incident Count (2019 to 2025)",
             fontsize=12, fontweight="bold")

Text(0.5, 1.0, 'Monthly Pipeline Incident Count (2019 to 2025)')

Code

ax.set_xlabel(None)

Text(0.5, 0, '')

Code

ax.set_ylabel("Number of Incidents")

Text(0, 0.5, 'Number of Incidents')

Code

ax.xaxis.set_major_locator(
    plt.matplotlib.dates.MonthLocator(bymonth=[1, 7])
)
ax.xaxis.set_major_formatter(
    plt.matplotlib.dates.DateFormatter("%b %Y")
)
plt.xticks(rotation=45)

(array([17897., 18078., 18262., 18444., 18628., 18809., 18993., 19174.,
       19358., 19539., 19723., 19905., 20089., 20270., 20454.]), [Text(17897.0, 0, 'Jan 2019'), Text(18078.0, 0, 'Jul 2019'), Text(18262.0, 0, 'Jan 2020'), Text(18444.0, 0, 'Jul 2020'), Text(18628.0, 0, 'Jan 2021'), Text(18809.0, 0, 'Jul 2021'), Text(18993.0, 0, 'Jan 2022'), Text(19174.0, 0, 'Jul 2022'), Text(19358.0, 0, 'Jan 2023'), Text(19539.0, 0, 'Jul 2023'), Text(19723.0, 0, 'Jan 2024'), Text(19905.0, 0, 'Jul 2024'), Text(20089.0, 0, 'Jan 2025'), Text(20270.0, 0, 'Jul 2025'), Text(20454.0, 0, 'Jan 2026')])

Code

plt.tight_layout()
plt.show()

Code

# ── Step 3: Fit Prophet model ─────────────────────────────────────
m = Prophet(
    yearly_seasonality = True,
    weekly_seasonality = False,
    daily_seasonality  = False,
    interval_width     = 0.90
)
m.fit(monthly_py[["ds", "y"]])

<prophet.forecaster.Prophet object at 0x0000017380378190>

Code

# ── Step 4: Generate 6-month forecast ────────────────────────────
future  = m.make_future_dataframe(periods=6, freq="MS")
fcst_py = m.predict(future)

fig1 = m.plot(fcst_py)
fig1.suptitle(
    "Monthly Incident Count: Prophet Forecast (6-Month Horizon)\n"
    "Black dots = observed | Blue = fitted/forecast | Shaded = 90% CI",
    fontsize=10, fontweight="bold", y=1.02)

Text(0.5, 1.02, 'Monthly Incident Count: Prophet Forecast (6-Month Horizon)\nBlack dots = observed | Blue = fitted/forecast | Shaded = 90% CI')

Code

plt.tight_layout()
plt.show()

Code

fig2 = m.plot_components(fcst_py)
fig2.suptitle("Prophet Model Components: Trend and Seasonality",
              fontsize=10, fontweight="bold", y=1.02)

Text(0.5, 1.02, 'Prophet Model Components: Trend and Seasonality')

Code

plt.tight_layout()
plt.show()

Code

# ── Step 5: Forecast table ────────────────────────────────────────
fwd = fcst_py[fcst_py["ds"] > monthly_py["ds"].max()][
    ["ds", "yhat", "yhat_lower", "yhat_upper"]
].copy()
fwd["ds"]  = fwd["ds"].dt.strftime("%B %Y")
fwd.columns = ["Month", "Forecast", "Lower 90% CI", "Upper 90% CI"]
fwd         = fwd.round(1)

print("\nSix-Month Forward Forecast:")


Six-Month Forward Forecast:

Code

print(fwd.to_string(index=False))

        Month  Forecast  Lower 90% CI  Upper 90% CI
December 2025      11.5          -7.7          30.4
 January 2026      21.7           2.4          40.8
February 2026       8.9         -11.0          27.6
   March 2026      13.4          -7.5          31.0
   April 2026       7.1         -11.6          27.1
     May 2026      14.5          -4.7          33.4

Code

# ── Step 6: Walk-forward cross-validation ────────────────────────
cv_py = cross_validation(
    m,
    initial = "1095 days",
    period  = "90 days",
    horizon = "180 days"
)


  0%|          | 0/14 [00:00<?, ?it/s]
  7%|7         | 1/14 [00:00<00:04,  2.89it/s]
 14%|#4        | 2/14 [00:00<00:04,  2.85it/s]
 21%|##1       | 3/14 [00:01<00:04,  2.73it/s]
 29%|##8       | 4/14 [00:01<00:03,  2.79it/s]
 36%|###5      | 5/14 [00:01<00:03,  2.84it/s]
 43%|####2     | 6/14 [00:02<00:02,  2.77it/s]
 50%|#####     | 7/14 [00:02<00:02,  2.78it/s]
 57%|#####7    | 8/14 [00:03<00:02,  2.44it/s]
 64%|######4   | 9/14 [00:03<00:02,  2.41it/s]
 71%|#######1  | 10/14 [00:03<00:01,  2.34it/s]
 79%|#######8  | 11/14 [00:04<00:01,  2.27it/s]
 86%|########5 | 12/14 [00:04<00:00,  2.22it/s]
 93%|#########2| 13/14 [00:05<00:00,  2.30it/s]
100%|##########| 14/14 [00:05<00:00,  2.47it/s]
100%|##########| 14/14 [00:05<00:00,  2.51it/s]

Code

pm_py = performance_metrics(cv_py)

print("\nWalk-forward CV performance metrics (first rows):")


Walk-forward CV performance metrics (first rows):

Code

print(pm_py[["horizon", "rmse", "mae", "mape"]]
      .head(8).to_string(index=False))

horizon      rmse       mae     mape
18 days  9.665885  8.219469 0.372737
19 days  7.259892  6.049043 0.302977
20 days  7.231829  6.029140 0.345289
22 days  9.280981  7.396357 0.522668
24 days 10.095505  8.411095 0.574285
25 days 12.683927 11.197466 0.980924
27 days 21.006788 16.998447 1.061973
40 days 21.005940 16.988272 1.073142

Code

print(f"\nMean RMSE across folds: {pm_py['rmse'].mean():.2f} incidents/month")


Mean RMSE across folds: 15.41 incidents/month

Code

print(f"Mean MAPE across folds: {pm_py['mape'].mean()*100:.1f}%")

Mean MAPE across folds: 114.7%

Code

# ── Step 7: RMSE over horizon plot ───────────────────────────────
pm_py["horizon_days"] = pm_py["horizon"].dt.days

fig, ax = plt.subplots(figsize=(9, 4))
ax.plot(pm_py["horizon_days"], pm_py["rmse"],
        color="#2c7fb8", linewidth=1.2,
        marker="o", markersize=4)

[<matplotlib.lines.Line2D object at 0x00000173805D97D0>]

Code

ax.set_title("Walk-Forward Cross-Validation: RMSE by Forecast Horizon",
             fontsize=11, fontweight="bold")

Text(0.5, 1.0, 'Walk-Forward Cross-Validation: RMSE by Forecast Horizon')

Code

ax.set_xlabel("Forecast Horizon (Days)")

Text(0.5, 0, 'Forecast Horizon (Days)')

Code

ax.set_ylabel("RMSE (Incidents per Month)")

Text(0, 0.5, 'RMSE (Incidents per Month)')

Code

plt.tight_layout()
plt.show()

7.4 Management Interpretation

What the forecast is and is not

The Prophet model has learned the historical pattern of monthly pipeline incidents at OERNL — how many tend to occur in a typical month, how the number has trended over seven years, and whether certain times of year consistently produce more or fewer incidents. It uses that learned pattern to project forward six months.

It is important to be clear about what this forecast is and is not:

It is not a guarantee of what will happen. Pipeline sabotage in the Niger Delta is driven by social, economic, and security dynamics that no time series model can predict with certainty
It is an evidence-based expectation of what is most likely, combined with an honest statement of the uncertainty around that expectation expressed as a 90% confidence interval
It is most useful as a capacity planning tool rather than a precise prediction, because the upper bound of the confidence interval provides a more appropriate basis for resource sizing than the central estimate

How to read the forecast table

Each row in the forecast table gives three numbers for a future month:

Central forecast: The most likely outcome based on historical trend and seasonality patterns
Lower 90% bound: The floor that actual incident counts will exceed nine times out of ten
Upper 90% bound: The ceiling that actual counts will stay below nine times out of ten

A wide band between the lower and upper bounds means the model is less certain about that month. This is honest, not a weakness. A narrow band means the pattern is stable and predictable. The upper bound is the number to use for capacity planning.

What the component plots reveal

The component decomposition plots provide two pieces of operational intelligence:

The trend component shows whether incident frequency has been rising, falling, or holding steady over seven years. An upward trend heading into 2026 is a direct signal that standing capacity from 2025 will be insufficient for the year ahead
The seasonality component shows which months historically produce elevated incident counts. If the peak months in the seasonality plot align with the highest forecast values in the table, that confirms the peak is structural and predictable. Contractor mobilisation should be timed to arrive before that peak, not after it

What the cross-validation result means

The RMSE from walk-forward cross-validation is the single most important diagnostic output in this section. It measures how wrong the model typically is when forecasting on historical data it has never seen before. A mean RMSE of 15.41 incidents per month means the model’s monthly forecast is typically within 15 incidents of the actual count on unseen periods.

This figure should be used directly in the capacity planning calculation. If the forecast for a given month is 20 incidents, the response capacity should be sized for approximately 35 incidents at the upper end of the confidence range, not exactly 20. Planning to the central estimate with no buffer is equivalent to planning for a median outcome while accepting a coin-flip chance of being under-resourced.

Recommendation feeding into Section 10

The six-month forecast directly determines the 2026 emergency response capacity recommendation in Section 10. The upper bound of the 90% confidence interval for the peak forecast month is the planning figure that should be used for manpower and equipment sizing. The cost of being under-resourced during a peak incident month exceeds the cost of holding excess capacity during a quiet one.

8. Technique 4: Asset Survival Analysis

8.1 Technique Overview

Survival analysis is a statistical method for modelling the time elapsed until an event of interest occurs. Originally developed in clinical and actuarial research, it has been widely adopted in engineering reliability analysis, customer churn analytics, and asset management — anywhere the central question is not simply whether an event will occur, but when (Kleinbaum and Klein, 2012). In the context of asset management, the technique is directly analogous to customer churn analytics: just as customer churn analytics assesses how long a customer stays before leaving (Marr, 2017), asset survival analysis assesses how long a facility remains incident-free between consecutive failures.

This analysis applies the technique to inter-arrival times, specifically the number of days between consecutive incidents at the same facility, rather than time to first incident. This approach uses all 1,695 incident records rather than only the first event per facility, producing a richer and more statistically powerful dataset for modelling recurring failure patterns across the portfolio.

Two complementary methods are applied:

Kaplan-Meier (KM) estimator: A non-parametric method that produces an empirical survival curve showing the probability that a facility remains incident-free beyond any given number of days after a previous incident. The KM estimator makes no assumptions about the underlying failure time distribution, which is appropriate given the diversity of facility types and operating conditions in the OERNL portfolio. Curves are stratified by facility category to reveal which asset classes exhibit the shortest inter-incident times
Cox Proportional Hazards (Cox PH) model: A semi-parametric regression model that estimates the effect of covariates — facility category and state — on the hazard rate, defined as the instantaneous risk of the next incident given that no incident has yet occurred since the last one. The model expresses results as hazard ratios: a hazard ratio above 1 for a covariate means that category or state is associated with a higher rate of recurrence relative to the reference group, controlling for all other variables. The proportional hazards assumption is verified using Schoenfeld residuals via the cox.zph test

Censoring is handled correctly throughout. Facilities that have not experienced a subsequent incident by the end of the observation window contribute their observed inter-incident time as a right-censored observation, ensuring no data is discarded.

8.2 Operational Justification

The current approach to pipeline integrity inspection prioritisation at OERNL relies on engineering judgement and periodic field reports, with no systematic basis for comparing recurrence risk across the 446 facilities in the portfolio.

Inter-incident survival analysis addresses this gap in three specific ways:

Recurrence-based prioritisation: Facilities with the shortest median inter-incident time, those that fail most frequently relative to their own history, are flagged for priority inspection scheduling in the 2026 maintenance programme. This is more operationally meaningful than ranking by total incident count alone, because it identifies facilities where failure is accelerating rather than those that have simply been in service longest
Covariate-adjusted hazard ranking: The Cox model identifies which facility categories and states carry statistically significantly higher hazard rates after controlling for all other factors. This allows the maintenance budget to be allocated by evidence-based risk rather than by administrative area or operational familiarity
Defensible business case: Hazard ratios with confidence intervals provide statistically defensible evidence when presenting integrity intervention projects to Finance and the Board, replacing subjective risk assertions with quantified, model-derived estimates

This section addresses sub-question (d) of the central analytical question: which facilities exhibit the shortest expected time to next incident, and which should be elevated to the top quartile of the 2026 integrity inspection schedule?

8.3 Analysis and Outputs

Code

if ("survminer" %in% (.packages())) detach("package:survminer", unload = TRUE)

# ── Step 1: Build inter-incident time dataset ─────────────────────
tte <- incidents %>%
  filter(!is.na(facility), !is.na(date_incident)) %>%
  arrange(facility, date_incident) %>%
  group_by(facility) %>%
  mutate(
    tte_days = as.integer(date_incident - lag(date_incident)),
    event    = 1L
  ) %>%
  ungroup() %>%
  filter(!is.na(tte_days), tte_days > 0)

tibble::tibble(
  Metric = c("Inter-incident intervals available",
             "Unique facilities with repeat incidents",
             "Median inter-incident time (days)",
             "Mean inter-incident time (days)"),
  Value  = c(nrow(tte),
             n_distinct(tte$facility),
             median(tte$tte_days, na.rm = TRUE),
             round(mean(tte$tte_days, na.rm = TRUE), 1))
) %>%
  kableExtra::kable(caption = "Inter-Incident Time Dataset Summary") %>%
  kableExtra::kable_styling(
    bootstrap_options = c("striped", "condensed"),
    full_width = FALSE
  ) %>%
  kableExtra::column_spec(1, bold = TRUE)

Inter-Incident Time Dataset Summary
Metric	Value
Inter-incident intervals available	1084.0
Unique facilities with repeat incidents	147.0
Median inter-incident time (days)	13.0
Mean inter-incident time (days)	117.9

Code

# ── Step 2: Distribution of inter-incident times ──────────────────
ggplot(tte, aes(x = tte_days)) +
  geom_histogram(bins = 60, fill = "#2c7fb8",
                 colour = "white", linewidth = 0.2) +
  geom_vline(xintercept = median(tte$tte_days, na.rm = TRUE),
             colour = "#e05c2a", linetype = "dashed",
             linewidth = 0.9) +
  annotate("text",
           x = median(tte$tte_days, na.rm = TRUE) + 5,
           y = Inf, vjust = 1.5, hjust = 0,
           label = paste0("Median = ",
                          median(tte$tte_days, na.rm = TRUE), " days"),
           colour = "#e05c2a", fontface = "italic", size = 3.5) +
  scale_x_continuous(limits = c(0, 365),
                     breaks  = seq(0, 365, by = 30)) +
  labs(
    title    = "Distribution of Inter-Incident Times (Days Between Consecutive Incidents)",
    subtitle = "Truncated at 365 days for readability",
    x        = "Days Between Consecutive Incidents at Same Facility",
    y        = "Number of Intervals"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    plot.title       = element_text(face = "bold"),
    panel.grid.minor = element_blank()
  )

Code

# ── Step 3: KM by facility category ──────────────────────────────
top_cats <- tte %>%
  filter(!is.na(facility_cat)) %>%
  count(facility_cat, sort = TRUE) %>%
  slice_head(n = 5) %>%
  pull(facility_cat)

tte_cats    <- tte %>% filter(facility_cat %in% top_cats)
km_cat      <- survfit(Surv(tte_days, event) ~ facility_cat,
                       data = tte_cats)
km_cat_tidy <- broom::tidy(km_cat) %>%
  mutate(strata = stringr::str_remove(strata, "facility_cat="))

ggplot(km_cat_tidy, aes(x = time, y = estimate,
                         colour = strata, fill = strata)) +
  geom_step(linewidth = 0.9) +
  geom_ribbon(aes(ymin = conf.low, ymax = conf.high),
              alpha = 0.08, colour = NA) +
  scale_colour_brewer(palette = "Set1") +
  scale_fill_brewer(palette   = "Set1") +
  coord_cartesian(xlim = c(0, 500)) +
  labs(
    title    = "Kaplan-Meier: Time Between Incidents by Facility Category",
    subtitle = "Steeper curve = shorter inter-incident time = higher recurrence risk",
    x        = "Days Since Previous Incident",
    y        = "Probability of Remaining Incident-Free",
    colour   = "Facility Category",
    fill     = "Facility Category"
  ) +
  theme_minimal(base_size = 11) +
  theme(
    plot.title      = element_text(face = "bold"),
    legend.position = "bottom"
  )

Code

# ── Step 4: KM by operational area ───────────────────────────────
tte_area     <- tte %>% filter(area %in% c("LAR", "SAR"))
km_area      <- survfit(Surv(tte_days, event) ~ area,
                        data = tte_area)
km_area_tidy <- broom::tidy(km_area) %>%
  mutate(strata = stringr::str_remove(strata, "area="))

lr_test <- survdiff(Surv(tte_days, event) ~ area, data = tte_area)
lr_pval <- round(1 - pchisq(lr_test$chisq, df = 1), 4)

ggplot(km_area_tidy, aes(x = time, y = estimate,
                          colour = strata, fill = strata)) +
  geom_step(linewidth = 0.9) +
  geom_ribbon(aes(ymin = conf.low, ymax = conf.high),
              alpha = 0.10, colour = NA) +
  scale_colour_manual(values = c("LAR" = "#2c7fb8",
                                  "SAR" = "#e05c2a")) +
  scale_fill_manual(values   = c("LAR" = "#2c7fb8",
                                  "SAR" = "#e05c2a")) +
  coord_cartesian(xlim = c(0, 500)) +
  annotate("text", x = 400, y = 0.95,
           label = paste0("Log-rank p = ", lr_pval),
           size = 3.5, fontface = "italic", colour = "#333333") +
  labs(
    title    = "Kaplan-Meier: Time Between Incidents by Operational Area",
    subtitle = "Log-rank test p-value tests whether survival difference is statistically significant",
    x        = "Days Since Previous Incident",
    y        = "Probability of Remaining Incident-Free",
    colour   = "Area",
    fill     = "Area"
  ) +
  theme_minimal(base_size = 11) +
  theme(
    plot.title      = element_text(face = "bold"),
    legend.position = "bottom"
  )

Code

# ── Step 5: Cox Proportional Hazards model ────────────────────────
cox_data <- tte %>%
  filter(!is.na(facility_cat), !is.na(state),
         facility_cat %in% top_cats)

cox <- coxph(
  Surv(tte_days, event) ~ facility_cat + state,
  data = cox_data
)

# ── Step 6: Hazard ratio table ────────────────────────────────────
cox_tidy <- broom::tidy(cox, exponentiate = TRUE,
                         conf.int = TRUE) %>%
  mutate(
    term   = stringr::str_remove(term, "facility_cat|state"),
    signif = case_when(
      p.value < 0.001 ~ "***",
      p.value < 0.01  ~ "**",
      p.value < 0.05  ~ "*",
      p.value < 0.10  ~ ".",
      TRUE            ~ ""
    )
  ) %>%
  dplyr::select(term, estimate, conf.low, conf.high,
                p.value, signif) %>%
  mutate(across(c(estimate, conf.low, conf.high), ~ round(.x, 3)),
         p.value = round(p.value, 4))

kableExtra::kable(cox_tidy,
                  col.names = c("Covariate", "Hazard Ratio",
                                "Lower 95% CI", "Upper 95% CI",
                                "p-value", "Significance"),
                  caption   = "Table SA1: Cox PH Model — Hazard Ratios by Facility Category and State") %>%
  kableExtra::kable_styling(
    bootstrap_options = c("striped", "hover", "condensed"),
    full_width        = TRUE
  ) %>%
  kableExtra::column_spec(2, bold = TRUE, color = "#2c7fb8") %>%
  kableExtra::footnote(
    general = "Reference: lowest-risk facility type and state. HR > 1 = higher recurrence hazard. *** p<0.001, ** p<0.01, * p<0.05"
  )

Table SA1: Cox PH Model — Hazard Ratios by Facility Category and State
Covariate	Hazard Ratio	Lower 95% CI	Upper 95% CI	p-value	Significance
Manifold	1.315	0.183	9.455	0.7854
Pipeline (Gas)	0.417	0.178	0.974	0.0433	*
Pipeline (Oil)	2.262	1.766	2.896	0.0000	***
Well Head	0.281	0.088	0.896	0.0319	*
Delta	0.273	0.135	0.554	0.0003	***
Imo	0.162	0.066	0.398	0.0001	***
Rivers	0.544	0.433	0.685	0.0000	***
Note:
Reference: lowest-risk facility type and state. HR > 1 = higher recurrence hazard. * p<0.001, p<0.01, * p<0.05

Code

# ── Step 7: Proportional hazards assumption check ────────────────
ph_test <- cox.zph(cox)
ph_df_summary <- as.data.frame(ph_test$table) %>%
  tibble::rownames_to_column("Covariate") %>%
  dplyr::rename(
    `Chi-squared` = chisq,
    `Degrees of Freedom` = df,
    `p-value` = p
  ) %>%
  dplyr::mutate(
    `Chi-squared` = round(`Chi-squared`, 3),
    `p-value`     = round(`p-value`, 5),
    Result        = ifelse(`p-value` > 0.05,
                           "Assumption holds",
                           "Potential violation")
  )

kableExtra::kable(ph_df_summary,
                  caption = "Schoenfeld Residuals: Proportional Hazards Assumption Test") %>%
  kableExtra::kable_styling(
    bootstrap_options = c("striped", "condensed"),
    full_width = FALSE
  ) %>%
  kableExtra::column_spec(1, bold = TRUE) %>%
  kableExtra::column_spec(5,
    color = ifelse(ph_df_summary$`p-value` > 0.05,
                   "#27ae60", "#e05c2a"))

Schoenfeld Residuals: Proportional Hazards Assumption Test
Covariate	Chi-squared	Degrees of Freedom	p-value	Result
facility_cat	14.858	4	0.00501	Potential violation
state	19.183	3	0.00025	Potential violation
GLOBAL	27.850	7	0.00023	Potential violation

Code

ph_mat  <- ph_test$y
ph_time <- ph_test$time

if (!is.null(ph_mat) && nrow(ph_mat) > 0 &&
    length(ph_time) > 0) {

  ph_df <- as.data.frame(ph_mat) %>%
    mutate(time = ph_time) %>%
    tidyr::pivot_longer(
      cols      = -time,
      names_to  = "covariate",
      values_to = "residual"
    ) %>%
    filter(!is.na(residual), !is.na(time))

  print(
    ggplot(ph_df, aes(x = time, y = residual)) +
      geom_point(alpha = 0.3, size = 0.8, colour = "#2c7fb8") +
      geom_smooth(method = "loess", se = TRUE,
                  colour = "#e05c2a", fill = "#f4c6b0",
                  linewidth = 0.8) +
      geom_hline(yintercept = 0, linetype = "dashed",
                 colour = "grey50") +
      facet_wrap(~ covariate, scales = "free_y", ncol = 2) +
      labs(
        title    = "Schoenfeld Residuals: Proportional Hazards Assumption Check",
        subtitle = "Flat LOESS line = assumption satisfied | Trending line = assumption may be violated",
        x        = "Time",
        y        = "Scaled Schoenfeld Residual"
      ) +
      theme_minimal(base_size = 10) +
      theme(plot.title = element_text(face = "bold"))
  )

} else {
  cat("Schoenfeld residual plot skipped — insufficient data.\n")
}

Code

# ── Step 8: High-risk facility summary table ──────────────────────
high_risk <- tte %>%
  group_by(facility, state, area, facility_cat) %>%
  summarise(
    n_intervals     = n(),
    median_tte_days = median(tte_days, na.rm = TRUE),
    mean_tte_days   = round(mean(tte_days, na.rm = TRUE), 1),
    .groups         = "drop"
  ) %>%
  filter(n_intervals >= 3) %>%
  arrange(median_tte_days) %>%
  slice_head(n = 20) %>%
  rename(
    Facility                        = facility,
    State                           = state,
    Area                            = area,
    `Facility Type`                 = facility_cat,
    `Repeat Intervals`              = n_intervals,
    `Median Days Between Incidents` = median_tte_days,
    `Mean Days Between Incidents`   = mean_tte_days
  )

kableExtra::kable(high_risk,
                  caption = "Table SA2: Top 20 Highest-Recurrence Facilities (Minimum 3 Inter-Incident Intervals)") %>%
  kableExtra::kable_styling(
    bootstrap_options = c("striped", "hover", "condensed"),
    full_width        = TRUE
  ) %>%
  kableExtra::column_spec(6, bold = TRUE, color = "#e05c2a") %>%
  kableExtra::row_spec(1:5, background = "#fef3f0")

Table SA2: Top 20 Highest-Recurrence Facilities (Minimum 3 Inter-Incident Intervals)
Facility	State	Area	Facility Type	Repeat Intervals	Median Days Between Incidents	Mean Days Between Incidents
24'' Ogoda - Brass Pipeline	NA	NA	NA	3	1.0	4.7
Ebocha 4Ls & Ss flowline	Rivers	LAR	Flowline	4	1.5	1.5
18" Tebidaba/Brass Pipeline	Bayelsa	SAR	Pipeline (Oil)	131	2.0	6.8
10" Clough Creek/Tebidaba Pipeline	Bayelsa	SAR	Pipeline (Oil)	141	3.0	5.6
6'' Tuomo/Ogboinbiri pipeline	Bayelsa	SAR	NA	25	4.0	38.1
18" Tebidaba Brass pipeline	NA	NA	NA	13	5.0	81.2
18" Tebidaba/Brass Pipeline	Bayelsa	SAR	NA	86	6.0	17.2
10" Clough Creek/Tebidaba Pipeline	Bayelsa	SAR	NA	102	7.0	13.5
14'' Ogboinbiri/ Tebidaba P/L	NA	NA	NA	6	9.5	79.3
14" Ogboinbiri/Tebidaba pipeline	Bayelsa	SAR	NA	29	10.0	37.7
24" Ogoda/Brass Pipeline	Bayelsa	SAR	NA	3	10.0	17.3
18" EOC/Ogoda Pipeline	Rivers	LAR	Pipeline (Oil)	18	10.5	45.4
6'' Tuomo/Ogboinbiri flowline	Bayelsa	SAR	Flowline	10	10.5	54.3
4'' Oshie 15Ss flowline	Rivers	LAR	NA	13	11.0	78.1
6'' Tuomo/Ogboinbiri pipeline	Bayelsa	LAR	NA	9	11.0	41.3
Obiafu 41/42T flowline	Rivers	LAR	NA	4	12.0	21.0
14'' Akri/Ebocha pipeline	Imo	LAR	NA	7	13.0	43.1
24" Ogoda/Brass Pipeline	Bayelsa	SAR	Pipeline (Oil)	6	13.5	33.7
24" Ogoda/Brass Pipeline	Rivers	LAR	NA	29	15.0	38.8
Ebocha 5T flowline	Rivers	LAR	Flowline	7	16.0	249.7

Code

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from lifelines import KaplanMeierFitter, CoxPHFitter
from lifelines.statistics import logrank_test

inc_py = r.incidents_py
inc_py["date_incident_dt"] = pd.to_datetime(
    inc_py["date_incident_str"], errors="coerce"
)

tte_py = (
    inc_py.dropna(subset=["facility", "date_incident_dt"])
    .sort_values(["facility", "date_incident_dt"])
    .copy()
)

tte_py["prev_incident"] = tte_py.groupby("facility")[
    "date_incident_dt"].shift(1)
tte_py["tte_days"]      = (
    tte_py["date_incident_dt"] - tte_py["prev_incident"]).dt.days
tte_py                  = tte_py.dropna(subset=["tte_days"])
tte_py                  = tte_py[tte_py["tte_days"] > 0].copy()
tte_py["event"]         = 1

print(f"Inter-incident intervals available: {len(tte_py):,}")

Inter-incident intervals available: 1,084

Code

print(f"Unique facilities with repeat incidents: {tte_py['facility'].nunique():,}")

Unique facilities with repeat incidents: 147

Code

print(f"Median inter-incident time: {tte_py['tte_days'].median():.0f} days")

Median inter-incident time: 13 days

Code

print(f"Mean inter-incident time:   {tte_py['tte_days'].mean():.1f} days")

Mean inter-incident time:   117.9 days

Code

# ── Step 2: KM curves by facility category ────────────────────────
top_cats_py = (
    tte_py.dropna(subset=["facility_cat"])
    .groupby("facility_cat").size()
    .sort_values(ascending=False)
    .head(5).index.tolist()
)

fig, ax = plt.subplots(figsize=(10, 5))
kmf_py  = KaplanMeierFitter()
palette = ["#2c7fb8", "#e05c2a", "#27ae60", "#9b59b6", "#e67e22"]

for i, cat in enumerate(top_cats_py):
    sub = tte_py[tte_py["facility_cat"] == cat]
    if len(sub) >= 5:
        kmf_py.fit(sub["tte_days"],
                   event_observed=sub["event"],
                   label=str(cat))
        kmf_py.plot_survival_function(
            ax=ax, ci_show=False,
            color=palette[i % len(palette)])

<lifelines.KaplanMeierFitter:"Pipeline (Oil)", fitted with 367 total observations, 0 right-censored observations>
<Axes: xlabel='timeline'>
<lifelines.KaplanMeierFitter:"Flowline", fitted with 113 total observations, 0 right-censored observations>
<Axes: xlabel='timeline'>
<lifelines.KaplanMeierFitter:"Pipeline (Gas)", fitted with 6 total observations, 0 right-censored observations>
<Axes: xlabel='timeline'>

Code

ax.set_xlim(0, 500)

(0.0, 500.0)

Code

ax.set_title("Kaplan-Meier: Time Between Incidents by Facility Category",
             fontsize=12, fontweight="bold")

Text(0.5, 1.0, 'Kaplan-Meier: Time Between Incidents by Facility Category')

Code

ax.set_xlabel("Days Since Previous Incident")

Text(0.5, 0, 'Days Since Previous Incident')

Code

ax.set_ylabel("Probability of Remaining Incident-Free")

Text(0, 0.5, 'Probability of Remaining Incident-Free')

Code

ax.grid(True, alpha=0.3)
ax.legend(title="Facility Category", fontsize=8)

<matplotlib.legend.Legend object at 0x00000173F4289DD0>

Code

plt.tight_layout()
plt.show()

Code

# ── Step 3: KM by operational area with log-rank test ────────────
lar = tte_py[tte_py["area"] == "LAR"]
sar = tte_py[tte_py["area"] == "SAR"]

lr = logrank_test(
    lar["tte_days"], sar["tte_days"],
    event_observed_A=lar["event"],
    event_observed_B=sar["event"]
)

fig, ax = plt.subplots(figsize=(10, 5))
for area, data, col in [("LAR", lar, "#2c7fb8"),
                         ("SAR", sar, "#e05c2a")]:
    kmf_a = KaplanMeierFitter()
    kmf_a.fit(data["tte_days"],
              event_observed=data["event"],
              label=area)
    kmf_a.plot_survival_function(ax=ax, ci_show=True,
                                  color=col, alpha=0.8)

<lifelines.KaplanMeierFitter:"LAR", fitted with 446 total observations, 0 right-censored observations>
<Axes: xlabel='timeline'>
<lifelines.KaplanMeierFitter:"SAR", fitted with 611 total observations, 0 right-censored observations>
<Axes: xlabel='timeline'>

Code

ax.set_xlim(0, 500)

(0.0, 500.0)

Code

ax.set_title(
    "Kaplan-Meier: Time Between Incidents by Operational Area (LAR vs SAR)",
    fontsize=12, fontweight="bold")

Text(0.5, 1.0, 'Kaplan-Meier: Time Between Incidents by Operational Area (LAR vs SAR)')

Code

ax.set_xlabel("Days Since Previous Incident")

Text(0.5, 0, 'Days Since Previous Incident')

Code

ax.set_ylabel("Probability of Remaining Incident-Free")

Text(0, 0.5, 'Probability of Remaining Incident-Free')

Code

ax.text(0.98, 0.98, f"Log-rank p = {lr.p_value:.4f}",
        transform=ax.transAxes, ha="right", va="top",
        fontsize=10,
        bbox=dict(boxstyle="round,pad=0.3",
                  facecolor="white", alpha=0.8))

Text(0.98, 0.98, 'Log-rank p = 0.0000')

Code

ax.grid(True, alpha=0.3)
ax.legend()

<matplotlib.legend.Legend object at 0x00000173F4243A90>

Code

plt.tight_layout()
plt.show()

Code

print(f"\nLog-rank LAR vs SAR: statistic={lr.test_statistic:.3f}, "
      f"p={lr.p_value:.4f}")


Log-rank LAR vs SAR: statistic=240.625, p=0.0000

Code

# ── Step 4: Cox Proportional Hazards model ────────────────────────
cox_data_py = (
    tte_py[["tte_days", "event", "facility_cat", "state"]]
    .dropna()
    .query("facility_cat in @top_cats_py")
    .copy()
)

cox_dummies = pd.get_dummies(
    cox_data_py,
    columns=["facility_cat", "state"],
    drop_first=True
)

cph = CoxPHFitter()
cph.fit(cox_dummies, duration_col="tte_days", event_col="event")

<lifelines.CoxPHFitter: fitted with 490 total observations, 0 right-censored observations>

Code

print("\nCox PH Model Summary:")


Cox PH Model Summary:

Code

cph.print_summary()

<lifelines.CoxPHFitter: fitted with 490 total observations, 0 right-censored observations>
             duration col = 'tte_days'
                event col = 'event'
      baseline estimation = breslow
   number of observations = 490
number of events observed = 490
   partial log-likelihood = -2465.47
         time fit was run = 2026-05-12 14:59:53 UTC

---
                             coef exp(coef)  se(coef)  coef lower 95%  coef upper 95% exp(coef) lower 95% exp(coef) upper 95%
covariate                                                                                                                    
facility_cat_Manifold        0.27      1.32      1.01           -1.70            2.25                0.18                9.45
facility_cat_Pipeline (Gas) -0.88      0.42      0.43           -1.72           -0.03                0.18                0.97
facility_cat_Pipeline (Oil)  0.82      2.26      0.13            0.57            1.06                1.77                2.90
facility_cat_Well Head      -1.27      0.28      0.59           -2.43           -0.11                0.09                0.90
state_Delta                 -1.30      0.27      0.36           -2.01           -0.59                0.13                0.55
state_Imo                   -1.82      0.16      0.46           -2.71           -0.92                0.07                0.40
state_Rivers                -0.61      0.54      0.12           -0.84           -0.38                0.43                0.68

                             cmp to     z      p  -log2(p)
covariate                                                 
facility_cat_Manifold          0.00  0.27   0.79      0.35
facility_cat_Pipeline (Gas)    0.00 -2.02   0.04      4.53
facility_cat_Pipeline (Oil)    0.00  6.47 <0.005     33.26
facility_cat_Well Head         0.00 -2.15   0.03      4.97
state_Delta                    0.00 -3.59 <0.005     11.59
state_Imo                      0.00 -3.98 <0.005     13.81
state_Rivers                   0.00 -5.19 <0.005     22.19
---
Concordance = 0.68
Partial AIC = 4944.95
log-likelihood ratio test = 167.60 on 7 df
-log2(p) of ll-ratio test = 106.62

Code

fig, ax = plt.subplots(figsize=(9, 5))
cph.plot(ax=ax)

<Axes: xlabel='log(HR) (95% CI)'>

Code

ax.set_title(
    "Cox PH Model: Hazard Ratios by Facility Category and State",
    fontsize=11, fontweight="bold")

Text(0.5, 1.0, 'Cox PH Model: Hazard Ratios by Facility Category and State')

Code

ax.axvline(0, linestyle="--", color="grey", linewidth=0.8)

<matplotlib.lines.Line2D object at 0x00000173F4132450>

Code

plt.tight_layout()
plt.show()

Code

# ── Step 5: High-recurrence facility table ────────────────────────
high_risk_py = (
    tte_py.groupby(["facility", "state", "area", "facility_cat"])
    .agg(
        n_intervals     = ("tte_days", "count"),
        median_tte_days = ("tte_days", "median"),
        mean_tte_days   = ("tte_days", "mean")
    )
    .reset_index()
    .query("n_intervals >= 3")
    .sort_values("median_tte_days")
    .head(20)
    .round(1)
)

print("\nTop 20 Highest-Recurrence Facilities:")


Top 20 Highest-Recurrence Facilities:

Code

print(high_risk_py.to_string(index=False))

                               facility   state area   facility_cat  n_intervals  median_tte_days  mean_tte_days
               Ebocha 4Ls & Ss flowline  Rivers  LAR       Flowline            4              1.5            1.5
            18" Tebidaba/Brass Pipeline Bayelsa  SAR Pipeline (Oil)          131              2.0            6.8
     10" Clough Creek/Tebidaba Pipeline Bayelsa  SAR Pipeline (Oil)          141              3.0            5.6
                 18" EOC/Ogoda Pipeline  Rivers  LAR Pipeline (Oil)           18             10.5           45.4
          6'' Tuomo/Ogboinbiri flowline Bayelsa  SAR       Flowline           10             10.5           54.3
               24" Ogoda/Brass Pipeline Bayelsa  SAR Pipeline (Oil)            6             13.5           33.7
                     Ebocha 5T flowline  Rivers  LAR       Flowline            7             16.0          249.7
               24" Ogoda/Brass Pipeline  Rivers  LAR Pipeline (Oil)           19             18.0           29.1
       14" Ogboinbiri/Tebidaba pipeline Bayelsa  SAR Pipeline (Oil)           14             20.0           84.9
                   Ebegoro 11T flowline  Rivers  LAR       Flowline            9             28.0           27.6
                  Ebegoro 12Ls flowline  Rivers  LAR       Flowline            6             31.5          132.5
              Taylor Creek 2Ls flowline Bayelsa  LAR       Flowline            4             45.0           44.2
                    Oshie 15Ss flowline  Rivers  LAR       Flowline            5             69.0          110.2
               18" Obama/Brass Pipeline Bayelsa  SAR Pipeline (Oil)            6             69.5          208.7
               10" Oshie/Ogoda Pipeline  Rivers  LAR Pipeline (Oil)            5             80.0          136.6
              Taylor Creek 2Ss flowline Bayelsa  LAR       Flowline            5             87.0          130.0
14'' Ogboinbiri/Tebidaba pipeline riser Bayelsa  SAR Pipeline (Oil)            3             94.0          118.0
                10" Kwale/Akri Pipeline   Delta  LAR Pipeline (Oil)            4             96.5          163.2
                    Ebegoro 2T flowline  Rivers  LAR       Flowline            3            115.0          162.7
       6'' Ogboinbiri/Tebidaba pipeline Bayelsa  SAR Pipeline (Oil)            4            150.5          252.8

8.4 Management Interpretation

What inter-incident time means and why it matters more than incident count

The total number of incidents at a facility tells you how much has gone wrong historically. The time between incidents tells you how quickly things go wrong again after a repair or intervention. Consider two facilities, each with ten incidents over seven years:

Facility A experiences incidents spread evenly at roughly ten-month intervals — a low but steady recurrence pattern suggesting manageable wear and predictable maintenance needs
Facility B experiences all ten incidents within eighteen months, followed by a long quiet period — a structural vulnerability that burst into recurring failure before being addressed

Survival analysis separates these two profiles explicitly. It is Facility B that should command priority in the 2026 inspection programme, not simply the facility with the highest total count.

What the Kaplan-Meier curves reveal

Each curve shows the probability that a facility of a given type or in a given area will remain incident-free as time passes after a previous incident. Reading the curves operationally:

A curve that drops to 50% within 60 days means half of all incidents at that facility type are followed by another incident within two months. This is a structural recurrence pattern
A curve that remains above 80% for 200 or more days means the asset class is generally resilient and re-incidents are relatively infrequent
The gap between the LAR and SAR curves, combined with the log-rank p-value, determines whether one operational area carries a structurally higher recurrence risk than the other and should therefore receive a larger share of inspection resources

What the Cox model adds

The Kaplan-Meier curves show what is happening visually. The Cox model quantifies the effect of each covariate while holding all others constant. Three practical implications follow from the hazard ratio table:

A hazard ratio above 1.0 for a facility category means that asset class re-incidents faster than the reference category, all else being equal. The magnitude of the ratio indicates how much faster
A statistically significant ratio (p less than 0.05) means the difference is unlikely to be explained by chance and can be used to justify differentiated budget allocation
A non-significant ratio means the difference in recurrence rate between that category and the reference is not reliably distinguishable from random variation in the data

What the proportional hazards test confirms

The Schoenfeld residuals test checks whether the hazard ratios are stable over time. A non-significant p-value for each covariate confirms that the ratios hold across the full 2019 to 2025 observation window and can be applied to the 2026 planning horizon with confidence. If any covariate fails the test, its hazard ratio should be interpreted with caution and a time-varying coefficient model considered as an extension in future work.

Recommendation feeding into Section 10

Two concrete outputs from this section feed into the integrated recommendation:

The facilities in the bottom quartile of median inter-incident time from Table SA2 constitute the mandatory inclusion list for the 2026 Q1 integrity inspection programme
The facility category and state with the highest statistically significant hazard ratio from the Cox model should receive a proportionally larger share of the integrity inspection budget relative to lower-hazard asset classes

9. Technique 5: Association Rules (Operational Co-occurrence Analysis)

9.1 Technique Overview

Association rule mining is a data mining technique designed to discover interesting relationships, co-occurrence patterns, and dependencies between variables in large transactional datasets (Marr, 2017; Agrawal and Srikant, 1994). Originally developed for market basket analysis — identifying which products customers tend to purchase together — the technique has broad applicability wherever the goal is to find combinations of attributes that co-occur more frequently than chance would predict. In an operational context, it answers the question: given that an incident has a particular cause, location, and facility type, what else is reliably true about it?

The Apriori algorithm, the standard implementation used here, works in three stages:

Identify frequent itemsets: All attribute combinations that meet a minimum frequency threshold (support) are retained. The minimum support of 0.02 used here ensures each rule covers at least 34 of the 1,695 incidents, providing a statistically reliable foundation
Evaluate confidence: For each frequent itemset, the algorithm assesses how reliably the antecedent attributes predict the consequent attribute. Only combinations above the minimum confidence threshold of 0.60 are retained as rules
Rank by lift: Rules are ranked by how much more likely the consequent is given the antecedent than it would be by chance alone. Lift is the primary operational metric because it identifies genuinely non-obvious co-occurrences rather than simply frequent ones

Three metrics define every rule:

Support: The proportion of all incidents in which the complete rule, both antecedent and consequent, appears together. A support of 0.05 means the combination appears in at least 5% of all 1,695 incidents. Low support rules may be statistically unreliable and are excluded by the minimum threshold
Confidence: The conditional probability that the consequent appears given the antecedent. A confidence of 0.80 for the rule {Facility: Pipeline (Oil), Area: LAR} → {Sabotage} means that in 80% of incidents where the facility is an oil pipeline in the Lagos Area, the incident is classified as Sabotage. High confidence identifies reliable operational patterns
Lift: The ratio of the observed confidence to the confidence that would be expected if the antecedent and consequent were statistically independent. A lift of 3.0 means the consequent is three times more likely given the antecedent than it would be by chance. A lift of 1.0 means the rule adds no predictive value beyond the base rate

The minimum lift threshold of 1.2 and the focus on rules with Sabotage as the consequent are both operationally motivated. The goal is to identify attribute combinations that are meaningfully predictive of the incident type most damaging to production continuity.

9.2 Operational Justification

Sabotage accounts for 74% of all incidents in the OERNL dataset, but not all sabotage incidents are equivalent in their operational impact. The structured fields — cause, location, facility category, area, state — are currently reviewed individually during incident investigations. They have never been analysed simultaneously to identify which combinations of attributes reliably co-occur with high-severity sabotage outcomes.

Association rule mining addresses this gap in three ways:

Security targeting: Rules with high lift linking specific cause types and locations to Sabotage classification identify the geographic and operational contexts where third-party interference is most concentrated. These rules feed directly into the security patrol schedule and community engagement targeting strategy for 2026
Project scope refinement: Rules linking facility category and cause type to cleanup requirements or production deferment inform the scope of capital maintenance projects. If a high-lift rule identifies that hacksaw cuts at oil pipelines in Bayelsa reliably require cleanup, that combination should be explicitly scoped in the Kwale Flaredown and Brass remediation project budgets rather than treated as a contingency item
Emergency response protocol design: Frequent itemsets linking incident type combinations to response actions allow the department to pre-position resources and design differentiated response protocols rather than applying a single generic response to all sabotage incidents regardless of their specific attribute profile

This section addresses sub-question (e) of the central analytical question: which combinations of cause type, location, and facility category are most strongly associated with Sabotage classification, and how should these combinations inform the 2026 security and community engagement targeting strategy?

9.3 Analysis and Outputs

Code

# ── Step 1: Build transaction dataset ────────────────────────────
tx <- incidents %>%
  dplyr::transmute(
    cause        = stringr::word(stringr::str_to_title(
                     stringr::str_squish(cause)), 1, 4),
    facility_cat = as.character(facility_cat),
    area         = as.character(area),
    state        = as.character(state),
    cleanup      = as.character(cleanup_required),
    category     = as.character(category)
  ) %>%
  mutate(across(everything(),
                ~ ifelse(is.na(.x) | .x == "NA", "Unknown", .x))) %>%
  mutate(
    cause        = paste0("cause=",    cause),
    facility_cat = paste0("fac=",      facility_cat),
    area         = paste0("area=",     area),
    state        = paste0("state=",    state),
    cleanup      = paste0("cleanup=",  cleanup),
    category     = paste0("category=", category)
  )

# ── Step 2: Convert to transactions object ────────────────────────
trans <- as(tx, "transactions")

# ── Step 3: Inspect most frequent items ──────────────────────────
item_freq <- itemFrequency(trans, type = "relative")
top_items <- sort(item_freq, decreasing = TRUE)[1:20]

clean_item_label <- function(x) {
  x %>%
    stringr::str_replace_all("facility_cat=fac=", "Facility: ") %>%
    stringr::str_replace_all("cause=cause=",       "Cause: ")   %>%
    stringr::str_replace_all("area=area=",         "Area: ")    %>%
    stringr::str_replace_all("state=state=",       "State: ")   %>%
    stringr::str_replace_all("cleanup=cleanup=",   "Cleanup: ") %>%
    stringr::str_replace_all("category=category=", "Category: ") %>%
    stringr::str_replace_all("fac=",               "Facility: ") %>%
    stringr::str_replace_all("cause=",             "Cause: ")   %>%
    stringr::str_replace_all("area=",              "Area: ")    %>%
    stringr::str_replace_all("state=",             "State: ")   %>%
    stringr::str_replace_all("cleanup=",           "Cleanup: ") %>%
    stringr::str_replace_all("category=",          "Category: ")
}

top_items_df <- tibble::tibble(
  item      = clean_item_label(names(top_items)),
  frequency = as.numeric(top_items)
)

ggplot(top_items_df,
       aes(x = frequency,
           y = forcats::fct_reorder(item, frequency))) +
  geom_col(fill = "#2c7fb8", width = 0.7) +
  geom_text(aes(label = scales::percent(frequency, accuracy = 0.1)),
            hjust = -0.1, size = 3, fontface = "bold") +
  scale_x_continuous(labels = scales::percent,
                     expand  = expansion(mult = c(0, 0.15))) +
  labs(
    title    = "Top 20 Most Frequent Items Across All Incident Transactions",
    subtitle = "Item frequency as proportion of all 1,695 incidents",
    x        = "Support (Proportion of Incidents)",
    y        = NULL
  ) +
  theme_minimal(base_size = 11) +
  theme(
    plot.title         = element_text(face = "bold"),
    panel.grid.major.y = element_blank()
  )

Code

# ── Step 4: Mine association rules ────────────────────────────────
rules <- apriori(
  trans,
  parameter = list(
    supp   = 0.02,
    conf   = 0.60,
    minlen = 2,
    maxlen = 4
  ),
  control = list(verbose = FALSE)
)

# ── Step 5: Filter to rules with Sabotage as consequent ──────────
sabotage_rules <- subset(
  rules,
  subset = rhs %pin% "category=Sabotage" & lift > 1.2
)

top_rules <- head(sort(sabotage_rules, by = "lift"), 20)

# ── Step 6: Display top rules as clean table ──────────────────────
clean_rule_text <- function(x) {
  x %>%
    stringr::str_replace_all("facility_cat=fac=", "Facility: ") %>%
    stringr::str_replace_all("cause=cause=",       "Cause: ")   %>%
    stringr::str_replace_all("area=area=",         "Area: ")    %>%
    stringr::str_replace_all("state=state=",       "State: ")   %>%
    stringr::str_replace_all("cleanup=cleanup=",   "Cleanup: ") %>%
    stringr::str_replace_all("category=category=", "")          %>%
    stringr::str_replace_all("fac=",               "Facility: ") %>%
    stringr::str_replace_all("cause=",             "Cause: ")   %>%
    stringr::str_replace_all("area=",              "Area: ")    %>%
    stringr::str_replace_all("state=",             "State: ")   %>%
    stringr::str_replace_all("cleanup=",           "Cleanup: ") %>%
    stringr::str_replace_all("category=",          "")
}

rules_df <- as(top_rules, "data.frame") %>%
  dplyr::arrange(desc(lift)) %>%
  dplyr::mutate(
    Rule       = clean_rule_text(as.character(rules)),
    Support    = round(support, 4),
    Confidence = round(confidence, 3),
    Lift       = round(lift, 3),
    Count      = as.integer(count)
  ) %>%
  dplyr::select(Rule, Support, Confidence, Lift, Count)

kableExtra::kable(rules_df,
                  col.names = c("Rule", "Support", "Confidence",
                                "Lift", "Count"),
                  caption   = "Table AR1: Top Association Rules — Sabotage as Consequent (Ranked by Lift)") %>%
  kableExtra::kable_styling(
    bootstrap_options = c("striped", "hover", "condensed"),
    full_width        = TRUE,
    font_size         = 11
  ) %>%
  kableExtra::column_spec(1, width = "55%") %>%
  kableExtra::column_spec(4, bold = TRUE, color = "#e05c2a") %>%
  kableExtra::footnote(
    general = "Support = proportion of all 1,695 incidents containing the full rule. Confidence = probability of Sabotage given the antecedent conditions. Lift > 1.0 = association stronger than chance."
  )

Table AR1: Top Association Rules — Sabotage as Consequent (Ranked by Lift)
Rule	Support	Confidence	Lift	Count
{Facility: Pipeline (Oil),Area: LAR,Cleanup: YES} => {Sabotage}	0.0313	0.964	1.305	53
{Facility: Pipeline (Oil),State: Rivers,Cleanup: YES} => {Sabotage}	0.0271	0.958	1.297	46
{Facility: Unknown,Area: SAR,Cleanup: YES} => {Sabotage}	0.0401	0.958	1.297	68
{Facility: Unknown,State: Rivers,Cleanup: YES} => {Sabotage}	0.0460	0.951	1.288	78
{Facility: Unknown,State: Bayelsa,Cleanup: YES} => {Sabotage}	0.0454	0.928	1.256	77
{Facility: Unknown,Cleanup: YES} => {Sabotage}	0.0956	0.920	1.246	162
{Cause: Unknown,Facility: Unknown,Cleanup: YES} => {Sabotage}	0.0944	0.920	1.245	160
{Facility: Pipeline (Oil),Cleanup: YES} => {Sabotage}	0.0844	0.911	1.233	143
{Cause: Unknown,State: Rivers,Cleanup: YES} => {Sabotage}	0.1068	0.910	1.231	181
{Cause: Unknown,Facility: Pipeline (Oil),Cleanup: YES} => {Sabotage}	0.0826	0.909	1.231	140
{State: Rivers,Cleanup: YES} => {Sabotage}	0.1068	0.905	1.225	181
{Area: LAR,State: Rivers,Cleanup: YES} => {Sabotage}	0.1068	0.905	1.225	181
{Facility: Pipeline (Oil),State: Rivers} => {Sabotage}	0.0324	0.902	1.221	55
{Facility: Pipeline (Oil),Area: LAR,State: Rivers} => {Sabotage}	0.0324	0.902	1.221	55
{Cause: Unknown,Facility: Pipeline (Oil),State: Rivers} => {Sabotage}	0.0324	0.902	1.221	55
{Facility: Flowline,State: Rivers,Cleanup: YES} => {Sabotage}	0.0319	0.900	1.218	54
{Facility: Pipeline (Oil),Area: SAR} => {Sabotage}	0.1923	0.898	1.216	326
{Facility: Pipeline (Oil),State: Bayelsa} => {Sabotage}	0.1923	0.898	1.216	326
{Facility: Pipeline (Oil),Area: SAR,State: Bayelsa} => {Sabotage}	0.1923	0.898	1.216	326
{Facility: Pipeline (Oil),Area: SAR,Cleanup: Unknown} => {Sabotage}	0.1286	0.897	1.215	218
Note:
Support = proportion of all 1,695 incidents containing the full rule. Confidence = probability of Sabotage given the antecedent conditions. Lift > 1.0 = association stronger than chance.

Code

# ── Step 7: All rules scatter plot ───────────────────────────────
all_rules_df <- as(rules, "data.frame") %>%
  dplyr::mutate(
    is_sabotage = stringr::str_detect(rules, "category=Sabotage")
  )

ggplot(all_rules_df,
       aes(x = support, y = confidence,
           colour = lift, size = lift,
           alpha  = is_sabotage)) +
  geom_point() +
  scale_colour_gradient(low  = "#a8c8e8",
                        high = "#e05c2a",
                        name = "Lift") +
  scale_size_continuous(range = c(0.5, 4), name = "Lift") +
  scale_alpha_manual(values = c("FALSE" = 0.2, "TRUE" = 0.8),
                     name   = "Sabotage Rule") +
  scale_x_continuous(labels = scales::percent) +
  scale_y_continuous(labels = scales::percent) +
  labs(
    title    = "Association Rules: Support vs Confidence (Colour = Lift)",
    subtitle = "Orange-highlighted points = rules with Sabotage as consequent",
    x        = "Support",
    y        = "Confidence"
  ) +
  theme_minimal(base_size = 11) +
  theme(plot.title = element_text(face = "bold"))

Code

# ── Step 8: Interactive network graph via arulesViz ───────────────
grDevices::pdf(nullfile())
invisible(capture.output({
  network_widget <- plot(top_rules,
                         method  = "graph",
                         engine  = "htmlwidget",
                         control = list(
                           main    = "Association Rule Network: Top 20 Sabotage Rules by Lift",
                           verbose = FALSE
                         ))
}))
grDevices::dev.off()

png 
  2

Code

widget_path <- here::here("output", "ar_network.html")
htmlwidgets::saveWidget(network_widget, widget_path,
                        selfcontained = TRUE)
htmltools::tags$iframe(
  src    = widget_path,
  width  = "100%",
  height = "550px",
  style  = "border:none;"
)

Code

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.ticker as mticker
from mlxtend.frequent_patterns import apriori, association_rules
from mlxtend.preprocessing import TransactionEncoder

inc_py = r.incidents_py

cols = ["cause", "facility_cat", "area", "state",
        "cleanup_required", "category"]

tx_py = inc_py[cols].fillna("Unknown").astype(str).copy()

tx_py["cause"] = tx_py["cause"].str.split().str[:4].str.join(" ")

prefix_map = {
    "cause":            "cause=",
    "facility_cat":     "fac=",
    "area":             "area=",
    "state":            "state=",
    "cleanup_required": "cleanup=",
    "category":         "category="
}
for col, prefix in prefix_map.items():
    tx_py[col] = prefix + tx_py[col]

tx_list = tx_py.values.tolist()

te = TransactionEncoder()
te.fit(tx_list)

TransactionEncoder()

Code

X  = pd.DataFrame(te.transform(tx_list), columns=te.columns_)

freq_items           = apriori(X, min_support=0.02, use_colnames=True)
freq_items["length"] = freq_items["itemsets"].apply(len)

rules_py = association_rules(
    freq_items,
    metric        = "lift",
    min_threshold = 1.2
)

rules_py["antecedents_str"] = rules_py["antecedents"].apply(
    lambda x: ", ".join(sorted(x))
)
rules_py["consequents_str"] = rules_py["consequents"].apply(
    lambda x: ", ".join(sorted(x))
)

sab_rules = rules_py[
    rules_py["consequents_str"].str.contains("Sabotage", na=False)
].copy()

top_sab = (sab_rules
           .sort_values("lift", ascending=False)
           .head(20)
           [["antecedents_str", "consequents_str",
             "support", "confidence", "lift"]]
           .round(4))

def clean_label_py(x):
    return (x.replace("facility_cat=fac=", "Facility: ")
             .replace("cause=cause=",       "Cause: ")
             .replace("area=area=",         "Area: ")
             .replace("state=state=",       "State: ")
             .replace("cleanup=cleanup=",   "Cleanup: ")
             .replace("category=category=", "")
             .replace("fac=",               "Facility: ")
             .replace("cause=",             "Cause: ")
             .replace("area=",              "Area: ")
             .replace("state=",             "State: ")
             .replace("cleanup=",           "Cleanup: ")
             .replace("category=",          ""))

top_sab["antecedents_str"] = top_sab["antecedents_str"].apply(
    clean_label_py)
top_sab["consequents_str"] = top_sab["consequents_str"].apply(
    clean_label_py)

print("Top 20 Sabotage Rules by Lift:")

Top 20 Sabotage Rules by Lift:

Code

print(top_sab.to_string(index=False))

                                          antecedents_str                                                     consequents_str  support  confidence    lift
                Area: LAR, Cause: Oil Theft, Cleanup: YES                   Sabotage, Facility: Pipeline (Oil), State: Rivers   0.0212      0.6792 20.9331
            Cause: Oil Theft, Cleanup: YES, State: Rivers                       Area: LAR, Sabotage, Facility: Pipeline (Oil)   0.0212      0.7826 19.5077
                  Facility: Pipeline (Oil), State: Rivers                 Area: LAR, Sabotage, Cause: Oil Theft, Cleanup: YES   0.0212      0.5902 18.8741
                      Area: LAR, Facility: Pipeline (Oil)             Sabotage, Cause: Oil Theft, Cleanup: YES, State: Rivers   0.0212      0.4675 17.2276
                          Cause: Oil Theft, State: Rivers         Area: LAR, Sabotage, Cleanup: YES, Facility: Pipeline (Oil)   0.0212      0.4865 15.5584
        Area: LAR, Cleanup: YES, Facility: Pipeline (Oil)                           Sabotage, Cause: Oil Theft, State: Rivers   0.0212      0.6545 14.9926
                          Cause: Oil Theft, State: Rivers                       Area: LAR, Sabotage, Facility: Pipeline (Oil)   0.0236      0.5405 13.4738
                              Area: LAR, Cause: Oil Theft     Sabotage, Cleanup: YES, Facility: Pipeline (Oil), State: Rivers   0.0212      0.3495 12.8789
    Cleanup: YES, Facility: Pipeline (Oil), State: Rivers                               Area: LAR, Sabotage, Cause: Oil Theft   0.0212      0.7500 12.3422
                              Area: LAR, Cause: Oil Theft                   Sabotage, Facility: Pipeline (Oil), State: Rivers   0.0236      0.3883 11.9682
                      Area: LAR, Facility: Pipeline (Oil)                           Sabotage, Cause: Oil Theft, State: Rivers   0.0236      0.5195 11.8989
                  Facility: Pipeline (Oil), State: Rivers                               Area: LAR, Sabotage, Cause: Oil Theft   0.0236      0.6557 10.7910
Cause: Oil Theft, Facility: Pipeline (Oil), State: Rivers                                   Area: LAR, Sabotage, Cleanup: YES   0.0212      0.9000  7.1285
    Area: LAR, Cause: Oil Theft, Facility: Pipeline (Oil)                               Sabotage, Cleanup: YES, State: Rivers   0.0212      0.7500  7.0235
                           Cause: Oil Theft, Cleanup: YES        Area: LAR, Sabotage, Facility: Pipeline (Oil), State: Rivers   0.0212      0.2156  6.6434
                           Cause: Oil Theft, Cleanup: YES                   Sabotage, Facility: Pipeline (Oil), State: Rivers   0.0212      0.2156  6.6434
                              Cleanup: YES, State: Rivers     Area: LAR, Sabotage, Cause: Oil Theft, Facility: Pipeline (Oil)   0.0212      0.1800  6.3562
                                  Area: LAR, Cleanup: YES Sabotage, Cause: Oil Theft, Facility: Pipeline (Oil), State: Rivers   0.0212      0.1457  6.1761
                  Facility: Pipeline (Oil), State: Rivers                            Sabotage, Cause: Oil Theft, Cleanup: YES   0.0212      0.5902  5.9900
       Area: LAR, Facility: Pipeline (Oil), State: Rivers                            Sabotage, Cause: Oil Theft, Cleanup: YES   0.0212      0.5902  5.9900

Code

fig, ax = plt.subplots(figsize=(10, 6))

non_sab = rules_py[
    ~rules_py["consequents_str"].str.contains("Sabotage", na=False)
]
sab_sub = rules_py[
    rules_py["consequents_str"].str.contains("Sabotage", na=False)
]

ax.scatter(non_sab["support"], non_sab["confidence"],
           c=non_sab["lift"], cmap="Blues",
           alpha=0.3, s=20, label="Other rules")

<matplotlib.collections.PathCollection object at 0x00000173F6569E90>

Code

sc2 = ax.scatter(sab_sub["support"], sab_sub["confidence"],
                 c=sab_sub["lift"], cmap="Oranges",
                 alpha=0.8, s=40, label="Sabotage rules")

plt.colorbar(sc2, ax=ax, label="Lift")

<matplotlib.colorbar.Colorbar object at 0x00000173F6621DD0>

Code

ax.set_title(
    "Association Rules: Support vs Confidence\n"
    "(Orange = Sabotage consequent, Colour intensity = Lift)",
    fontsize=11, fontweight="bold"
)

Text(0.5, 1.0, 'Association Rules: Support vs Confidence\n(Orange = Sabotage consequent, Colour intensity = Lift)')

Code

ax.set_xlabel("Support")

Text(0.5, 0, 'Support')

Code

ax.set_ylabel("Confidence")

Text(0, 0.5, 'Confidence')

Code

ax.xaxis.set_major_formatter(mticker.PercentFormatter(xmax=1))
ax.yaxis.set_major_formatter(mticker.PercentFormatter(xmax=1))
ax.legend()

<matplotlib.legend.Legend object at 0x00000173F656B510>

Code

ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

Code

top15 = top_sab.head(15).copy()
top15["rule_label"] = (
    top15["antecedents_str"].str[:45] + " => Sabotage"
)

fig, ax = plt.subplots(figsize=(11, 6))
bars = ax.barh(range(len(top15)), top15["lift"],
               color="#e05c2a", alpha=0.85, height=0.6)
ax.set_yticks(range(len(top15)))

[<matplotlib.axis.YTick object at 0x00000173F415E7D0>, <matplotlib.axis.YTick object at 0x00000173F6378090>, <matplotlib.axis.YTick object at 0x00000173F4201F50>, <matplotlib.axis.YTick object at 0x00000173F4201150>, <matplotlib.axis.YTick object at 0x00000173F4189F90>, <matplotlib.axis.YTick object at 0x00000173F4188190>, <matplotlib.axis.YTick object at 0x00000173F41BFDD0>, <matplotlib.axis.YTick object at 0x00000173F41BE590>, <matplotlib.axis.YTick object at 0x00000173802A1150>, <matplotlib.axis.YTick object at 0x00000173F42A1150>, <matplotlib.axis.YTick object at 0x00000173F41BCC50>, <matplotlib.axis.YTick object at 0x00000173F42530D0>, <matplotlib.axis.YTick object at 0x00000173F4282E90>, <matplotlib.axis.YTick object at 0x00000173F63ED6D0>, <matplotlib.axis.YTick object at 0x00000173F41D2E90>]

Code

ax.set_yticklabels(top15["rule_label"], fontsize=8)

[Text(0, 0, 'Area: LAR, Cause: Oil Theft, Cleanup: YES => Sabotage'), Text(0, 1, 'Cause: Oil Theft, Cleanup: YES, State: Rivers => Sabotage'), Text(0, 2, 'Facility: Pipeline (Oil), State: Rivers => Sabotage'), Text(0, 3, 'Area: LAR, Facility: Pipeline (Oil) => Sabotage'), Text(0, 4, 'Cause: Oil Theft, State: Rivers => Sabotage'), Text(0, 5, 'Area: LAR, Cleanup: YES, Facility: Pipeline ( => Sabotage'), Text(0, 6, 'Cause: Oil Theft, State: Rivers => Sabotage'), Text(0, 7, 'Area: LAR, Cause: Oil Theft => Sabotage'), Text(0, 8, 'Cleanup: YES, Facility: Pipeline (Oil), State => Sabotage'), Text(0, 9, 'Area: LAR, Cause: Oil Theft => Sabotage'), Text(0, 10, 'Area: LAR, Facility: Pipeline (Oil) => Sabotage'), Text(0, 11, 'Facility: Pipeline (Oil), State: Rivers => Sabotage'), Text(0, 12, 'Cause: Oil Theft, Facility: Pipeline (Oil), S => Sabotage'), Text(0, 13, 'Area: LAR, Cause: Oil Theft, Facility: Pipeli => Sabotage'), Text(0, 14, 'Cause: Oil Theft, Cleanup: YES => Sabotage')]

Code

ax.set_xlabel("Lift")

Text(0.5, 0, 'Lift')

Code

ax.set_title(
    "Top 15 Association Rules with Sabotage as Consequent (Ranked by Lift)",
    fontsize=11, fontweight="bold"
)

Text(0.5, 1.0, 'Top 15 Association Rules with Sabotage as Consequent (Ranked by Lift)')

Code

ax.axvline(1.0, linestyle="--", color="grey", linewidth=0.8)

<matplotlib.lines.Line2D object at 0x00000173F414FC90>

Code

ax.invert_yaxis()

for bar, conf in zip(bars, top15["confidence"]):
    ax.text(bar.get_width() + 0.005,
            bar.get_y() + bar.get_height() / 2,
            f"conf={conf:.2f}",
            va="center", fontsize=7)

Text(20.9381, 0.0, 'conf=0.68')
Text(19.5127, 1.0, 'conf=0.78')
Text(18.879099999999998, 2.0, 'conf=0.59')
Text(17.232599999999998, 3.0, 'conf=0.47')
Text(15.563400000000001, 4.0, 'conf=0.49')
Text(14.9976, 5.0, 'conf=0.65')
Text(13.478800000000001, 6.0, 'conf=0.54')
Text(12.8839, 7.0, 'conf=0.35')
Text(12.3472, 8.0, 'conf=0.75')
Text(11.9732, 9.0, 'conf=0.39')
Text(11.9039, 10.0, 'conf=0.52')
Text(10.796000000000001, 11.0, 'conf=0.66')
Text(7.1335, 12.0, 'conf=0.90')
Text(7.0285, 13.0, 'conf=0.75')
Text(6.6484, 14.0, 'conf=0.22')

Code

plt.tight_layout()
plt.show()

Code

print("\nSummary statistics of all Sabotage rules:")


Summary statistics of all Sabotage rules:

Code

print(sab_rules[["support", "confidence", "lift"]].describe().round(4))

         support  confidence       lift
count  1362.0000   1362.0000  1362.0000
mean      0.0637      0.4908     2.2743
std       0.0532      0.3175     1.4935
min       0.0201      0.0399     1.2004
25%       0.0283      0.2048     1.5541
50%       0.0383      0.4432     2.0257
75%       0.0802      0.7541     2.5354
max       0.3841      1.0000    20.9331

9.4 Management Interpretation

What association rules are and how to read them

Association rule mining works like an experienced investigator who has read all 1,695 incident reports and noticed that certain combinations of facts tend to appear together far more often than chance would predict. The algorithm formalises that intuition by scanning every possible combination of cause type, location, facility category, area, state, and cleanup requirement, and surfacing only those combinations that are both frequent enough to be reliable and strong enough to be surprising.

Each rule comes with three numbers that tell a precise operational story:

Support answers: how often does this exact combination appear across all 1,695 incidents? A support of 0.03 means the combination appears in about 51 incidents — enough to be statistically reliable but specific enough to be operationally targeted
Confidence answers: when the antecedent conditions are present, how reliably does Sabotage follow? A confidence of 0.96 means sabotage classification follows in 96 out of every 100 incidents with that attribute combination
Lift answers: how much more predictive is this combination than the overall sabotage rate alone? A lift of 1.3 means the combination is 30% more predictive of sabotage than the base rate. This is the number that determines whether a rule is genuinely actionable intelligence or merely a restatement of what the base rate already tells us

What the top rules mean for 2026 planning

The rules with the highest lift values identify the most specific and most reliable combinations of incident attributes that co-occur with sabotage. These rules answer the question the security and community engagement teams most need heading into 2026: not simply where sabotage happens most often, but which precise combinations of location, facility type, and cause pattern are the strongest predictors of sabotage activity.

The top-ranked rule, {Facility: Pipeline (Oil), Area: LAR, Cleanup: YES} → {Sabotage} with confidence 0.964 and lift 1.305, carries a direct scheduling implication. When a cleanup-requiring incident occurs at an oil pipeline in the Lagos Area, it is classified as sabotage in over 96% of cases. This is not a marginal finding. Resources concentrated at oil pipelines in LAR during periods of elevated third-party activity will intercept a disproportionate share of sabotage events relative to their cost.

What the network graph reveals

The network graph visualises all 20 top rules simultaneously as a connected map. Reading it:

Nodes represent individual incident attributes (facility type, area, state, cleanup status, cause)
Edges represent rules, with arrows pointing toward the Sabotage target node
Nodes with many edges connecting to them are the most diagnostically powerful attributes. They appear in the most high-confidence sabotage rules and anchor the security targeting brief
Hovering over any edge displays the exact lift, confidence, and support for that specific rule
Colour coding distinguishes attribute type: blue for facility type, green for area, purple for state, amber for cleanup status, and grey for cause

What this means for project scope

The rules linking specific cause and facility combinations to cleanup requirements have a direct capital implication. If a high-lift rule confirms that incidents at oil pipelines in SAR reliably require cleanup, that combination should be explicitly costed and scoped in the Brass Canal Remediation and Ebocha Water Reinjection project budgets. It belongs in the base estimate as a predictable and quantifiable cost element, not in the contingency reserve.

Recommendation feeding into Section 10

Two outputs from this section feed directly into the 2026 security strategy:

The top five rules by lift from Table AR1 define the specific operational contexts that warrant elevated security presence and proactive community liaison in 2026
The antecedent combinations in those rules — the precise combinations of facility type, area, and cause — replace broad geographic targeting with evidence-based attribute targeting that is both more efficient and more defensible to management

10. Integrated Findings

10.1 What the Five Analyses Collectively Show

Five analytical techniques were applied to the same 1,695-incident dataset, each addressing a different dimension of the 2026 operational planning problem. Taken individually, each produces a useful output. Taken together, they converge on a single, coherent operational picture that no individual analysis could produce alone.

Text Analytics (Section 5)

The text analytics revealed that incident narrative descriptions cluster into five latent topics. Two are operationally diagnostic:

Topic 2, dominated by the terms person, installed, valve, cut, theft, and flowline, captures the language of deliberate human interference. An unknown person physically installing a bypass valve or cutting a flowline to extract oil
Topic 1, centred on pipe, valve, nipple, and flange, captures a second pattern of hardware manipulation at pipe connections

Neither pattern maps cleanly onto the structured cause field. They emerge only from the unstructured narrative text. Together they confirm that the dominant failure mode across the portfolio is not mechanical degradation but organised, repeated physical interference at flowlines and valve points.

Survival Analysis (Section 8)

The Kaplan-Meier curves and Cox Proportional Hazards model show that the highest-recurrence facilities are concentrated in the SAR operational area:

The 18” Tebidaba/Brass Pipeline records a median of just 2.0 days between consecutive incidents across 131 recorded intervals
The 10” Clough Creek/Tebidaba Pipeline records a median of 3.0 days across 141 intervals
The Cox model identifies Pipeline (Oil) in SAR and Bayelsa as the asset class carrying the highest hazard rate, controlling for all other factors

These are not facilities that occasionally fail. They are facilities under sustained, near-continuous attack.

Association Rules (Section 9)

The highest-lift rule in the dataset confirms the mechanism directly:

{Facility: Pipeline (Oil), Area: LAR, Cleanup: YES} → {Sabotage} with confidence 0.964 and lift 1.305
When a cleanup-requiring incident occurs at an oil pipeline, sabotage classification follows in 96.4% of cases, at a rate 30% above what the overall sabotage base rate would predict
Across all top 20 rules, Pipeline (Oil) appears as the dominant antecedent facility type and cleanup requirement is the single strongest co-occurring operational attribute

Monte Carlo Simulation (Section 6)

The simulation quantifies the financial consequence of this pattern:

P50 annual remediation cost: $17,136,316
P90 conservative reserve: $18,603,017
The gap of approximately $1.5 million between P50 and P90 is the contingency required to cover costs in 9 out of 10 years
The tornado sensitivity analysis identifies per-incident remediation cost as the dominant driver of uncertainty. Better cost capture in Joint Investigation Visit (JIV) reports would directly narrow the P90 band in future planning cycles

Prophet Forecasting (Section 7)

The forecast provides the timing context:

Mean RMSE of 15.41 incidents per month across walk-forward cross-validation folds reflects the highly variable and externally driven nature of pipeline sabotage in the Niger Delta
The seasonal component confirms that incident frequency is not uniform across the year
The six-month forward projection provides the capacity planning baseline for emergency response resourcing in the first half of 2026

Convergence visualisation

The chart below summarises how each technique contributes to the single integrated recommendation.

Five-technique convergence map — each analytical output pointing to the same 2026 operational priority

10.2 The Single Integrated Recommendation

Important

2026 Priority Recommendation: Concentrate integrity inspection, surveillance, and community engagement resources on oil pipelines in the SAR operational area, specifically the Tebidaba/Brass and Clough Creek/Tebidaba corridors, while holding a remediation contingency reserve of $18.6 million in the 2026 annual operating budget.

The convergence across all five techniques is striking in its consistency:

The LDA topic model identifies the language of physical interference at flowlines and valve points as the dominant narrative pattern
The survival analysis identifies the Tebidaba and Clough Creek oil pipelines in SAR as the highest-recurrence assets in the portfolio with median inter-incident times of 2 to 3 days
The association rules establish that oil pipeline incidents requiring cleanup are classified as sabotage in over 96% of cases
The Monte Carlo simulation puts the P90 annual cost of this pattern at $18.6 million
The Prophet forecast confirms the pattern is structural and ongoing rather than declining

These five outputs, derived independently from different analytical frameworks, all point to the same underlying operational reality: a concentrated, organised pattern of third-party interference targeting oil transmission pipelines in the SAR area, particularly the Tebidaba and Clough Creek corridors, that is generating the majority of OERNL’s incident volume, remediation cost exposure, and production deferral risk.

The 2026 operational plan should reflect this convergence in four specific actions:

Table IF1: Integrated 2026 Operational Recommendations — Five-Technique Convergence
Priority	2026 Action	Analytical Basis
1	Elevate the 18" Tebidaba/Brass Pipeline and 10" Clough Creek/Tebidaba Pipeline to mandatory Q1 2026 integrity inspection. Both facilities record median inter-incident times below 3 days across 130-plus repeat intervals	Survival analysis: median time to event = 2.0 days (Tebidaba/Brass, 131 intervals); 3.0 days (Clough Creek, 141 intervals). Cox hazard ratio confirms Pipeline (Oil) in SAR as highest-hazard asset class
2	Hold $18,603,017 as the 2026 remediation contingency reserve in the annual operating budget submission to Finance and the Board. This is the P90 Monte Carlo output	Monte Carlo simulation: P50 = $17,136,316; P90 = $18,603,017. Tornado chart identifies per-incident cost as dominant uncertainty driver
3	Direct security patrol and community engagement resources to SAR-area oil pipelines during peak incident months identified by the Prophet seasonal component. Do not distribute resources uniformly across the portfolio	Association rules: top rule confidence 0.964, lift 1.305 for Pipeline (Oil) and cleanup requirement leading to Sabotage. LDA Topic 2 confirms person-valve-cut-theft as dominant interference narrative
4	Expand JIV cost capture to all sabotage incidents, not only the 51 currently recorded. This directly narrows the Monte Carlo P90 band and improves the defensibility of future budget submissions	Monte Carlo limitation: cost distribution fitted from only 51 cost records. Wider cost coverage reduces P90 uncertainty and strengthens the capital plan

The four actions above are not independent recommendations from four separate analyses. They are a single, integrated operational response to a single, concentrated problem that five different analytical lenses have independently identified and quantified. The Tebidaba and Clough Creek corridors are the highest-cost, highest-frequency, highest-sabotage-risk assets in the OERNL portfolio. The 2026 plan should treat them accordingly.

11. Limitations and Further Work

11.1 Data Limitations

The analyses in this submission were conducted on internally sourced operational data that carries several structural limitations. Each limitation is documented below with its specific impact on the analytical outputs and the remediation action that would address it.

Sparse cost coverage

Of the 1,695 incident records in the consolidated dataset, only 51 contain a populated cost_usd field, predominantly from sabotage records in 2019 and 2021. This means the Monte Carlo simulation in Section 6 was fitted from a Gamma distribution calibrated on fewer than 4% of all incidents.

Note

Example impact: If the 51 recorded cost incidents systematically over-represent large, expensive sabotage events while smaller incidents go uncosted, the fitted Gamma distribution will have a heavier tail than reality — producing a P90 of $18,603,017 that may be biased upward. The true P90 with full cost coverage could be materially lower.

The remaining 96% of incidents have no recorded remediation cost, creating a sparse and potentially unrepresentative fitting sample
The single most impactful data improvement available to the department is a standing requirement for JIV teams to record a cost estimate for every incident, not only major sabotage events
This one change would directly narrow the P90 band and improve the defensibility of all future budget submissions to Finance and the Board

Inconsistent facility categorisation

Table SA2 shows that several of the highest-recurrence facilities, including the top-ranked 24” Ogoda-Brass Pipeline and multiple Tebidaba corridor assets, carry no value in the facility_cat field.

Note

Example impact: The 24” Ogoda-Brass Pipeline ranks first in Table SA2 with a median inter-incident time of 1.0 day across 3 intervals, yet it was excluded from the Cox Proportional Hazards model entirely because its facility category is unpopulated. The model therefore understates the hazard for that asset class.

Facilities without a facility category were excluded from Cox model stratification and the facility category layer of the association rules
The hazard ratios reported in Table SA1 are therefore biased toward the better-documented asset classes, understating the true differential between categorised and uncategorised facilities
A systematic audit of the facility category field across the incident register would immediately improve the precision of both outputs

Cause field granularity

The raw cause field contains free-text entries of highly variable length, from single words like “Corrosion” to full sentences describing the sequence of events.

Note

Example impact: The cause entry “Unknown person installed a hacksaw cut on the 18 inch pipeline” was truncated to “Unknown Person Installed A” for the association rule transactions — losing the operationally critical terms “hacksaw cut” and “18 inch pipeline” that would have produced a more specific and more actionable rule.

Truncating to four words for the association rule transactions reduced cardinality but lost meaningful distinctions between cause sub-types
A structured cause taxonomy with a controlled vocabulary of 15 to 20 canonical cause categories, applied consistently at point of entry, would make both the TF-IDF analysis and the association rules significantly more discriminating
This is a data governance change, not a technical one, and requires no additional tooling to implement

Deferred production data

Deferred production volume was available for only 51 records, the same sparse subset as cost data.

Note

Example impact: A major sabotage event on the 18” Tebidaba/Brass Pipeline that shuts down production for 14 days at 5,000 barrels per day represents approximately $3.5 million in lost revenue at $50 per barrel — more than twice the direct remediation cost of a typical large incident. This exposure is entirely absent from the current Monte Carlo P90 figure.

The Monte Carlo simulation was constrained to remediation cost alone and could not model the full financial exposure including production deferral
Incorporating deferred production barrels multiplied by the prevailing oil price would give a materially more complete picture of the true P90 financial exposure
The Board-level contingency reserve figure should eventually reflect total financial exposure, not remediation cost alone

No external covariates for forecasting

The Prophet model in Section 7 captures average seasonal patterns in incident frequency but cannot distinguish between a seasonally elevated period driven by weather versus one driven by community unrest.

Note

Example impact: The Q3 2020 spike visible in the monthly incident series coincided with both the wet season and the post-COVID resumption of community activity in the Niger Delta. The model treats these as a single seasonal effect, but their operational implications are entirely different. Weather-driven spikes require additional containment equipment. Community-driven spikes require additional security presence and engagement resources.

External covariates including rainfall data from NIMET, community tension indices, and DPR inspection schedules were not available for integration as Prophet regressors
Their absence means the seasonal component blends structurally distinct drivers into a single undifferentiated pattern
Separating these signals would improve both forecast accuracy and the operational specificity of the capacity planning recommendation

11.2 Analytical Limitations

Beyond data quality, four analytical assumptions constrain the interpretation of the outputs.

The bag-of-words assumption in text analytics

The LDA topic model discards word order and grammatical structure entirely.

Note

Example impact: The sentences “unknown person cut the pipe” and “the pipe cut an unknown person” produce identical token counts under the bag-of-words model and would be assigned identical topic probabilities. The first describes deliberate sabotage. The second describes an accidental injury. The model cannot distinguish them.

For incident narratives where the subject-verb-object structure carries the operational meaning, a dependency-parsed or transformer-based model such as a fine-tuned BERT variant would extract substantially richer causal intelligence
With access to a GPU environment, replacing the LDA pipeline with a BERTopic or Named Entity Recognition model trained on oil and gas incident text would be the natural analytical next step

The proportional hazards assumption in survival analysis

The Cox model was tested via Schoenfeld residuals, and the p-values reported in Section 8 indicate potential violation for some covariates.

Note

Example impact: The hazard ratio for Pipeline (Oil) relative to the reference facility type is reported as approximately 2.6 in Table SA1. If this ratio was substantially higher in 2019 to 2021 and lower in 2023 to 2025, the reported single figure understates the historical peak risk and overstates the current risk. A time-varying coefficient model would surface this change explicitly.

Pipeline operating conditions in the Niger Delta changed materially between 2019 and 2025. The 2020 oil price shock, COVID-19 movement restrictions, and subsequent recovery each created structural breaks that a time-constant hazard ratio cannot fully capture
A time-varying coefficient Cox model, or a parametric accelerated failure time model with explicit changepoint detection, would more honestly represent the non-stationarity in the recurrence process

Association rules are descriptive, not causal

The highest-lift rule establishes a strong co-occurrence pattern but does not prove causation.

Note

Example impact: The rule {Facility: Pipeline (Oil), Area: LAR, Cleanup: YES} → {Sabotage} with lift 1.305 could reflect a reporting bias rather than a real pattern. LAR field engineers may simply be more diligent at recording cleanup requirements, which artificially inflates the co-occurrence with sabotage classification. The rule would then be a documentation artefact rather than an operational signal.

Validating the rules against an independent holdout dataset, or against a separate year of incident records not used in model training, would distinguish genuine operational patterns from documentation artefacts
Until validated, the top rules should be treated as priority hypotheses for field investigation rather than confirmed operational facts

The Prophet forecast carries high uncertainty at monthly level

The mean RMSE of 15.41 incidents per month and MAPE of 114.7% reflect the fundamental unpredictability of sabotage-driven incident counts.

Note

Example impact: A forecast of 20 incidents for March 2026 with a 90% confidence interval of 5 to 35 incidents means the actual count could be anywhere from a quiet month to the highest recorded monthly total in the dataset. Sizing emergency response capacity for exactly 20 incidents and actually experiencing 35 would leave the department 75% under-resourced for that month.

Unlike demand forecasting in retail or manufacturing where Prophet was originally designed, pipeline sabotage in the Niger Delta is driven by social, economic, and security dynamics that no time series model can capture from incident frequency alone
The model is useful for annual-level capacity planning and seasonal pattern identification but should not be used to make precise monthly staffing decisions without an explicit uncertainty buffer

11.3 Further Work

The table below summarises what would be done differently with more data, time, or computing power across each of the five techniques.

Table FW1: Further Work — What Would Be Done Differently
Technique	Current Limitation	Proposed Further Work	Resource Required
Text Analytics	LDA bag-of-words loses word order and semantic context in short incident narratives	Replace LDA with BERTopic or fine-tuned BERT Named Entity Recognition on GPU. Extract structured cause entities directly from narrative text	GPU compute environment. Labelled training corpus of 500-plus incident narratives
Monte Carlo Simulation	Cost distribution fitted from only 51 records. 96% of incidents have no cost data	Mandate JIV cost capture for all incidents. Refit Gamma distribution on full portfolio. Extend simulation to include deferred production value	Operational policy change requiring mandatory cost field in incident register from 2026
Advanced Forecasting	MAPE of 114.7% reflects unpredictability of sabotage-driven counts without external covariates	Add external regressors to Prophet — NIMET rainfall, community tension index, crude price — to separate structural from externally triggered spikes	Data sharing agreement with NIMET. Access to Niger Delta Security Tracker
Survival Analysis	Cox model assumes time-constant hazard ratios across a 7-year window with structural breaks	Fit time-varying coefficient Cox model with explicit changepoint detection at 2020 and 2022 structural breaks	No additional data required. Analytical extension of current survival dataset
Association Rules	Rules are descriptive co-occurrences. Causal direction not validated against holdout data	Build gradient-boosted classifier using top rule antecedents as features. Moves from retrospective pattern detection to real-time sabotage risk flagging	12 months of new incident records as holdout validation set

References

Primary Data Source

Olanubi, W. (2026). OERNL pipeline incident register 2019–2025 [Dataset]. Compiled from Joint Investigation Visit reports and the HSE incident management system, Project Services Department, Oando Energy Resources Nigeria Limited, Lagos, Nigeria. Data available on request from the author.

Course Textbook

Adi, B. (2026). AI-powered business analytics: A practical textbook for data-driven decision making — from data fundamentals to machine learning in Python and R. Lagos Business School / markanalytics.online. https://markanalytics.online

Academic and Professional References

Agrawal, R., & Srikant, R. (1994). Fast algorithms for mining association rules in large databases. In Proceedings of the 20th International Conference on Very Large Data Bases (pp. 487–499). Morgan Kaufmann.

Carsey, T. M., & Harden, J. J. (2013). Monte Carlo simulation and resampling methods for social science (1st ed.). SAGE Publications.

Glasserman, P. (2003). Monte Carlo methods in financial engineering (Vol. 53). Springer.

Kleinbaum, D. G., & Klein, M. (2012). Survival analysis: A self-learning text (3rd ed.). Springer. https://doi.org/10.1007/978-1-4419-6646-9

Marr, B. (2017). Key business analytics: The 60+ business analysis tools every manager needs to know. Pearson Education.

Taylor, S. J., & Letham, B. (2018). Forecasting at scale. The American Statistician, 72(1), 37–45. https://doi.org/10.1080/00031305.2017.1380080

Wickham, H., Averick, M., Bryan, J., Chang, W., McGowan, L., François, R., Grolemund, G., Hayes, A., Henry, L., Hester, J., Kuhn, M., Pedersen, T. L., Miller, E., Bache, S. M., Müller, K., Ooms, J., Robinson, D., Seidel, D. P., Spinu, V., … Yutani, H. (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686. https://doi.org/10.21105/joss.01686

Wickham, H. (2016). ggplot2: Elegant graphics for data analysis. Springer. https://doi.org/10.1007/978-3-319-24277-4

R Software and Package Citations

R Core Team. (2024). R: A language and environment for statistical computing (Version 4.4). R Foundation for Statistical Computing. https://www.R-project.org/

Allaire, J. J., Teague, C., Scheidegger, C., Xie, Y., & Dervieux, C. (2022). Quarto (Version 1.5) [Computer software]. https://doi.org/10.5281/zenodo.5960048

Bürkner, P. C., Gabry, J., Kay, M., & Vehtari, A. (2022). posterior: Tools for working with posterior distributions (R package). https://mc-stan.org/posterior/

Firke, S. (2023). janitor: Simple tools for examining and cleaning dirty data (R package version 2.2.0). https://CRAN.R-project.org/package=janitor

Grolemund, G., & Wickham, H. (2011). Dates and times made easy with lubridate. Journal of Statistical Software, 40(3), 1–25. https://doi.org/10.18637/jss.v040.i03

Grun, B., & Hornik, K. (2011). topicmodels: An R package for fitting topic models. Journal of Statistical Software, 40(13), 1–30. https://doi.org/10.18637/jss.v040.i13

Hahsler, M., Grün, B., & Hornik, K. (2005). arules — A computational environment for mining association rules and frequent item sets. Journal of Statistical Software, 14(15), 1–25. https://doi.org/10.18637/jss.v014.i15

Hahsler, M. (2017). arulesViz: Interactive visualization of association rules with R. R Journal, 9(2), 163–175. https://doi.org/10.32614/RJ-2017-047

Kassambara, A., & Kosinski, M. (2021). survminer: Drawing survival curves using ggplot2 (R package version 0.4.9). https://CRAN.R-project.org/package=survminer

Müller, K., & Wickham, H. (2023). tibble: Simple data frames (R package version 3.2.1). https://CRAN.R-project.org/package=tibble

Pedersen, T. L. (2022). patchwork: The composer of plots (R package version 1.1.2). https://CRAN.R-project.org/package=patchwork

Sievert, C. (2020). Interactive web-based data visualization with R, plotly, and shiny. Chapman and Hall/CRC. https://plotly-r.com

Silge, J., & Robinson, D. (2016). tidytext: Text mining and analysis using tidy data principles in R. Journal of Open Source Software, 1(3), 37. https://doi.org/10.21105/joss.00037

Spinu, V., Grolemund, G., & Wickham, H. (2023). lubridate: Make dealing with dates a little easier (R package version 1.9.3). https://CRAN.R-project.org/package=lubridate

Taylor, S. J., & Letham, B. (2021). prophet: Automatic forecasting procedure (R package version 1.0). https://CRAN.R-project.org/package=prophet

Therneau, T. M. (2023). A package for survival analysis in R (R package version 3.5-7). https://CRAN.R-project.org/package=survival

Therneau, T. M., & Grambsch, P. M. (2000). Modeling survival data: Extending the Cox model. Springer. https://doi.org/10.1007/978-1-4757-3294-8

Wickham, H. (2022). stringr: Simple, consistent wrappers for common string operations (R package version 1.5.0). https://CRAN.R-project.org/package=stringr

Wickham, H., & Bryan, J. (2023). readxl: Read Excel files (R package version 1.4.3). https://CRAN.R-project.org/package=readxl

Wickham, H., François, R., Henry, L., Müller, K., & Vaughan, D. (2023). dplyr: A grammar of data manipulation (R package version 1.1.4). https://CRAN.R-project.org/package=dplyr

Wickham, H., & Henry, L. (2023). purrr: Functional programming tools (R package version 1.0.2). https://CRAN.R-project.org/package=purrr

Wickham, H., Vaughan, D., & Girlich, M. (2023). tidyr: Tidy messy data (R package version 1.3.0). https://CRAN.R-project.org/package=tidyr

Zeileis, A., & Grothendieck, G. (2005). zoo: S3 infrastructure for regular and irregular time series. Journal of Statistical Software, 14(6), 1–27. https://doi.org/10.18637/jss.v014.i06

Python Package Citations

Harris, C. R., Millman, K. J., van der Walt, S. J., Gommers, R., Virtanen, P., Cournapeau, D., Wieser, E., Taylor, J., Berg, S., Smith, N. J., Kern, R., Picus, M., Hoyer, S., van Kerkwijk, M. H., Brett, M., Haldane, A., del Río, J. F., Wiebe, M., Peterson, P., … Oliphant, T. E. (2020). Array programming with NumPy. Nature, 585, 357–362. https://doi.org/10.1038/s41586-020-2649-2

Hunter, J. D. (2007). Matplotlib: A 2D graphics environment. Computing in Science & Engineering, 9(3), 90–95. https://doi.org/10.1109/MCSE.2007.55

McKinney, W. (2010). Data structures for statistical computing in Python. In Proceedings of the 9th Python in Science Conference (pp. 56–61). https://doi.org/10.25080/Majora-92bf1922-00a

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., & Duchesnay, É. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12, 2825–2830.

Rasband, D. (2021). vaderSentiment: VADER sentiment analysis (Python package version 3.3.2). https://github.com/cjhutto/vaderSentiment

Seabold, S., & Perktold, J. (2010). Statsmodels: Econometric and statistical modeling with Python. In Proceedings of the 9th Python in Science Conference (pp. 92–96). https://doi.org/10.25080/Majora-92bf1922-011

Taylor, S. J., & Letham, B. (2021). Prophet: Forecasting at scale (Python package version 1.1). https://facebook.github.io/prophet/

Van Rossum, G., & Drake, F. L. (2009). Python 3 reference manual. CreateSpace.

Virtanen, P., Gommers, R., Oliphant, T. E., Haberland, M., Reddy, T., Cournapeau, D., Burovski, E., Peterson, P., Weckesser, W., Bright, J., van der Walt, S. J., Brett, M., Wilson, J., Millman, K. J., Mayorov, N., Nelson, A. R. J., Jones, E., Kern, R., Larson, E., … SciPy 1.0 Contributors. (2020). SciPy 1.0: Fundamental algorithms for scientific computing in Python. Nature Methods, 17, 261–272. https://doi.org/10.1038/s41592-019-0686-2

Waskom, M. (2021). Seaborn: Statistical data visualization. Journal of Open Source Software, 6(60), 3021. https://doi.org/10.21105/joss.03021

Appendix: AI Usage Statement

Claude (Anthropic, claude.ai) was used as an AI coding assistant throughout this submission. Specifically, Claude assisted with the generation of R and Python code chunks for data loading, cleaning, and transformation in the setup chunk; the construction of ggplot2 and matplotlib visualisations; the implementation of the LDA topic model, Monte Carlo simulation, Prophet forecasting, Kaplan-Meier and Cox Proportional Hazards survival analysis, and Apriori association rule mining pipelines; and the debugging of rendering errors encountered during Quarto document compilation. The document structure, YAML configuration, and inline prose were also developed with Claude’s assistance as a drafting and formatting tool.

Independent analytical judgement was exercised in the following areas:

Case study selection: Case Study 3 was selected as the most appropriate case for my professional context, given direct access to seven years of internal pipeline incident data from OERNL’s HSE management system
Central analytical question: The framing of the research question and the mapping of each technique to a specific 2026 operational decision were determined independently, based on knowledge of the planning decisions facing the Project Services Department heading into 2026
Model output interpretation: All model outputs were interpreted in the context of OERNL’s operational reality, including the identification of the Tebidaba and Clough Creek corridors as the priority intervention assets and the translation of the P90 Monte Carlo figure into a specific Board-level budget recommendation
LDA topic naming: The five LDA topics were named and operationally interpreted based on domain knowledge of Niger Delta pipeline operations, not on algorithmic output alone. The algorithm surfaces word clusters; the operational meaning of those clusters was determined by the author
Data quality identification: The identification of sparse cost coverage, inconsistent facility categorisation, and cause field granularity as the primary data limitations was based on direct familiarity with the source system and its known recording practices
Integrated recommendation: The recommendation in Section 10 draws on seven years of direct professional experience in the Project Services Department to connect the analytical outputs to live planning decisions facing the organisation in 2026

All analytical decisions, business interpretations, and operational conclusions are my own.

--- title: "Pipeline Incident Analytics: Operational Risk, Cost Exposure and Predictive Patterns at a Nigerian Upstream Operator (2019-2025)" author: "Olawale Unimke Olanubi" date: today format: html: theme: flatly css: custom.css toc: true toc-location: left toc-depth: 3 toc-title: "Contents" page-layout: article grid: body-width: 1000px code-fold: true code-tools: true self-contained: true fig-width: 8 fig-height: 5 fig-cap-location: bottom include-in-header: text: | <link rel="preconnect" href="https://fonts.googleapis.com"> <link rel="preconnect" href="https://fonts.gstatic.com" crossorigin> <link href="https://fonts.googleapis.com/css2?family=Inter:wght@300;400;500;600;700&family=Playfair+Display:wght@600;700&display=swap" rel="stylesheet"> <link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/6.5.0/css/all.min.css"> execute: warning: false message: false echo: true cache: true --- ```{r r-setup, include=FALSE} library(tidyverse); library(readxl); library(janitor); library(lubridate) library(here); library(scales); library(plotly); library(kableExtra) library(reticulate) use_condaenv("lbs-da2", required = TRUE) # ── Helper functions ───────────────────────────────────────────── parse_any_date <- function(x) { result <- suppressWarnings(as.Date(as.numeric(x), origin = "1899-12-30")) failed <- is.na(result) if (any(failed)) { text_parsed <- suppressWarnings( lubridate::parse_date_time(x[failed], orders = c("dmy", "ymd", "mdy", "dmY", "Ymd"), quiet = TRUE) %>% as.Date() ) result[failed] <- text_parsed } result } read_sheet <- function(path, sheet) { read_excel(path, sheet = sheet, col_types = "text") %>% janitor::remove_empty(c("rows", "cols")) %>% janitor::clean_names() } # ── Load and consolidate all 7 Excel files ─────────────────────── files <- list.files("data", pattern = "\\.xlsx$", full.names = TRUE) all_sheets <- purrr::map_dfr(files, function(p) { yr <- as.integer(stringr::str_extract(basename(p), "\\d{4}")) purrr::map_dfr(excel_sheets(p), function(sh) { df <- tryCatch(read_sheet(p, sh), error = function(e) NULL) if (is.null(df) || nrow(df) == 0) return(NULL) df %>% mutate(source_year = yr, source_sheet = stringr::str_squish(sh)) }) }) # ── Clean and standardise ──────────────────────────────────────── incidents <- all_sheets %>% transmute( year = source_year, sheet = source_sheet, month_name = months, ref_no = coalesce(incident_oil_spill_ref_no, incident_ref_no, incident), facility = coalesce(facility_equipment, facility), facility_cat = facility_category, location = location, closeby_facility = closeby_facility, area = area, lga = lga, state = stringr::str_remove(state, " State$"), lat_raw = lat_n, long_raw = long_e, date_incident = coalesce(date_of_incident_spill, date_of_incident), date_observed = date_of_incident_spill_observed, cause = coalesce(incident_spill_cause, incident_cause, nature_of_damage_breach), category = coalesce(incident_spill_category, incident_category), description = coalesce(description_of_incident_spill_causes, description_of_incident_cause_s), qty_spilled_bbl = suppressWarnings(as.numeric(coalesce(est_qty_spilled_bbls, spilled_quantity_bbls))), qty_recovered_bbl= suppressWarnings(as.numeric(est_qty_recovered_bbls)), cost_usd = suppressWarnings(as.numeric(cost_of_repair_estimate_in_usd)), deferred_bbl = suppressWarnings(as.numeric(deferred_production_bbls)), jiv_date = parse_any_date(jiv_date), cleanup_required = toupper(stringr::str_squish(is_cleanup_required)) ) %>% mutate( date_incident = parse_any_date(date_incident), date_observed = parse_any_date(date_observed), detect_lag_days = as.integer(date_observed - date_incident), jiv_lag_days = as.integer(jiv_date - date_incident), month_num = lubridate::month(date_incident), year_month = floor_date(date_incident, "month"), category = case_when( stringr::str_detect(category, "(?i)operational|operation control") ~ "Operational Control", stringr::str_detect(category, "(?i)valve|installed|unknown person|welded|pipe connection") ~ "Sabotage", stringr::str_detect(category, "(?i)tbd") ~ "TBD", TRUE ~ category ), cause = stringr::str_to_title(stringr::str_squish(cause)) ) %>% filter(!is.na(date_incident)) # ── Icon helper for Section 2.2 ────────────────────────────────── icon <- function(path_d, color = "#2c7fb8", size = "1em") { glue::glue( '<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 512 512" ', 'style="height:{size};width:{size};vertical-align:-0.15em;margin-right:5px;fill:{color};">', '<path d="{path_d}"/>', '</svg>' ) } ic_search <- "M416 208c0 45.9-14.9 88.3-40 122.7L502.6 457.4c12.5 12.5 12.5 32.8 0 45.3s-32.8 12.5-45.3 0L330.7 376c-34.4 25.2-76.8 40-122.7 40C93.1 416 0 322.9 0 208S93.1 0 208 0S416 93.1 416 208zM208 352a144 144 0 1 0 0-288 144 144 0 1 0 0 288z" ic_cluster <- "M-1 224C-1 100.3 100.3-1 224-1s225 101.3 225 225-101.3 225-225 225S-1 347.7-1 224zm225 97a97 97 0 1 0 0-194 97 97 0 1 0 0 194z" ic_lightbulb <- "M297.2 248.9C311.6 228.3 320 203.2 320 176c0-70.7-57.3-128-128-128S64 105.3 64 176c0 27.2 8.4 52.3 22.8 72.9c3.7 5.3 8.1 11.3 12.8 17.7c0 0 0 0 0 0c12.9 17.7 28.3 38.9 32.6 62.7c-4-1.1-8.5-1.7-13.2-1.7H112c-8.8 0-16 7.2-16 16s7.2 16 16 16h16v16c0 8.8 7.2 16 16 16h64c8.8 0 16-7.2 16-16V359.2h16c8.8 0 16-7.2 16-16s-7.2-16-16-16h-7c4.3-23.8 19.7-45 32.6-62.7c0 0 0 0 0 0c4.7-6.4 9.1-12.4 12.8-17.7z" ic_database <- "M448 80v48c0 44.2-100.3 80-224 80S0 172.2 0 128V80C0 35.8 100.3 0 224 0S448 35.8 448 80zM393.2 214.7c20.8-7.4 39.9-16.9 54.8-28.6V288c0 44.2-100.3 80-224 80S0 332.2 0 288V186.1c14.9 11.8 34 21.2 54.8 28.6C99.7 230.7 159.5 240 224 240s124.3-9.3 169.2-25.3z" ic_chartbar <- "M32 32c17.7 0 32 14.3 32 32V400c0 8.8 7.2 16 16 16H480c17.7 0 32 14.3 32 32s-14.3 32-32 32H80c-44.2 0-80-35.8-80-80V64C0 46.3 14.3 32 32 32zm96 96c0-17.7 14.3-32 32-32l192 0c17.7 0 32 14.3 32 32s-14.3 32-32 32l-192 0c-17.7 0-32-14.3-32-32zm32 64H288c17.7 0 32 14.3 32 32s-14.3 32-32 32H160c-17.7 0-32-14.3-32-32s14.3-32 32-32zm0 96H352c17.7 0 32 14.3 32 32s-14.3 32-32 32H160c-17.7 0-32-14.3-32-32s14.3-32 32-32z" ic_shield <- "M256 0c4.6 0 9.2 1 13.4 2.9L457.7 82.8c22 9.3 38.4 31 38.3 57.2c-.5 99.2-41.3 280.7-213.7 363.2c-16.7 8-36.1 8-52.8 0C57.3 420.7 16.5 239.2 16 140c-.1-26.2 16.3-47.9 38.3-57.2L242.7 2.9C246.8 1 251.4 0 256 0z" ic_wave <- "M32 256c0-17.7 14.3-32 32-32c9.2 0 17.5 3.9 23.4 10.1L176 352.7 265.7 174.1c5.7-11.4 17.3-18.6 30.1-18.6s24.4 7.1 30.1 18.6l64.6 129.1 39.3-59c5.9-8.9 15.9-14.2 26.6-14.2H480c17.7 0 32 14.3 32 32s-14.3 32-32 32H472.6L405.1 397c-5.9 8.8-15.8 14-26.4 14s-20.5-5.2-26.3-14.1l-66.5-99.8-55.6 111.2c-5.7 11.4-17.3 18.6-30.1 18.6s-24.4-7.1-30.1-18.6L104.4 288H64c-17.7 0-32-14.3-32-32z" ic_calendar <- "M128 0c17.7 0 32 14.3 32 32V64H288V32c0-17.7 14.3-32 32-32s32 14.3 32 32V64h48c26.5 0 48 21.5 48 48v48H0V112C0 85.5 21.5 64 48 64H96V32c0-17.7 14.3-32 32-32zM0 192H448V464c0 26.5-21.5 48-48 48H48c-26.5 0-48-21.5-48-48V192zm296 72c-9.7 0-18.5 5.8-22.2 14.8s-1.7 19.3 5.1 26.1l32 32-32 32c-6.8 6.8-8.8 17-5.1 26.1s12.5 14.8 22.2 14.8h64c17.7 0 32-14.3 32-32s-14.3-32-32-32l0 0 13.7-13.7c6.8-6.8 8.8-17 5.1-26.1S373.7 264 364 264H296z" ic_check <- "M256 512A256 256 0 1 0 256 0a256 256 0 1 0 0 512zM369 209L241 337c-9.4 9.4-24.6 9.4-33.9 0l-64-64c-9.4-9.4-9.4-24.6 0-33.9s24.6-9.4 33.9 0l47 47L335 175c9.4-9.4 24.6-9.4 33.9 0s9.4 24.6 0 33.9z" ic_listol <- "M24 56c0-13.3 10.7-24 24-24H232c13.3 0 24 10.7 24 24s-10.7 24-24 24H48c-13.3 0-24-10.7-24-24zm0 160c0-13.3 10.7-24 24-24H232c13.3 0 24 10.7 24 24s-10.7 24-24 24H48c-13.3 0-24-10.7-24-24zm0 160c0-13.3 10.7-24 24-24H232c13.3 0 24 10.7 24 24s-10.7 24-24 24H48c-13.3 0-24-10.7-24-24zM356.9 3.5c6.1-3.9 13.8-4.2 20.2-.8l72 40c11.9 6.6 16.2 21.6 9.6 33.5s-21.6 16.2-33.5 9.6L400 71.4V144c0 13.3-10.7 24-24 24s-24-10.7-24-24V32c0-8.7 4.7-16.7 12.9-28.5zM320 256c0-13.3 10.7-24 24-24h80c13.3 0 24 10.7 24 24s-10.7 24-24 24h-80c-13.3 0-24-10.7-24-24z" ic_contract <- "M64 0C28.7 0 0 28.7 0 64V448c0 35.3 28.7 64 64 64H320c35.3 0 64-28.7 64-64V160H256c-17.7 0-32-14.3-32-32V0H64zM256 0V128H384L256 0zM112 256H288c8.8 0 16 7.2 16 16s-7.2 16-16 16H112c-8.8 0-16-7.2-16-16s7.2-16 16-16zm0 64H288c8.8 0 16 7.2 16 16s-7.2 16-16 16H112c-8.8 0-16-7.2-16-16s7.2-16 16-16zm0 64H288c8.8 0 16 7.2 16 16s-7.2 16-16 16H112c-8.8 0-16-7.2-16-16s7.2-16 16-16z" ic_layers <- "M232.5 5.7c13.9-7.6 30.1-7.6 44 0l208 114.1c14.2 7.8 23 22.9 23 39.2s-8.8 31.4-23 39.2L276.5 312.3c-13.9 7.6-30.1 7.6-44 0L24.5 198.2C10.3 190.4 1.5 175.3 1.5 159s8.8-31.4 23-39.2L232.5 5.7zM1.5 316.8l208 114.1c13.9 7.6 30.1 7.6 44 0l208-114.1c14.2-7.8 23-22.9 23-39.2V352c0 16.3-8.8 31.4-23 39.2L277.5 505.3c-13.9 7.6-30.1 7.6-44 0L25.5 391.2C11.3 383.4 2.5 368.3 2.5 352V277.6c0 16.3 8.8 31.4 23 39.2z" ic_brain <- "M184 0c30.9 0 56 25.1 56 56V456c0 30.9-25.1 56-56 56c-28.9 0-52.7-21.9-55.7-50.1C74.1 447.1 48 415.8 48 378.4V352c-26.5 0-48-21.5-48-48v-64c0-26.5 21.5-48 48-48V160c0-26.5 21.5-48 48-48h8V56c0-30.9 25.1-56 56-56zM248 56V464c0 30.9 25.1 56 56 56c28.9 0 52.7-21.9 55.7-50.1C393.9 463.1 416 431.8 416 394.4V352c26.5 0 48-21.5 48-48V240c0-26.5-21.5-48-48-48V160c0-26.5-21.5-48-48-48h-8V56c0-30.9-25.1-56-56-56z" ic_wrench <- "M352 320c88.4 0 160-71.6 160-160c0-15.3-2.2-30.1-6.2-44.2c-3.1-10.8-16.4-13.2-24.3-5.3l-76.8 76.8c-3 3-7.1 4.7-11.3 4.7H336c-8.8 0-16-7.2-16-16V118.6c0-4.2 1.7-8.3 4.7-11.3l76.8-76.8c7.9-7.9 5.4-21.2-5.3-24.3C382.1 2.2 367.3 0 352 0C263.6 0 192 71.6 192 160c0 19.1 3.4 37.5 9.5 54.5L19.9 396.1C7.2 408.8 0 426.1 0 444.1C0 481.6 30.4 512 67.9 512c18 0 35.3-7.2 48-19.9L297.5 310.5c17 6.1 35.4 9.5 54.5 9.5z" ic_clipboard <- "M192 0c-41.8 0-77.4 26.7-90.5 64H64C28.7 64 0 92.7 0 128V448c0 35.3 28.7 64 64 64H320c35.3 0 64-28.7 64-64V128c0-35.3-28.7-64-64-64H282.5C269.4 26.7 233.8 0 192 0zm0 64a32 32 0 1 1 0 64 32 32 0 1 1 0-64zM128 256a64 64 0 1 1 128 0 64 64 0 1 1 -128 0zm-24 88h144c13.3 0 24 10.7 24 24s-10.7 24-24 24H104c-13.3 0-24-10.7-24-24s10.7-24 24-24z" ``` # 1. Executive Summary Sabotage accounts for 74% of pipeline incidents at Oando Energy Resources Nigeria Limited, yet the financial and operational consequences have never been systematically quantified. This submission applies five analytical techniques to 1,695 incidents recorded across 446 facilities in Nigeria's Niger Delta between 2019 and 2025, drawn from OERNL's internal HSE incident register and Joint Investigation Visit reports, to answer one question: what should the Project Services Department prioritise in 2026? Key findings across the five techniques: - `r icon(ic_search)` **Text analytics** identified deliberate physical interference at flowlines and valve connections as the dominant latent narrative theme, invisible in the structured cause field - `r icon(ic_shield)` **Monte Carlo simulation** quantified the P90 annual remediation exposure at **$18,603,017** - `r icon(ic_calendar)` **Forecasting** projected monthly incident volumes with a cross-validation RMSE of 15.41 incidents per month, providing the 2026 emergency response capacity baseline - `r icon(ic_listol)` **Survival analysis** identified the Tebidaba/Brass and Clough Creek/Tebidaba pipelines as highest-recurrence assets, with median inter-incident times of 2 and 3 days across 130-plus intervals each - `r icon(ic_brain)` **Association rules** confirmed oil pipeline incidents requiring cleanup predict sabotage with 96.4% confidence and lift of 1.305 ::: {.callout-important icon=false} **Recommendation:** Concentrate 2026 integrity inspection, surveillance, and community engagement on SAR-area oil pipelines, with priority on the Tebidaba and Clough Creek corridors, and hold **$18.6 million** as the Board-level remediation contingency reserve. ::: # 2. Professional Disclosure ## 2.1 My Role and Organisation I am a Project Services Manager in the Project Services Department (PSD), which sits within the Project Development and Infrastructure (PDI) division of Oando Energy Resources Nigeria Limited (OERNL), a Nigerian upstream oil and gas company operating across exploration, development, and production assets in the Niger Delta. ```{r org-chart, echo=FALSE, fig.cap="Organisational structure of the Project Development and Infrastructure division", fig.width=20, fig.height=13} #| message: false #| warning: false library(ggplot2) # ── Y positions ─────────────────────────────────────────────────── y_title <- 12.8 y_sub_ttl <- 12.2 y_gm <- 11.4 y_bus1 <- 10.6 y_div <- 9.8 y_bus2 <- 8.9 y_sub1 <- 8.1 y_grid_t <- 7.0 y_grid_b <- 5.9 y_legend <- 4.6 # ── X positions ─────────────────────────────────────────────────── x_opd <- 3.5 x_ps <- 9.0 x_cr <- 13.5 x_ed <- 17.5 x_sar <- 2.2 x_lar <- 4.8 x_mid_sl <- (x_sar + x_lar) / 2 # = 3.5 x_g1 <- x_mid_sl - 1.2 x_g2 <- x_mid_sl + 1.2 x_g3 <- x_mid_sl - 1.2 x_g4 <- x_mid_sl + 1.2 x_hse <- 9.0 x_noa <- 17.5 # Title centred on GM box x = x_ps = 9.0 x_title <- x_ps # Box sizes w_gm <- 3.8; h_gm <- 0.95 w_div <- 3.0; h_div <- 0.90 w_ps <- 3.6; h_ps <- 1.20 w_sub <- 2.1; h_sub <- 0.78 w_func <- 1.90; h_func <- 0.78 w_hse <- 2.5; h_hse <- 0.78 w_noa <- 2.7; h_noa <- 0.78 p <- ggplot() + # ── Title centred on GM x ────────────────────────────────────── annotate("text", x = x_title, y = y_title, label = "OANDO ENERGY RESOURCES NIGERIA LIMITED", colour = "#1F3864", size = 8.0, fontface = "bold", hjust = 0.5) + annotate("text", x = x_title, y = y_sub_ttl, label = "Project Development & Infrastructure - Departmental Structure", colour = "#555555", size = 5.5, hjust = 0.5) + # ── GM box ───────────────────────────────────────────────────── annotate("rect", xmin = x_ps - w_gm/2, xmax = x_ps + w_gm/2, ymin = y_gm - h_gm/2, ymax = y_gm + h_gm/2, fill = "#1F3864", colour = "#2c7fb8", linewidth = 1.3) + annotate("text", x = x_ps, y = y_gm, label = "General Manager\nProjects & Infrastructure", colour = "white", size = 6.0, fontface = "bold", lineheight = 0.95, hjust = 0.5, vjust = 0.5) + # ── GM to bus ────────────────────────────────────────────────── annotate("segment", x = x_ps, xend = x_ps, y = y_gm - h_gm/2, yend = y_bus1, colour = "#888888", linewidth = 0.9) + annotate("segment", x = x_opd, xend = x_ed, y = y_bus1, yend = y_bus1, colour = "#888888", linewidth = 0.9) + # ── Bus drop lines ───────────────────────────────────────────── annotate("segment", x = x_opd, xend = x_opd, y = y_bus1, yend = y_div + h_div/2, colour = "#888888", linewidth = 0.9, arrow = arrow(length = unit(0.20,"cm"), type = "closed")) + annotate("segment", x = x_ps, xend = x_ps, y = y_bus1, yend = y_div + h_ps/2, colour = "#2c7fb8", linewidth = 1.4, arrow = arrow(length = unit(0.20,"cm"), type = "closed")) + annotate("segment", x = x_cr, xend = x_cr, y = y_bus1, yend = y_div + h_div/2, colour = "#888888", linewidth = 0.9, arrow = arrow(length = unit(0.20,"cm"), type = "closed")) + annotate("segment", x = x_ed, xend = x_ed, y = y_bus1, yend = y_div + h_div/2, colour = "#888888", linewidth = 0.9, arrow = arrow(length = unit(0.20,"cm"), type = "closed")) + # ── OPD box ──────────────────────────────────────────────────── annotate("rect", xmin = x_opd - w_div/2, xmax = x_opd + w_div/2, ymin = y_div - h_div/2, ymax = y_div + h_div/2, fill = "#2c3e6a", colour = "#555577", linewidth = 1.0) + annotate("text", x = x_opd, y = y_div, label = "Onshore Projects\nDivision", colour = "white", size = 5.5, fontface = "bold", lineheight = 0.95, hjust = 0.5, vjust = 0.5) + # ── PS box ───────────────────────────────────────────────────── annotate("rect", xmin = x_ps - w_ps/2, xmax = x_ps + w_ps/2, ymin = y_div - h_ps/2, ymax = y_div + h_ps/2, fill = "#1a5276", colour = "#2c7fb8", linewidth = 1.8) + annotate("text", x = x_ps, y = y_div + 0.24, label = "Project Services", colour = "white", size = 6.0, fontface = "bold", hjust = 0.5, vjust = 0.5) + annotate("text", x = x_ps, y = y_div - 0.28, label = "Project Control | Cost Control\nScheduling | Reporting", colour = "#aaccee", size = 4.5, lineheight = 0.9, hjust = 0.5, vjust = 0.5) + # ── CR box ───────────────────────────────────────────────────── annotate("rect", xmin = x_cr - w_div/2, xmax = x_cr + w_div/2, ymin = y_div - h_div/2, ymax = y_div + h_div/2, fill = "#2c3e6a", colour = "#555577", linewidth = 1.0) + annotate("text", x = x_cr, y = y_div, label = "Construction &\nRevamping", colour = "white", size = 5.5, fontface = "bold", lineheight = 0.95, hjust = 0.5, vjust = 0.5) + # ── ED box ───────────────────────────────────────────────────── annotate("rect", xmin = x_ed - w_div/2, xmax = x_ed + w_div/2, ymin = y_div - h_div/2, ymax = y_div + h_div/2, fill = "#2c3e6a", colour = "#555577", linewidth = 1.0) + annotate("text", x = x_ed, y = y_div, label = "Engineering\nDivision", colour = "white", size = 5.5, fontface = "bold", lineheight = 0.95, hjust = 0.5, vjust = 0.5) + # ── PS cross-functional dashed arrows ────────────────────────── annotate("segment", x = x_ps - w_ps/2, xend = x_opd + w_div/2, y = y_div, yend = y_div, colour = "#e67e22", linewidth = 1.3, linetype = "dashed", arrow = arrow(length = unit(0.17,"cm"), ends = "both", type = "open")) + annotate("text", x = (x_ps - w_ps/2 + x_opd + w_div/2) / 2, y = y_div + 0.52, label = "Cross-Functional Hub", colour = "#e67e22", size = 4.0, fontface = "italic") + annotate("segment", x = x_ps + w_ps/2, xend = x_cr - w_div/2, y = y_div, yend = y_div, colour = "#e67e22", linewidth = 1.3, linetype = "dashed", arrow = arrow(length = unit(0.17,"cm"), ends = "both", type = "open")) + annotate("segment", x = x_cr + w_div/2, xend = x_ed - w_div/2, y = y_div, yend = y_div, colour = "#e67e22", linewidth = 1.3, linetype = "dashed", arrow = arrow(length = unit(0.17,"cm"), ends = "both", type = "open")) + # ── PS dotted to HSE ─────────────────────────────────────────── annotate("segment", x = x_ps, xend = x_ps, y = y_div - h_ps/2, yend = y_sub1 + h_hse/2, colour = "#e67e22", linewidth = 1.2, linetype = "dotted", arrow = arrow(length = unit(0.16,"cm"), type = "closed")) + # ── HSE box ──────────────────────────────────────────────────── annotate("rect", xmin = x_hse - w_hse/2, xmax = x_hse + w_hse/2, ymin = y_sub1 - h_hse/2, ymax = y_sub1 + h_hse/2, fill = "#1a5276", colour = "#2c7fb8", linewidth = 1.0) + annotate("text", x = x_hse, y = y_sub1, label = "HSE Support", colour = "white", size = 5.2, fontface = "bold", hjust = 0.5, vjust = 0.5) + # ── Dotted HSE bottom to NOA bottom — thick and visible ──────── annotate("segment", x = x_hse, xend = x_hse, y = y_sub1 - h_hse/2, yend = y_sub1 - 1.1, colour = "#e67e22", linewidth = 1.3, linetype = "dotted") + annotate("segment", x = x_hse, xend = x_noa, y = y_sub1 - 1.1, yend = y_sub1 - 1.1, colour = "#e67e22", linewidth = 1.3, linetype = "dotted") + annotate("segment", x = x_noa, xend = x_noa, y = y_sub1 - 1.1, yend = y_sub1 - h_noa/2, colour = "#e67e22", linewidth = 1.3, linetype = "dotted", arrow = arrow(length = unit(0.16,"cm"), type = "open")) + # ── ED dashed to NOA ─────────────────────────────────────────── annotate("segment", x = x_ed, xend = x_ed, y = y_div - h_div/2, yend = y_sub1 + h_noa/2, colour = "#888888", linewidth = 1.0, linetype = "dashed", arrow = arrow(length = unit(0.16,"cm"), type = "closed")) + # ── NOA box ──────────────────────────────────────────────────── annotate("rect", xmin = x_noa - w_noa/2, xmax = x_noa + w_noa/2, ymin = y_sub1 - h_noa/2, ymax = y_sub1 + h_noa/2, fill = "#2c3e6a", colour = "#555577", linewidth = 1.0) + annotate("text", x = x_noa, y = y_sub1, label = "Non-Operated Asset", colour = "white", size = 5.0, fontface = "bold", hjust = 0.5, vjust = 0.5) + # ── OPD to SAR and LAR ───────────────────────────────────────── annotate("segment", x = x_opd, xend = x_opd, y = y_div - h_div/2, yend = y_bus2, colour = "#888888", linewidth = 0.9) + annotate("segment", x = x_sar, xend = x_lar, y = y_bus2, yend = y_bus2, colour = "#888888", linewidth = 0.9) + annotate("segment", x = x_sar, xend = x_sar, y = y_bus2, yend = y_sub1 + h_sub/2, colour = "#888888", linewidth = 0.9, arrow = arrow(length = unit(0.16,"cm"), type = "closed")) + annotate("segment", x = x_lar, xend = x_lar, y = y_bus2, yend = y_sub1 + h_sub/2, colour = "#888888", linewidth = 0.9, arrow = arrow(length = unit(0.16,"cm"), type = "closed")) + # ── SAR box ──────────────────────────────────────────────────── annotate("rect", xmin = x_sar - w_sub/2, xmax = x_sar + w_sub/2, ymin = y_sub1 - h_sub/2, ymax = y_sub1 + h_sub/2, fill = "#3a4a6a", colour = "#666688", linewidth = 0.8) + annotate("text", x = x_sar, y = y_sub1, label = "Projects (SAR)", colour = "white", size = 5.0, fontface = "bold", hjust = 0.5, vjust = 0.5) + # ── LAR box ──────────────────────────────────────────────────── annotate("rect", xmin = x_lar - w_sub/2, xmax = x_lar + w_sub/2, ymin = y_sub1 - h_sub/2, ymax = y_sub1 + h_sub/2, fill = "#3a4a6a", colour = "#666688", linewidth = 0.8) + annotate("text", x = x_lar, y = y_sub1, label = "Projects (LAR)", colour = "white", size = 5.0, fontface = "bold", hjust = 0.5, vjust = 0.5) + # ── 2x2 grid centred on midpoint of SAR and LAR ──────────────── annotate("rect", xmin = x_g1 - w_func/2, xmax = x_g1 + w_func/2, ymin = y_grid_t - h_func/2, ymax = y_grid_t + h_func/2, fill = "#4a5a7a", colour = "#778899", linewidth = 0.7) + annotate("text", x = x_g1, y = y_grid_t, label = "Project\nDevelopment", colour = "white", size = 4.2, fontface = "bold", lineheight = 0.9, hjust = 0.5, vjust = 0.5) + annotate("rect", xmin = x_g2 - w_func/2, xmax = x_g2 + w_func/2, ymin = y_grid_t - h_func/2, ymax = y_grid_t + h_func/2, fill = "#4a5a7a", colour = "#778899", linewidth = 0.7) + annotate("text", x = x_g2, y = y_grid_t, label = "Project\nManagement", colour = "white", size = 4.2, fontface = "bold", lineheight = 0.9, hjust = 0.5, vjust = 0.5) + annotate("rect", xmin = x_g3 - w_func/2, xmax = x_g3 + w_func/2, ymin = y_grid_b - h_func/2, ymax = y_grid_b + h_func/2, fill = "#4a5a7a", colour = "#778899", linewidth = 0.7) + annotate("text", x = x_g3, y = y_grid_b, label = "Project\nExecution", colour = "white", size = 4.2, fontface = "bold", lineheight = 0.9, hjust = 0.5, vjust = 0.5) + annotate("rect", xmin = x_g4 - w_func/2, xmax = x_g4 + w_func/2, ymin = y_grid_b - h_func/2, ymax = y_grid_b + h_func/2, fill = "#4a5a7a", colour = "#778899", linewidth = 0.7) + annotate("text", x = x_g4, y = y_grid_b, label = "Construction &\nHandover", colour = "white", size = 4.0, fontface = "bold", lineheight = 0.9, hjust = 0.5, vjust = 0.5) + # ── Legend ───────────────────────────────────────────────────── annotate("rect", xmin = 0.5, xmax = 1.3, ymin = y_legend - 0.28, ymax = y_legend + 0.28, fill = "#1F3864", colour = "#2c7fb8", linewidth = 0.6) + annotate("text", x = 1.5, y = y_legend, label = "GM Level", hjust = 0, colour = "#333333", size = 4.5) + annotate("rect", xmin = 4.2, xmax = 5.0, ymin = y_legend - 0.28, ymax = y_legend + 0.28, fill = "#1a5276", colour = "#2c7fb8", linewidth = 0.6) + annotate("text", x = 5.2, y = y_legend, label = "Project Services Hub", hjust = 0, colour = "#333333", size = 4.5) + annotate("rect", xmin = 9.2, xmax = 10.0, ymin = y_legend - 0.28, ymax = y_legend + 0.28, fill = "#2c3e6a", colour = "#555577", linewidth = 0.6) + annotate("text", x = 10.2, y = y_legend, label = "Division / Department", hjust = 0, colour = "#333333", size = 4.5) + annotate("segment", x = 14.0, xend = 15.1, y = y_legend, yend = y_legend, colour = "#e67e22", linewidth = 1.2, linetype = "dashed", arrow = arrow(length = unit(0.14,"cm"), ends = "both", type = "open")) + annotate("text", x = 15.3, y = y_legend, label = "Project Services Links", hjust = 0, colour = "#333333", size = 4.5) + annotate("segment", x = 18.2, xend = 19.3, y = y_legend, yend = y_legend, colour = "#e67e22", linewidth = 1.2, linetype = "dotted") + annotate("text", x = 19.5, y = y_legend, label = "Functional Link", hjust = 0, colour = "#333333", size = 4.5) + # ── Theme ────────────────────────────────────────────────────── scale_x_continuous(limits = c(0, 20.5), expand = c(0, 0)) + scale_y_continuous(limits = c(4.0, 13.4), expand = c(0, 0)) + theme_void() + theme( plot.background = element_rect(fill = "white", colour = NA), plot.margin = margin(15, 15, 15, 15) ) print(p) ``` My role is cross-functional, sitting at the intersection of project delivery and operational risk management. My core responsibilities span four areas: - **Project controls and scheduling** across OERNL's active capital and infrastructure projects - **Cost management and budget reporting**, including inputs to Board-level capital expenditure (CAPEX) reviews - **Operational performance reporting**, covering project progress, milestone tracking, and variance analysis - **Pipeline incident data analysis**, used to inform project prioritisation, maintenance scheduling, risk budgeting, and regulatory compliance reporting The dataset underpinning this submission was compiled directly from OERNL's internal pipeline incident registers, covering incidents recorded across our operational fields from 2019 to 2025. Analysing this data is not an academic exercise. For me it is a routine input into the decisions my department makes about which infrastructure projects to prioritise and how to size contingency budgets. --- ## 2.2 Why These Five Techniques Are Operationally Relevant ### Technique 1: Text Analytics and Sentiment Analysis Each pipeline incident record contains a free-text description field in which field engineers and operations personnel narrate the cause, sequence of events, and immediate corrective action taken. These narrative fields are currently not systematically analysed in our reporting practice. They are read selectively during incident reviews but never mined at scale. Applying text analytics to this corpus serves three operational purposes: - `r icon(ic_search)` **Pattern extraction:** Term Frequency-Inverse Document Frequency (TF-IDF) and bag-of-words models surface the most distinctive terms associated with high-cost or high-frequency incidents - `r icon(ic_cluster)` **Causal clustering:** Latent Dirichlet Allocation (LDA) topic modelling groups incidents by underlying cause type, such as corrosion, third-party interference, or mechanical fatigue, in a way that manual categorisation misses - `r icon(ic_lightbulb)` **Scope intelligence:** Understanding which failure causes are trending upward changes both what capital projects we initiate and how we rank them in the project pipeline ::: {.callout-note icon=false} **Operational link:** The text analytics output feeds directly into root-cause analysis and capital project scope definition, surfacing failure themes that no structured field in the incident register currently captures. ::: --- ### Technique 2: Monte Carlo Simulation Capital project cost estimation at OERNL is inherently uncertain. Procurement lead times vary, contractor mobilisation costs fluctuate with foreign exchange movements, and incident remediation costs depend on the severity and duration of the event. In the Project Services Department, we are required to present 50th percentile (P50) and 80th percentile (P80) cost estimates to management for budgeting and Board reporting purposes. Monte Carlo simulation addresses this in three ways: - `r icon(ic_database)` **Distribution fitting:** Historical incident cost data is used to fit probability distributions to each uncertain input, replacing assumptions of normality with evidence-based parameters - `r icon(ic_chartbar)` **Probabilistic output:** Running 10,000 simulations produces a full distribution of possible total cost outcomes, from which 10th percentile (P10), P50, and P80 values are extracted - `r icon(ic_shield)` **Contingency justification:** The gap between P50 and P80 becomes the defensible contingency budget figure presented to the Board, grounded in data rather than rule-of-thumb percentages ::: {.callout-note icon=false} **Operational link:** The simulation output replaces single-point cost assumptions with a probability-weighted reserve figure. The P90 value becomes the Board-level contingency recommendation in the 2026 capital plan. ::: --- ### Technique 3: Advanced Forecasting with Prophet Production uptime planning and maintenance scheduling at OERNL require forward visibility on incident frequency and associated downtime. With seven years of monthly incident data covering 2019 to 2025, the dataset contains sufficient temporal depth to support a robust forecasting model. Prophet is the appropriate technique here for three reasons: - `r icon(ic_wave)` **Trend and seasonality decomposition:** Pipeline incidents in the Niger Delta exhibit seasonal patterns driven by weather, community activity cycles, and regulatory inspection windows, which Prophet captures natively - `r icon(ic_calendar)` **Actionable horizon:** A three to six month forward projection of incident counts or cumulative downtime maps directly onto the lead time required for contractor mobilisation and materials procurement - `r icon(ic_check)` **Communicable uncertainty:** Walk-forward cross-validation produces a Root Mean Square Error (RMSE) that I can present to PDI leadership as evidence of forecast reliability, replacing intuition-based planning with quantified confidence intervals ::: {.callout-note icon=false} **Operational link:** The six-month forecast directly determines the 2026 emergency response capacity plan. Manpower levels, containment boom inventory, and recovery vessel deployment are sized against the upper bound of the forecast confidence interval. ::: --- ### Technique 4: Asset Survival Analysis Not all pipeline segments fail at the same rate or at the same age. Current prioritisation in our department relies heavily on engineering judgement and field reports, which introduces subjectivity into decisions that carry significant capital implications. Kaplan-Meier survival analysis treats each pipeline segment as a subject and its first recorded failure as the event, producing empirical survival curves that quantify failure probability over time. This technique serves the following operational functions: - `r icon(ic_listol)` **Replacement queue prioritisation:** Segments with survival probability below a defined threshold within the next 12 months are flagged as candidates for the capital replacement programme - `r icon(ic_contract)` **Business case support:** Survival curves provide statistically defensible evidence for integrity intervention projects when presenting to Finance and the Board - `r icon(ic_layers)` **Segmentation by asset class:** Stratifying curves by location, pipeline type, or operating condition reveals which asset categories carry disproportionate risk, enabling targeted rather than blanket maintenance spend ::: {.callout-note icon=false} **Operational link:** The survival output produces a ranked list of facilities by recurrence hazard. The bottom quartile by median inter-incident time becomes the mandatory inclusion list for the 2026 Q1 integrity inspection programme. ::: --- ### Technique 5: Association Rules (Operational Co-occurrence Analysis) Incident records contain multiple categorical fields including location, cause category, incident type, affected asset class, and response action taken. These fields are currently reviewed in isolation. Association rule mining using the Apriori algorithm examines all combinations simultaneously, revealing which attributes co-occur more frequently than chance would predict. The operational value of this technique is threefold: - `r icon(ic_brain)` **Co-occurrence intelligence:** Rules such as "corrosion at Field X and pipeline age above 15 years implies production deferment exceeding 48 hours" with high confidence and lift provide actionable scheduling inputs that no single-variable analysis would surface - `r icon(ic_wrench)` **Project scope refinement:** If a high-lift rule links a specific cause type to a particular asset class, that finding directly informs the scope of the corresponding capital maintenance project - `r icon(ic_clipboard)` **Protocol design:** Frequent itemsets linking incident types to response actions allow the department to design more targeted emergency response protocols, reducing average resolution time and cost per incident ::: {.callout-note icon=false} **Operational link:** The top association rules by lift define the specific cause, location, and facility type combinations that warrant elevated security presence and proactive community liaison in the 2026 security and community engagement targeting strategy. ::: # 3. Data Collection & Sampling ```{r section3-icons, echo=FALSE} #| message: false #| warning: false ic_source <- "M64 0C28.7 0 0 28.7 0 64V448c0 35.3 28.7 64 64 64H320c35.3 0 64-28.7 64-64V160H256c-17.7 0-32-14.3-32-32V0H64zM256 0V128H384L256 0zM112 256H288c8.8 0 16 7.2 16 16s-7.2 16-16 16H112c-8.8 0-16-7.2-16-16s7.2-16 16-16zm0 64H288c8.8 0 16 7.2 16 16s-7.2 16-16 16H112c-8.8 0-16-7.2-16-16s7.2-16 16-16zm0 64H288c8.8 0 16 7.2 16 16s-7.2 16-16 16H112c-8.8 0-16-7.2-16-16s7.2-16 16-16z" ic_method <- "M471.6 21.7c-21.9-21.9-57.3-21.9-79.2 0L362.3 51.7l97.9 97.9 30.1-30.1c21.9-21.9 21.9-57.3 0-79.2L471.6 21.7zm-299.2 220c-6.1 6.1-10.8 13.6-13.5 21.9l-29.6 88.8c-2.9 8.6-.6 18.1 5.8 24.6s15.9 8.7 24.6 5.8l88.8-29.6c8.2-2.7 15.7-7.4 21.9-13.5L437.7 172.3 339.7 74.3 172.4 241.7zM96 64C43 64 0 107 0 160V416c0 53 43 96 96 96H352c53 0 96-43 96-96V320c0-17.7-14.3-32-32-32s-32 14.3-32 32V416c0 17.7-14.3 32-32 32H96c-17.7 0-32-14.3-32-32V160c0-17.7 14.3-32 32-32h96c17.7 0 32-14.3 32-32s-14.3-32-32-32H96z" ic_frame <- "M0 96C0 60.7 28.7 32 64 32H448c35.3 0 64 28.7 64 64V416c0 35.3-28.7 64-64 64H64c-35.3 0-64-28.7-64-64V96zm64 0v64h64V96H64zm384 0H192v64H448V96zM64 224v64h64V224H64zm384 0H192v64H448V224zM64 352v64h64V352H64zm384 0H192v64H448V352z" ic_period <- "M152 24c0-13.3-10.7-24-24-24s-24 10.7-24 24V64H64C28.7 64 0 92.7 0 128v16 48V448c0 35.3 28.7 64 64 64H384c35.3 0 64-28.7 64-64V192 144 128c0-35.3-28.7-64-64-64H344V24c0-13.3-10.7-24-24-24s-24 10.7-24 24V64H152V24zM48 192H400V448c0 8.8-7.2 16-16 16H64c-8.8 0-16-7.2-16-16V192z" ic_ethics <- "M256 0c4.6 0 9.2 1 13.4 2.9L457.7 82.8c22 9.3 38.4 31 38.3 57.2c-.5 99.2-41.3 280.7-213.7 363.2c-16.7 8-36.1 8-52.8 0C57.3 420.7 16.5 239.2 16 140c-.1-26.2 16.3-47.9 38.3-57.2L242.7 2.9C246.8 1 251.4 0 256 0z" ic_restrict <- "M144 144v48H304V144c0-44.2-35.8-80-80-80s-80 35.8-80 80zM80 192V144C80 64.5 144.5 0 224 0s144 64.5 144 144v48h16c35.3 0 64 28.7 64 64V448c0 35.3-28.7 64-64 64H64c-35.3 0-64-28.7-64-64V256c0-35.3 28.7-64 64-64H80z" ic_sample <- "M488 192H336c-8.8 0-16-7.2-16-16V32c0-17.7-14.3-32-32-32s-32 14.3-32 32V176c0 8.8-7.2 16-16 16H88c-8.8 0-16 7.2-16 16v96c0 8.8 7.2 16 16 16h64v160c0 17.7 14.3 32 32 32s32-14.3 32-32V320h64c8.8 0 16 7.2 16 16v160c0 17.7 14.3 32 32 32s32-14.3 32-32V336c0-8.8 7.2-16 16-16h64c8.8 0 16-7.2 16-16V208c0-8.8-7.2-16-16-16z" ``` ## 3.1 Data Source The dataset for this submission was drawn from two complementary internal records maintained by the Production Health, Safety and Environment (HSE) and Field Operations department at Oando Energy Resources Nigeria Limited (OERNL): - `r icon(ic_frame)` **The operator's HSE incident register:** A structured internal log updated by field personnel following every reportable spill, leak, or pipeline breach across all operational assets. Each entry captures incident identity, location, cause, category, spill volumes, deferred production, and remediation cost where applicable - `r icon(ic_method)` **Joint Investigation Visit (JIV) reports:** Formal post-incident investigation documents completed by a joint team of operator and regulatory representatives following significant incidents. JIV reports provide the narrative descriptions used in the text analytics technique and supplement quantitative fields in the register with verified cause classifications and cost estimates Both sources feed into a single consolidated incident management system maintained at the asset level and archived annually in structured Excel workbooks, one per calendar year. ::: {.callout-note icon=false} `r icon(ic_source)` **Primary Source:** OERNL Production HSE and Field Operations Department. Both the HSE incident register and Joint Investigation Visit reports were accessed directly by the author in the capacity of Project Services Manager. ::: --- ## 3.2 Collection Method The data was personally extracted and compiled by the author directly from OERNL's internal incident management system in the capacity of Project Services Manager within the Project Development and Infrastructure (PDI) division. The extraction process involved the following steps: - `r icon(ic_method)` **File retrieval:** Seven annual Excel workbooks covering 2019 to 2025 were retrieved from the organisational records system, each containing one or more sheets corresponding to reporting periods or asset groupings within that year - `r icon(ic_frame)` **Consolidation:** All sheets across all seven files were programmatically merged in R using the `readxl` and `purrr` packages, with a custom column harmonisation routine applied to reconcile the 28 variables across years where field names had changed between reporting cycles - `r icon(ic_method)` **Standardisation:** Date fields were parsed using a multi-format date parser to handle inconsistencies across years. Categorical fields including cause, category, area, and closeby facility were standardised using regular expression matching to collapse variant spellings into canonical values - `r icon(ic_frame)` **Validation:** The consolidated dataset was cross-checked against known annual totals and spot-checked against source files to confirm row counts and key field integrity ```{r collection-timeline, echo=FALSE, fig.cap="Data collection and consolidation workflow", fig.align='center'} #| message: false #| warning: false steps <- data.frame( x = c(1, 2, 3, 4), label = c("Annual HSE\nExcel Files\n(2019 to 2025)", "Column\nHarmonisation\n& Merging", "Date & Category\nStandardisation", "Validated\nConsolidated\nDataset"), type = c("input", "process", "process", "output") ) arrows <- data.frame( x = c(1.35, 2.35, 3.35), xend = c(1.65, 2.65, 3.65), y = c(1, 1, 1) ) ggplot() + geom_tile(data = steps, aes(x = x, y = 1, fill = type), width = 0.65, height = 0.55, colour = "#4a4a4a", linewidth = 0.5) + geom_text(data = steps, aes(x = x, y = 1, label = label), size = 2.8, fontface = "bold", colour = "#1a1a1a", lineheight = 0.9) + geom_segment(data = arrows, aes(x = x, xend = xend, y = y, yend = y), arrow = arrow(length = unit(0.2, "cm"), type = "closed"), colour = "#4a4a4a", linewidth = 0.6) + scale_fill_manual(values = c( "input" = "#a8c8e8", "process" = "#d9e8f5", "output" = "#2c7fb8" )) + scale_x_continuous(limits = c(0.5, 4.5)) + scale_y_continuous(limits = c(0.5, 1.5)) + theme_void() + theme( legend.position = "none", plot.margin = margin(10, 10, 10, 10) ) ``` ::: {.callout-note icon=false} `r icon(ic_method)` **Collection approach:** Direct extraction from the internal HSE system by the author. All seven annual workbooks were personally retrieved, merged, standardised, and validated before analysis. ::: --- ## 3.3 Sampling Frame This study uses a **census rather than a sample**. The sampling frame comprises every pipeline spill and non-leak incident reported under OERNL's HSE management system across all operational assets from January 2019 to the most recent completed reporting period in 2025. No probabilistic sampling was applied because the objective is to analyse the full population of operational incidents rather than generalise from a subset. Key parameters of the sampling frame are summarised below: ```{r sampling-summary, echo=FALSE} #| message: false #| warning: false sampling_df <- data.frame( Parameter = c( "Population", "Sampling approach", "Inclusion criteria", "Exclusion criteria", "Geographic scope", "Temporal scope", "Unit of observation" ), Detail = c( "All pipeline incidents reported under the OERNL HSE management system", "Census — full enumeration of all reportable incidents within scope", "All spill and non-leak incidents with a recorded incident date from 2019 to 2025", "Incidents with no recorded date, duplicate reference numbers, and test or placeholder rows removed during cleaning", "All OERNL operational assets across Bayelsa, Rivers, Delta, Imo, and Edo States", "January 2019 to December 2025 (seven complete calendar years)", "Individual pipeline incident record" ) ) kableExtra::kable(sampling_df, col.names = c("Parameter", "Detail"), caption = "Table S1: Sampling Frame Parameters") %>% kableExtra::kable_styling( bootstrap_options = c("striped", "hover", "condensed"), full_width = TRUE ) %>% kableExtra::column_spec(1, bold = TRUE, width = "25%") ``` ::: {.callout-note icon=false} `r icon(ic_frame)` **Sampling approach:** Census — full enumeration of all reportable incidents within scope. No probabilistic sampling was required because the analytical objective is to characterise the full incident population, not to estimate population parameters from a subset. ::: --- ## 3.4 Sample Size and Statistical Rationale The initial extraction yielded approximately 1,750 records across the seven annual workbooks. Following removal of rows with no incident date, duplicate reference numbers, and structurally empty rows introduced by the source formatting, the cleaned dataset comprises **1,695 incidents** across **446 unique facilities**. This sample size comfortably satisfies the minimum requirements for all five analytical techniques applied in this submission: ```{r sample-requirements, echo=FALSE} #| message: false #| warning: false req_df <- data.frame( Technique = c( "Text Analytics and Sentiment Analysis", "Monte Carlo Simulation", "Advanced Forecasting with Prophet", "Asset Survival Analysis", "Association Rules" ), Minimum_Required = c( "50 text documents", "3 or more uncertain inputs with historical data", "36 monthly time periods", "100 or more observations with event variable", "Sufficient transactions for rule support thresholds" ), Available_in_Dataset = c( "1,695 incident description text fields", "1,695 cost and spill volume records across 7 years", "84 monthly periods (January 2019 to December 2025)", "1,695 facility-level incident records", "1,695 multi-attribute incident records" ), Status = c("Met", "Met", "Met", "Met", "Met") ) kableExtra::kable(req_df, col.names = c("Technique", "Minimum Required", "Available in Dataset", "Status"), caption = "Table S2: Minimum Data Requirements vs Available Data") %>% kableExtra::kable_styling( bootstrap_options = c("striped", "hover", "condensed"), full_width = TRUE ) %>% kableExtra::column_spec(1, bold = TRUE, width = "28%") %>% kableExtra::column_spec(4, bold = TRUE, color = ifelse(req_df$Status == "Met", "green", "red")) ``` ::: {.callout-note icon=false} `r icon(ic_sample)` **Final sample:** 1,695 incidents after cleaning, drawn from an initial extract of approximately 1,750 records across seven annual workbooks. All five technique minimum data requirements are met with significant margin. ::: --- ## 3.5 Time Period Covered The dataset spans seven complete calendar years from January 2019 to December 2025. This temporal range was selected for three reasons: - `r icon(ic_period)` **Analytical sufficiency:** 84 monthly periods exceed the minimum 36 periods required by Prophet for reliable trend and seasonality decomposition, and provide sufficient historical depth for walk-forward cross-validation - `r icon(ic_wave)` **Operational relevance:** The period captures the full post-2019 operating environment at OERNL, including the production disruption effects of the 2020 oil price shock and the subsequent recovery phase, both of which are visible in the incident frequency trend - `r icon(ic_calendar)` **Regulatory alignment:** OERNL's HSE reporting cycle is structured annually, and the 2019 start point corresponds to the first year for which all seven reporting fields used in this analysis were consistently populated across all assets ```{r incidents-by-year, echo=FALSE, fig.cap="Annual incident count 2019 to 2025 — temporal coverage of the dataset"} #| message: false #| warning: false incidents %>% count(year) %>% ggplot(aes(x = year, y = n)) + geom_col(fill = "#2c7fb8", width = 0.6) + geom_text(aes(label = n), vjust = -0.4, fontface = "bold", size = 3.5) + scale_x_continuous(breaks = 2019:2025) + scale_y_continuous(expand = expansion(mult = c(0, 0.12))) + labs( title = "Pipeline Incidents by Year (2019 to 2025)", subtitle = "Seven complete calendar years — 84 monthly periods available for forecasting", x = "Year", y = "Number of Incidents" ) + theme_minimal(base_size = 12) + theme( plot.title = element_text(face = "bold"), panel.grid.major.x = element_blank(), panel.grid.minor = element_blank() ) ``` ::: {.callout-note icon=false} `r icon(ic_period)` **Coverage:** January 2019 to December 2025, spanning seven complete calendar years and 84 monthly periods. This temporal depth supports all five analytical techniques including the Prophet forecasting model which requires a minimum of 36 monthly observations. ::: --- ## 3.6 Ethical Notes and Data Sharing Restrictions The data used in this submission is internal operational data owned by Oando Energy Resources Nigeria Limited. The following ethical and data governance considerations apply. ::: {.callout-note icon=false} `r icon(ic_ethics)` **Ethics status:** Internal organisational data — no formal ethics approval required ::: - `r icon(ic_ethics)` **No personal data:** The incident register does not contain personally identifiable information (PII) such as employee names, staff identification numbers, or individual contractor details. All personnel references in JIV narratives have been removed or replaced with role descriptors during the extraction process - `r icon(ic_restrict)` **GPS coordinates:** Precise geographic coordinates (latitude and longitude) are retained in the raw dataset for analytical purposes but are kept at facility level only. No individual well-level or sub-facility coordinates are published in this document. The `lat_raw` and `long_raw` fields in the consolidated dataset represent facility centroids, not precise pipeline breach locations - `r icon(ic_restrict)` **Commercial sensitivity:** Specific cost figures (the `cost_usd` field) are available for only 51 incidents and are presented in aggregate or distributional form only. No individual project cost records are reproduced in this document - `r icon(ic_ethics)` **Authorisation:** Data was accessed and extracted by the author in the capacity of Project Services Manager, a role with legitimate operational access to the HSE incident management system. No external consent or institutional ethics approval was required as the data is internal, anonymised at source, and used solely for academic analysis within the scope of the Lagos Business School MMBA programme - `r icon(ic_restrict)` **Data availability:** The full dataset is available on request from the author. A version with facility names replaced by codes is available for sharing purposes should the submission be reviewed externally ::: {.callout-warning} In accordance with OERNL's internal data governance policy, facility names used in this document reflect publicly known asset names associated with OERNL's published operational portfolio. No confidential asset identifiers, internal reference codes, or commercially sensitive operational parameters beyond those already in the public domain are disclosed. ::: # 4. Data Description ```{r data-description, echo=FALSE} #| message: false #| warning: false incidents <- incidents %>% mutate( state = stringr::str_remove(state, " State$"), category = ifelse(is.na(category) | category == "NA", "Unknown", category), facility_cat = stringr::str_to_title(stringr::str_squish(facility_cat)), facility_cat = case_when( stringr::str_detect(facility_cat, "(?i)pipeline.*oil|oil.*pipeline") ~ "Pipeline (Oil)", stringr::str_detect(facility_cat, "(?i)pipeline.*gas|gas.*pipeline") ~ "Pipeline (Gas)", stringr::str_detect(facility_cat, "(?i)^pipeline$") ~ "Pipeline (Oil)", TRUE ~ facility_cat ), closeby_facility = toupper(stringr::str_squish(closeby_facility)), closeby_facility = case_when( closeby_facility %in% c("EBOCHA", "EOC") ~ "EOC/EBOCHA", closeby_facility %in% c("KWALE", "KWALE GAS PLANT", "KWALE/OKPAI GAS PLANT") ~ "KWALE/OKPAI", closeby_facility %in% c("OB/OB", "OB/OB GAS PLANT") ~ "OB/OB", closeby_facility %in% c("OGBOINBIRI", "OGBAINBIRI") ~ "OGBOINBIRI", closeby_facility %in% c("IRRI", "IRRI F/S") ~ "IRRI", TRUE ~ closeby_facility ), area = toupper(stringr::str_squish(area)), area = case_when( closeby_facility %in% c("TEBIDABA", "BRASS TERMINAL", "CLOUGH CREEK", "OGBOINBIRI", "OBAMA") ~ "SAR", closeby_facility %in% c("OB/OB", "OSHIE", "IDU", "AKRI", "KWALE/OKPAI", "EOC/EBOCHA", "IRRI") ~ "LAR", area %in% c("LAR", "AKRI", "KWALE", "IRRI", "OB/OB", "EOC", "PHC", "IDU", "OSHIE") ~ "LAR", area %in% c("SAR", "TEBIDABA", "OGBAINBIRI", "CLOUGH CREEK", "OBAMA", "BRASS") ~ "SAR", TRUE ~ area ) ) ``` The consolidated dataset contains 1,695 pipeline incidents recorded across seven years (2019 to 2025), drawn from 446 unique facilities across five states in Nigeria's Niger Delta region. The dataset comprises 28 variables capturing incident identity, geography, timing, cause, category, spill volumes, and remediation cost. Key numeric variables show significant right skew: spill volumes range from 0 to 5,054 barrels with a median of just 1 barrel, reflecting the dominance of small-scale theft taps alongside rare catastrophic spills. Cost data is available for only 51 incidents, predominantly from sabotage records in 2019 and 2021, a limitation acknowledged in Section 11. Geographically, incidents are concentrated in Bayelsa and Rivers States, which together account for over 87% of all records. Sabotage is the dominant incident category at 74% of all cases, with Oil Theft and Hacksaw Cut the two most frequent causes, consistent with the chronic third-party interference challenges facing upstream operators in the Niger Delta. The tables below summarise the key dimensions of the dataset. ```{r incidents-by-state, echo=TRUE, fig.cap="Incident count by state — Bayelsa and Rivers together account for over 87% of all records"} #| message: false #| warning: false state_data <- incidents |> dplyr::filter(!is.na(state)) |> dplyr::count(state, name = "incidents") |> dplyr::mutate( state = forcats::fct_reorder(state, incidents), highlight = ifelse(state %in% c("Bayelsa", "Rivers"), "Top 2", "Other") ) ggplot(state_data, aes(x = incidents, y = state, fill = highlight)) + geom_col(width = 0.6) + geom_text(aes(label = scales::comma(incidents)), hjust = -0.2, size = 3.5, fontface = "bold") + scale_fill_manual(values = c("Top 2" = "#2c7fb8", "Other" = "#a8c8e8")) + scale_x_continuous(expand = expansion(mult = c(0, 0.15))) + labs( title = "Pipeline Incidents by State (2019 to 2025)", subtitle = "Bayelsa and Rivers States account for over 87% of all recorded incidents", x = "Number of Incidents", y = NULL, fill = NULL ) + theme_minimal(base_size = 12) + theme( legend.position = "bottom", plot.title = element_text(face = "bold"), panel.grid.major.y = element_blank() ) ``` ```{r incidents-by-category, echo=TRUE, fig.cap="Incident category breakdown — Sabotage accounts for 74% of all cases"} #| message: false #| warning: false category_data <- incidents |> dplyr::filter(!is.na(category)) |> dplyr::count(category, name = "count") |> dplyr::mutate( pct = count / sum(count), category = forcats::fct_reorder(category, count), highlight = ifelse(category == "Sabotage", "Sabotage", "Other") ) ggplot(category_data, aes(x = count, y = category, fill = highlight)) + geom_col(width = 0.6) + geom_text(aes(label = paste0(scales::comma(count), " (", scales::percent(pct, accuracy = 1), ")")), hjust = -0.1, size = 3.2, fontface = "bold") + scale_fill_manual(values = c("Sabotage" = "#e05c2a", "Other" = "#a8c8e8")) + scale_x_continuous(expand = expansion(mult = c(0, 0.25))) + labs( title = "Incident Category Distribution (2019 to 2025)", subtitle = "Sabotage dominates, comprising the majority of all 1,695 recorded incidents", x = "Number of Incidents", y = NULL, fill = NULL ) + theme_minimal(base_size = 12) + theme( legend.position = "bottom", plot.title = element_text(face = "bold"), panel.grid.major.y = element_blank() ) ``` ```{r spill-volume-distribution, echo=TRUE, fig.cap="Spill volume distribution — heavy right skew driven by rare catastrophic events"} #| message: false #| warning: false ggplot(incidents |> dplyr::filter(!is.na(qty_spilled_bbl) & qty_spilled_bbl > 0), aes(x = qty_spilled_bbl)) + geom_histogram(bins = 60, fill = "#2c7fb8", colour = "white", linewidth = 0.3) + geom_vline(xintercept = 1, colour = "#e05c2a", linetype = "dashed", linewidth = 0.8) + annotate("text", x = 80, y = Inf, vjust = 1.5, label = "Median = 1 barrel", colour = "#e05c2a", fontface = "italic", size = 3.5) + scale_x_continuous( labels = scales::comma, trans = "log1p", breaks = c(0, 1, 10, 100, 1000, 5000) ) + labs( title = "Distribution of Spill Volumes (Non-Zero Incidents)", subtitle = "Log scale reveals extreme right skew — median of 1 barrel alongside spills exceeding 5,000 barrels", x = "Spill Volume (Barrels, log scale)", y = "Number of Incidents" ) + theme_minimal(base_size = 12) + theme( plot.title = element_text(face = "bold"), panel.grid.minor = element_blank() ) ``` ```{r data-tables, echo=FALSE} #| message: false #| warning: false incidents %>% count(year) %>% rename(Year = year, Incidents = n) %>% kableExtra::kable(caption = "Table 1: Incidents by Year (2019 to 2025)") %>% kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = FALSE) incidents %>% filter(!category %in% c("TBD", "Tbd", "Unknown")) %>% count(category, sort = TRUE) %>% rename(Category = category, Incidents = n) %>% kableExtra::kable(caption = "Table 2: Incidents by Category") %>% kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = FALSE) incidents %>% filter( !category %in% c("TBD", "Tbd", "Unknown"), !is.na(cause), !cause %in% c("Tbd", "TBD", "NA") ) %>% count(category, cause, sort = TRUE) %>% head(15) %>% rename(Category = category, Cause = cause, Incidents = n) %>% kableExtra::kable(caption = "Table 3: Top 15 Incidents by Category and Cause") %>% kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = TRUE) incidents %>% filter(!is.na(state), state != "NA") %>% count(state, sort = TRUE) %>% head(8) %>% rename(State = state, Incidents = n) %>% kableExtra::kable(caption = "Table 4: Top States by Incident Count") %>% kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = FALSE) incidents %>% filter(!is.na(facility_cat)) %>% count(facility_cat, sort = TRUE) %>% head(8) %>% rename(`Facility Type` = facility_cat, Incidents = n) %>% kableExtra::kable(caption = "Table 5: Top 8 Facility Types") %>% kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = FALSE) incidents %>% filter(area %in% c("LAR", "SAR")) %>% count(area, sort = TRUE) %>% rename(Area = area, Incidents = n) %>% kableExtra::kable(caption = "Table 6: Incidents by Area (SAR and LAR)") %>% kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = FALSE) incidents %>% filter( area %in% c("LAR", "SAR"), !is.na(closeby_facility), closeby_facility != "NA" ) %>% count(area, closeby_facility, sort = TRUE) %>% head(15) %>% rename(Area = area, `Closeby Facility` = closeby_facility, Incidents = n) %>% kableExtra::kable(caption = "Table 7: Top 15 Incidents by Area and Closeby Facility") %>% kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = TRUE) ``` --- ## 4.1 Geographic Distribution of Incidents The geographic concentration of incidents is one of the most operationally significant features of the dataset. With 88.1% of records carrying facility-level coordinates, the dataset supports direct spatial visualisation of where incidents occur, how they cluster by category, and whether the geographic pattern has shifted between 2019 and 2025. The interactive map below allows each incident to be examined individually. The faceted static map provides a year-by-year comparison of the spatial distribution across the full seven-year period. ```{r coord-parse, echo=TRUE} #| cache: true #| message: false #| warning: false parse_dms <- function(x) { x <- as.character(x) x <- stringr::str_trim(x) hemi <- stringr::str_extract(x, "^[NSEW]") degrees <- as.numeric(stringr::str_extract(x, "(?<=[NSEW]\\s)\\d+(?=°)")) minutes <- as.numeric(stringr::str_extract(x, "(?<=°\\s)\\d+(?=')")) seconds <- as.numeric(stringr::str_extract(x, "(?<='\\s)\\d+\\.?\\d*(?='')")) decimal <- degrees + minutes / 60 + seconds / 3600 decimal <- ifelse(hemi %in% c("S", "W"), -decimal, decimal) decimal } incidents_mapped <- incidents %>% filter(!is.na(lat_raw), !is.na(long_raw)) %>% mutate( lat = parse_dms(lat_raw), long = parse_dms(long_raw) ) %>% filter( !is.na(lat), !is.na(long), lat > 3 & lat < 7, long > 4 & long < 9 ) ``` ```{r incident-map, echo=TRUE, fig.cap="Geographic distribution of pipeline incidents (2019 to 2025) — click any point for full incident details"} #| cache: true #| message: false #| warning: false library(leaflet) library(leaflet.extras) cat_levels <- incidents_mapped %>% count(category, sort = TRUE) %>% pull(category) pal <- colorFactor( palette = c("#e05c2a", "#2c7fb8", "#27ae60", "#9b59b6", "#e67e22", "#1abc9c", "#e74c3c", "#95a5a6"), domain = cat_levels ) incidents_mapped <- incidents_mapped %>% mutate( radius = case_when( is.na(qty_spilled_bbl) | qty_spilled_bbl == 0 ~ 4, qty_spilled_bbl < 10 ~ 5, qty_spilled_bbl < 100 ~ 7, qty_spilled_bbl < 500 ~ 10, TRUE ~ 14 ), popup_text = glue::glue( "Facility: {facility} ", "Date: {format(date_incident, '%d %b %Y')} ", "Category: {category} ", "Cause: {cause} ", "Spill Volume: {ifelse(is.na(qty_spilled_bbl), 'Not recorded', paste0(round(qty_spilled_bbl, 1), ' bbls'))} ", "State: {state} ", "Area: {area}" ) ) leaflet(incidents_mapped) %>% addProviderTiles(providers$CartoDB.Positron) %>% addCircleMarkers( lng = ~long, lat = ~lat, radius = ~radius, color = ~pal(category), fillColor = ~pal(category), fillOpacity = 0.7, stroke = TRUE, weight = 0.8, opacity = 0.9, popup = ~popup_text, clusterOptions = markerClusterOptions( showCoverageOnHover = FALSE, zoomToBoundsOnClick = TRUE, maxClusterRadius = 40 ) ) %>% addLegend( position = "bottomright", pal = pal, values = ~category, title = "Incident Category", opacity = 0.9 ) %>% addScaleBar(position = "bottomleft") %>% setView(lng = 6.2, lat = 4.9, zoom = 8) ``` ```{r incident-map-year, echo=TRUE, fig.cap="Annual incident distribution by geography (2019 to 2025) — year-by-year comparison of spatial concentration"} #| cache: true #| message: false #| warning: false incidents_mapped %>% filter(!is.na(year)) %>% ggplot(aes(x = long, y = lat, colour = category)) + geom_point(alpha = 0.5, size = 0.9) + facet_wrap(~ year, ncol = 4) + scale_colour_manual( values = c( "Sabotage" = "#e05c2a", "Operational Control" = "#2c7fb8", "TBD" = "#95a5a6" ), na.value = "#bdc3c7" ) + coord_fixed(ratio = 1) + labs( title = "Geographic Distribution of Pipeline Incidents by Year (2019 to 2025)", subtitle = "Each point represents one incident — colour indicates category", x = "Longitude", y = "Latitude", colour = "Category" ) + theme_minimal(base_size = 10) + theme( plot.title = element_text(face = "bold"), legend.position = "bottom", panel.grid.minor = element_blank(), axis.text = element_text(size = 7) ) ``` --- ## 4.2 Analytical Framework and Research Question ### 4.2.1 The Central Analytical Question The OERNL pipeline incident dataset represents seven years of operational history across the Niger Delta, covering 1,695 recorded incidents at 446 unique facilities. In its raw form, this data is already used for regulatory reporting and periodic HSE reviews. What it has not been used for is systematic, evidence-based planning that moves beyond counting incidents and begins to answer the question of what the organisation should do differently in the year ahead. As Project Services Manager, the decisions I face heading into the 2026 annual planning cycle span budget sizing, maintenance prioritisation, emergency response capacity, and security targeting. Each of these decisions is currently made on the basis of experience, engineering judgement, and partial data. This submission tests whether a structured analytical approach to the same dataset can produce more defensible, more precise, and more actionable answers. The central question driving this analysis is: ::: {.callout-important icon=false} **"What operational and risk management decisions should OERNL's Project Services Department prioritise in 2026, based on a systematic analysis of the company's 2019 to 2025 pipeline incident portfolio?"** Specifically: - `r icon(ic_search)` **(a) Root-cause intelligence:** What latent failure themes are embedded in incident narrative descriptions that the structured cause field does not capture, and how should these themes reprioritise the 2026 operational risk register? - `r icon(ic_shield)` **(b) Financial exposure:** What cleanup and remediation budget reserve should be recommended at P90 confidence level, accounting for the full uncertainty in historical cost and production deferral data? - `r icon(ic_wave)` **(c) Incident volume outlook:** What is the projected monthly incident volume over the next six months, and what does this imply for 2026 emergency response capacity in terms of manpower, containment equipment, and recovery vessel deployment? - `r icon(ic_listol)` **(d) Facility risk prioritisation:** Which facilities exhibit the shortest expected time to next incident, and which should therefore be elevated to the top quartile of the 2026 integrity inspection schedule? - `r icon(ic_brain)` **(e) Sabotage pattern detection:** Which combinations of cause type, location, and facility category are most strongly associated with sabotage classification, and how should these combinations inform the 2026 security and community engagement targeting strategy? ::: This question is not resolvable by a single technique. It requires five distinct analytical lenses, each addressing a specific dimension of the 2026 planning problem. The answers are not independent. They are designed to converge in Section 10 into a single, integrated operational recommendation. --- ### 4.2.2 Technique-to-Decision Mapping The five sub-questions that structure this analysis each correspond to a concrete planning decision facing the Project Services Department. The table below makes this mapping explicit and defines the standard against which all outputs in Sections 5 to 9 should be read. ```{r decision-mapping, echo=FALSE} #| message: false #| warning: false mapping_df <- data.frame( Section = c("Section 5", "Section 6", "Section 7", "Section 8", "Section 9"), Technique = c( "Text Analytics and Sentiment Analysis", "Monte Carlo Simulation", "Advanced Forecasting with Prophet", "Asset Survival Analysis", "Association Rules" ), Sub_Question = c("(a)", "(b)", "(c)", "(d)", "(e)"), Operational_Decision = c( "Surface latent root-cause themes in incident narrative fields that are missed by the structured cause field. Outputs feed directly into the operational risk register prioritisation for 2026", "Recommend the cleanup and remediation budget reserve at the 90th percentile (P90) confidence level, including production deferral exposure, for inclusion in the 2026 capital plan", "Generate a six-month forward forecast of monthly incident volume to size the 2026 emergency response capacity — manpower, containment boom inventory, and recovery vessel deployment", "Identify the top quartile of facilities by hazard rate (shortest expected time to next incident) for prioritised integrity inspection scheduling in the 2026 maintenance programme", "Find cause, location, and facility category rule combinations with high lift on Sabotage classification. Outputs feed the security and community engagement targeting strategy" ) ) kableExtra::kable(mapping_df, col.names = c("Section", "Technique", "Sub-Question", "2026 Operational Decision"), caption = "Table F1: Analytical Framework — Technique to Decision Mapping") %>% kableExtra::kable_styling( bootstrap_options = c("striped", "hover", "condensed"), full_width = TRUE ) %>% kableExtra::column_spec(1, bold = TRUE, width = "8%") %>% kableExtra::column_spec(2, width = "22%") %>% kableExtra::column_spec(3, width = "10%", bold = TRUE) %>% kableExtra::column_spec(4, width = "50%") ``` --- ### 4.2.3 Structure of the Analytical Sections The five technique sections that follow are written as a professional analytical report, not as a sequence of independent exercises. Each section is organised into four parts to support comparability across techniques and to ensure that every output is connected to a business decision: **Technique overview** — a concise explanation of the method, its statistical foundations, and the conditions under which it is appropriately applied. This part establishes that the technique is being applied correctly and with an understanding of its assumptions and limitations. **Operational justification** — an explanation of why this technique is the right tool for the specific business question it addresses, given the structure and characteristics of the OERNL incident dataset. This part demonstrates that the choice of technique was deliberate and data-driven, not arbitrary. **Analysis and outputs** — fully reproducible R and Python implementations presented in parallel tabsets, with all charts, model outputs, and diagnostic results rendered inline. Code is visible and structured for reproducibility and all outputs are captioned and interpreted. **Management interpretation** — a plain-language reading of the key findings, written for a non-technical operations manager or Board member. Every result is restated in terms of what it means for a specific 2026 planning decision, without reference to statistical jargon. This is the part that connects analytical output to organisational action. The integrated findings in Section 10 draw all five outputs together into a single prioritised recommendation for the 2026 operating plan, addressing the central analytical question in full. # 5. Technique 1: Text Analytics and Sentiment Analysis ## 5.1 Technique Overview Text analytics, as described by Marr (2017), is the process of extracting value from large quantities of unstructured text data, transforming narrative content that does not fit neatly into rows and columns into structured, queryable intelligence. The technique encompasses five core tasks: text categorisation, text clustering, concept extraction, sentiment analysis, and document summarisation. In this analysis, three of these tasks are applied: - **Term Frequency-Inverse Document Frequency (TF-IDF)** for concept extraction - **Latent Dirichlet Allocation (LDA)** for topic clustering - **AFINN and VADER (Valence Aware Dictionary and Sentiment Reasoner)** lexicon scoring for sentiment analysis The statistical foundations of these methods rest on two principles. TF-IDF weights a term by how frequently it appears in a document relative to how rarely it appears across all documents — terms that are distinctive to a subset of records score highly, while common words score near zero. LDA is a generative probabilistic model that assumes each document is a mixture of latent topics, and each topic is a distribution over words; the algorithm infers both distributions simultaneously from the observed word counts. Sentiment scoring assigns polarity values to words or phrases using pre-built lexicons. AFINN assigns integer scores from negative five to positive five to over 2,400 English words, while VADER is specifically calibrated for short, informal text and returns a compound score between negative one and positive one. The key assumption shared by all three methods is the bag-of-words representation: word order is discarded and only word frequency is retained. This is a simplification that loses grammatical structure but is computationally tractable and well-suited to the short, formulaic language typical of incident reports. --- ## 5.2 Operational Justification The OERNL incident dataset contains a free-text `description` field populated by field engineers and operations personnel at the time of each incident. This field is not currently analysed systematically. It is reviewed selectively during post-incident investigations but never mined across the full 1,695-record corpus. Three specific gaps in the existing structured fields make text analytics necessary here: - `r icon(ic_search)` **The cause field is a post-hoc classification**, assigned after investigation and subject to categorisation bias. The description field captures the raw narrative before any classification decision is made, and may therefore contain causal signals that the structured field obscures or aggregates away - `r icon(ic_cluster)` **LDA can surface latent themes** that do not correspond to any existing category in the cause or category fields. Operational failures that are described differently across years but share an underlying mechanism are only visible at the corpus level - `r icon(ic_lightbulb)` **Sentiment trends in incident language** provide a proxy for the severity and urgency of incidents over time that no numeric field in the dataset captures directly. A shift toward more negative, blame-heavy language may signal deteriorating operational conditions before they appear in cost or volume statistics The output of this section feeds directly into sub-question (a) of the central analytical question: what latent root-cause themes should reprioritise the 2026 operational risk register? --- ## 5.3 Analysis and Outputs ```{r tech1-setup, echo=FALSE} #| message: false #| warning: false library(tidytext) library(topicmodels) library(ldatuning) library(syuzhet) library(wordcloud) library(ggrepel) ``` ::: {.panel-tabset} ## R ```{r tech1-r, fig.width=10, fig.height=7} #| cache: true # ── Step 1: Build tidy token dataset ───────────────────────────── incident_tokens <- incidents %>% filter(!is.na(description), nchar(description) > 15) %>% mutate(doc_id = row_number()) %>% dplyr::select(doc_id, year, category, description) %>% unnest_tokens(word, description) %>% anti_join(stop_words, by = "word") %>% filter(!str_detect(word, "^\\d+$")) cat("Total tokens after cleaning:", nrow(incident_tokens), "\n") cat("Unique documents with descriptions:", n_distinct(incident_tokens$doc_id), "\n") # ── Step 2: TF-IDF — top distinctive terms by category ─────────── tfidf <- incident_tokens %>% count(category, word, sort = TRUE) %>% bind_tf_idf(word, category, n) %>% arrange(desc(tf_idf)) top_tfidf <- tfidf %>% filter(!category %in% c("TBD", "Unknown")) %>% group_by(category) %>% slice_max(tf_idf, n = 8, with_ties = FALSE) %>% ungroup() top_tfidf %>% mutate(word = reorder_within(word, tf_idf, category)) %>% ggplot(aes(tf_idf, word, fill = category)) + geom_col(show.legend = FALSE) + facet_wrap(~ category, scales = "free", ncol = 2) + scale_y_reordered() + scale_fill_brewer(palette = "Set2") + labs( title = "Most Distinctive Terms in Incident Descriptions by Category", subtitle = "TF-IDF weighting surfaces terms unique to each incident category", x = "TF-IDF Score", y = NULL ) + theme_minimal(base_size = 11) + theme(plot.title = element_text(face = "bold")) # ── Step 3: Document-term matrix for LDA ───────────────────────── doc_term <- incident_tokens %>% count(doc_id, word) %>% cast_dtm(doc_id, word, n) cat("DTM dimensions:", dim(doc_term), "\n") # ── Step 4: Coherence-based K selection ────────────────────────── k_search <- FindTopicsNumber( doc_term, topics = 2:8, metrics = c("CaoJuan2009", "Deveaud2014"), control = list(seed = 42), verbose = FALSE ) FindTopicsNumber_plot(k_search) # ── Step 5: Fit LDA at optimal K ───────────────────────────────── best_k <- 5 lda_model <- LDA(doc_term, k = best_k, control = list(seed = 42)) topic_terms <- tidy(lda_model, matrix = "beta") %>% group_by(topic) %>% slice_max(beta, n = 10, with_ties = FALSE) %>% ungroup() topic_labels <- c( "1" = "Topic 1: Pipeline Breach\n& Physical Damage", "2" = "Topic 2: Production\nDeferral Impact", "3" = "Topic 3: Third-Party\nInterference", "4" = "Topic 4: Spill Response\n& Remediation", "5" = "Topic 5: Corrosion\n& Integrity Failure" ) topic_terms %>% mutate( topic = factor(paste0("Topic ", topic), levels = paste0("Topic ", 1:best_k)), term = reorder_within(term, beta, topic) ) %>% ggplot(aes(beta, term, fill = factor(topic))) + geom_col(show.legend = FALSE) + facet_wrap(~ topic, scales = "free", ncol = 3) + scale_y_reordered() + scale_fill_brewer(palette = "Set2") + labs( title = "LDA Topic Model: Top Terms per Topic", subtitle = glue::glue("K = {best_k} topics selected via CaoJuan2009 and Deveaud2014 coherence metrics"), x = "Per-Topic Word Probability (Beta)", y = NULL ) + theme_minimal(base_size = 10) + theme(plot.title = element_text(face = "bold")) # ── Step 6: Sentiment trend via AFINN ──────────────────────────── sentiment_by_year <- incidents %>% filter(!is.na(description), nchar(description) > 15) %>% mutate(sentiment = syuzhet::get_sentiment(description, method = "afinn")) %>% group_by(year) %>% summarise( mean_sent = mean(sentiment, na.rm = TRUE), median_sent = median(sentiment, na.rm = TRUE), n = n(), .groups = "drop" ) ggplot(sentiment_by_year, aes(x = year)) + geom_hline(yintercept = 0, linetype = "dashed", colour = "grey60") + geom_line(aes(y = mean_sent), linewidth = 1.1, colour = "#2c7fb8") + geom_point(aes(y = mean_sent, size = n), colour = "#2c7fb8") + geom_text(aes(y = mean_sent, label = round(mean_sent, 2)), vjust = -1, size = 3.2, colour = "#2c7fb8", fontface = "bold") + scale_x_continuous(breaks = 2019:2025) + scale_size_continuous(name = "Incident count", range = c(2, 6)) + labs( title = "Annual Mean Sentiment Tone of Incident Descriptions (AFINN)", subtitle = "Negative scores indicate stronger blame, urgency, or severity language in field narratives", y = "Mean AFINN Sentiment Score", x = NULL ) + theme_minimal(base_size = 12) + theme( plot.title = element_text(face = "bold"), panel.grid.minor = element_blank() ) ``` ## Python ```{r prep-python-handoff, echo=FALSE} #| message: false #| warning: false incidents_py <- incidents %>% mutate( date_incident_str = as.character(date_incident), date_observed_str = as.character(date_observed), jiv_date_str = as.character(jiv_date), year_month_str = as.character(year_month) ) %>% dplyr::select(-date_incident, -date_observed, -jiv_date, -year_month) ``` ```{python tech1-py} #| cache: true import pandas as pd import numpy as np import matplotlib.pyplot as plt import matplotlib.ticker as mticker import seaborn as sns from sklearn.feature_extraction.text import TfidfVectorizer from sklearn.decomposition import LatentDirichletAllocation from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer incidents_py = r.incidents_py docs = (incidents_py .dropna(subset=["description"]) .query("description.str.len() > 15", engine="python") .copy()) print(f"Documents available for text analysis: {len(docs):,}") # ── Step 1: TF-IDF vectorisation ───────────────────────────────── vec = TfidfVectorizer( stop_words = "english", ngram_range = (1, 2), max_df = 0.90, min_df = 3 ) X = vec.fit_transform(docs["description"].astype(str)) feature_names = vec.get_feature_names_out() mean_tfidf = pd.Series(X.mean(axis=0).A1, index=feature_names) top20 = mean_tfidf.sort_values(ascending=False).head(20).reset_index() top20.columns = ["term", "mean_tfidf"] fig, ax = plt.subplots(figsize=(9, 6)) sns.barplot(data=top20, x="mean_tfidf", y="term", palette="Blues_r", ax=ax) ax.set_title("Top 20 Distinctive Terms Across All Incident Descriptions\n(TF-IDF, unigrams and bigrams)", fontsize=12, fontweight="bold") ax.set_xlabel("Mean TF-IDF Score") ax.set_ylabel(None) plt.tight_layout() plt.show() # ── Step 2: LDA topic model ─────────────────────────────────────── lda_py = LatentDirichletAllocation( n_components = 5, random_state = 42, max_iter = 20 ) lda_py.fit(X) fig, axes = plt.subplots(2, 3, figsize=(14, 8)) axes = axes.flatten() topic_names = [ "Pipeline Breach & Physical Damage", "Production Deferral Impact", "Third-Party Interference", "Spill Response & Remediation", "Corrosion & Integrity Failure" ] for i, (comp, ax) in enumerate(zip(lda_py.components_, axes[:5])): top_idx = comp.argsort()[-10:][::-1] top_terms = [feature_names[j] for j in top_idx] top_vals = comp[top_idx] sns.barplot(x=top_vals, y=top_terms, palette="Set2", ax=ax) ax.set_title(f"Topic {i+1}: {topic_names[i]}", fontsize=9, fontweight="bold") ax.set_xlabel("Component Weight") ax.set_ylabel(None) axes[5].set_visible(False) plt.suptitle("LDA Topic Model: Top Terms per Topic (Python, sklearn)", fontsize=12, fontweight="bold", y=1.01) plt.tight_layout() plt.show() # ── Step 3: VADER sentiment trend ──────────────────────────────── sia = SentimentIntensityAnalyzer() docs = docs.assign( compound = docs["description"].astype(str).map( lambda s: sia.polarity_scores(s)["compound"] ) ) trend = (docs.groupby("year")["compound"] .agg(["mean", "count"]) .reset_index() .rename(columns={"mean": "mean_compound", "count": "n"})) fig, ax = plt.subplots(figsize=(9, 5)) ax.axhline(0, linestyle="--", color="grey", linewidth=0.8) ax.plot(trend["year"], trend["mean_compound"], marker="o", linewidth=1.8, color="#2c7fb8") for _, row in trend.iterrows(): ax.annotate(f"{row['mean_compound']:.2f}", (row["year"], row["mean_compound"]), textcoords="offset points", xytext=(0, 8), fontsize=9, color="#2c7fb8", fontweight="bold") ax.set_title("Annual Mean VADER Sentiment Score of Incident Descriptions", fontsize=12, fontweight="bold") ax.set_xlabel(None) ax.set_ylabel("Mean VADER Compound Score") ax.xaxis.set_major_locator(mticker.MultipleLocator(1)) plt.tight_layout() plt.show() print("\nSentiment trend by year:") print(trend.to_string(index=False)) ``` ::: --- ## 5.4 Management Interpretation **What the TF-IDF analysis reveals** The TF-IDF analysis surfaces the words that most distinctively characterise each incident category. For the Sabotage category, terms relating to physical breach, cutting, and tapping dominate. This distinction matters for response planning: - Language associated with mechanically sophisticated interventions implies organised activity requiring security engagement - Language associated with opportunistic theft taps implies community-level economic desperation requiring social intervention The structured cause field collapses both patterns into the same category. The text field separates them, providing the basis for a more targeted response strategy in 2026. **What the topic model reveals** The five topics extracted by the LDA model do not map neatly onto the five categories in the structured cause field. This is the key finding of this section. Three observations are worth highlighting: - The model identifies a topic characterised by spill response and remediation language that cuts across multiple cause categories. Incidents of very different types trigger the same operational response vocabulary, suggesting that response protocols are cause-agnostic in practice even when the underlying causes differ significantly - The third-party interference topic clusters language around physical installation and cutting activity at valve points and flowlines — a pattern consistent with the dominant sabotage mechanism identified in Section 4 but now confirmed from unstructured text independently of the structured cause field - For the 2026 risk register, response capacity planning should be sized by expected response burden per topic cluster, not by incident count alone **What the sentiment trend reveals** The AFINN and VADER sentiment scores measure the emotional tone of the language field engineers use when describing incidents. A consistently negative score across all years is expected given that these are incident reports. What matters operationally is the direction of the trend over time: - A worsening trend toward more negative scores signals that field personnel are describing incidents in increasingly severe or urgent terms - This may be an early indicator of deteriorating operational conditions that has not yet surfaced in the quantitative fields such as cost or volume - The direction and magnitude of the trend should inform whether the 2026 emergency response budget should be held flat or increased relative to 2025 **Recommendation feeding into Section 10** The text analysis supports three specific inputs to the 2026 operational risk register: - The LDA topic most associated with third-party interference should be mapped to the facilities with the highest topic prevalence scores, and those facilities elevated in the security engagement programme - The response-and-remediation topic should be used to size boom and vessel inventory independent of cause category - Any year-on-year worsening in sentiment score should trigger a formal review of emergency response protocols before the annual operating plan is finalised # 6. Technique 2: Monte Carlo Simulation ## 6.1 Technique Overview Monte Carlo simulation is a mathematical risk assessment technique that approximates the probability of uncertain outcomes by running thousands of computerised simulations of random variables (Marr, 2017). Rather than producing a single-point estimate, which implies a false precision, the technique substitutes probability distributions for each uncertain input and repeats the calculation thousands of times, each run drawing randomly from those distributions. The result is not one answer but a full distribution of possible outcomes, from which percentile values can be extracted to communicate risk in probabilistic terms. The technique is especially well-suited to the oil and gas sector, where cost estimation is inherently uncertain and the consequences of under-budgeting are severe (Marr, 2017). Three probability distributions are applied in this analysis, each chosen to match the statistical properties of the underlying data: - `r icon(ic_database)` **Poisson distribution for annual incident count:** The Poisson distribution models the number of events occurring in a fixed time interval when events are independent and arrive at a constant average rate. Annual pipeline incident counts satisfy these conditions: incidents across different facilities are broadly independent, and the historical mean count provides a stable rate parameter (lambda) - `r icon(ic_chartbar)` **Lognormal distribution for per-incident spill volume:** Spill volumes are strictly positive and exhibit strong right skew, with a median of 1 barrel but a maximum exceeding 5,000 barrels. The lognormal distribution, obtained by fitting a normal distribution to the log-transformed volumes, accommodates this skew naturally and prevents the simulation from generating negative or implausibly large values - `r icon(ic_shield)` **Gamma distribution for per-incident remediation cost:** Remediation costs are also strictly positive and right-skewed, but with a heavier tail than spill volumes. The Gamma distribution, parameterised by shape and rate, is the standard choice for modelling positive continuous quantities with this profile and is widely used in insurance and project risk modelling The simulation runs 10,000 iterations. In each iteration: 1. An annual incident count is drawn from the Poisson distribution 2. For each simulated incident, a spill volume and a remediation cost are independently drawn from their respective fitted distributions 3. The annual totals are summed The resulting 10,000 annual cost estimates form an empirical distribution from which 10th percentile (P10), 50th percentile (P50), and 90th percentile (P90) values are extracted. --- ## 6.2 Operational Justification The Project Services Department is required to submit risk-adjusted cost estimates to management for inclusion in the annual operating plan and Board reporting pack. The current practice relies on historical averages and engineering judgement to size the remediation budget reserve — a single-point estimate that provides no information about the probability of that estimate being correct or the magnitude of potential overrun. Monte Carlo simulation addresses this gap directly in three ways: - `r icon(ic_database)` **Evidence-based distributions:** Rather than assuming costs follow a normal distribution, the simulation fits distributions to the actual historical data. A lognormal is used for spill volumes and a Gamma for remediation costs, ensuring the model reflects the true shape of OERNL's cost exposure including the heavy right tail driven by rare catastrophic spills - `r icon(ic_chartbar)` **Defensible budget reserve:** The P90 output answers the specific question management needs before the plan is submitted: what single budget figure, if held in reserve, would be sufficient in 90 out of every 100 years? This is a more honest and more useful number than a point estimate - `r icon(ic_shield)` **Tornado sensitivity analysis:** By re-running the simulation with each input held fixed at its mean while the others vary, a tornado chart identifies which uncertain input drives the most variance in the outcome. This tells the department where tighter data collection or risk controls would have the greatest impact on forecast accuracy This section addresses sub-question (b) of the central analytical question: what cleanup and remediation budget reserve should be recommended at P90 confidence level for the 2026 capital plan? --- ## 6.3 Analysis and Outputs ```{r tech2-setup, echo=FALSE} #| message: false #| warning: false library(fitdistrplus) library(mc2d) ``` ::: {.panel-tabset} ## R ```{r tech2-r, fig.width=9, fig.height=6} #| cache: true # ── Step 1: Extract fitted input data ──────────────────────────── qty <- incidents$qty_spilled_bbl[ !is.na(incidents$qty_spilled_bbl) & incidents$qty_spilled_bbl > 0 ] cost <- incidents$cost_usd[ !is.na(incidents$cost_usd) & incidents$cost_usd > 0 ] incidents_per_year <- incidents %>% count(year) %>% pull(n) # ── Step 2: Fit distributions ───────────────────────────────────── fit_qty <- fitdist(log(qty), "norm") fit_cost <- fitdist(cost, "gamma", lower = c(0, 0)) lambda <- mean(incidents_per_year) # ── Step 2b: Distribution fitting summary table ─────────────────── tibble::tibble( Parameter = c( "Non-zero spill volumes for fitting", "Non-zero cost records for fitting", "Poisson lambda (mean annual incidents)", "Lognormal mean (log scale)", "Lognormal SD (log scale)", "Gamma shape", "Gamma rate" ), Value = c( length(qty), length(cost), round(lambda, 1), round(fit_qty$estimate["mean"], 4), round(fit_qty$estimate["sd"], 4), round(fit_cost$estimate["shape"], 4), round(fit_cost$estimate["rate"], 6) ) ) %>% kableExtra::kable(caption = "Distribution Fitting Summary — Monte Carlo Inputs") %>% kableExtra::kable_styling( bootstrap_options = c("striped", "condensed"), full_width = FALSE ) %>% kableExtra::column_spec(1, bold = TRUE) # ── Step 3: Plot fitted distributions against empirical data ────── par(mfrow = c(1, 2)) denscomp(fit_qty, main = "Spill Volume: Lognormal Fit (log scale)") denscomp(fit_cost, main = "Remediation Cost: Gamma Fit") par(mfrow = c(1, 1)) # ── Step 4: Run Monte Carlo simulation — 10,000 iterations ─────── set.seed(42) N <- 10000 sim <- replicate(N, { k <- rpois(1, lambda = lambda) if (k == 0) return(c(spill_bbl = 0, cost_usd = 0)) spills <- exp(rnorm(k, mean = fit_qty$estimate["mean"], sd = fit_qty$estimate["sd"])) costs <- rgamma(k, shape = fit_cost$estimate["shape"], rate = fit_cost$estimate["rate"]) c(spill_bbl = sum(spills), cost_usd = sum(costs)) }) %>% t() %>% as_tibble() # ── Step 5: P10 / P50 / P90 summary table ──────────────────────── q <- quantile(sim$cost_usd, c(0.10, 0.50, 0.90)) percentile_table <- tibble( Percentile = c("P10 (Optimistic)", "P50 (Most Likely)", "P90 (Conservative Reserve)"), Annual_Cost = scales::dollar(q), Interpretation = c( "In 1-in-10 low-incident years the total bill stays below this figure", "Half of all simulated years produce costs below this — the central estimate", "Hold this amount in reserve: costs exceed this in only 1 in 10 years" ) ) kableExtra::kable(percentile_table, col.names = c("Percentile", "Annual Cost (USD)", "Plain-Language Meaning"), caption = "Table MC1: Monte Carlo Simulation — P10 / P50 / P90 Cost Outcomes") %>% kableExtra::kable_styling( bootstrap_options = c("striped", "hover", "condensed"), full_width = TRUE ) %>% kableExtra::column_spec(3, italic = TRUE) # ── Step 6: Histogram of simulated annual costs ─────────────────── ggplot(sim, aes(x = cost_usd)) + geom_histogram(bins = 80, fill = "#2c7fb8", colour = "white", linewidth = 0.2) + geom_vline(xintercept = q["10%"], linetype = "dashed", colour = "#27ae60", linewidth = 0.9) + geom_vline(xintercept = q["50%"], linetype = "dashed", colour = "#e67e22", linewidth = 0.9) + geom_vline(xintercept = q["90%"], linetype = "dashed", colour = "#e05c2a", linewidth = 1.2) + annotate("text", x = q["10%"], y = Inf, vjust = 1.5, hjust = -0.1, label = paste0("P10\n", scales::dollar(q["10%"])), colour = "#27ae60", size = 3.2, fontface = "bold") + annotate("text", x = q["50%"], y = Inf, vjust = 1.5, hjust = -0.1, label = paste0("P50\n", scales::dollar(q["50%"])), colour = "#e67e22", size = 3.2, fontface = "bold") + annotate("text", x = q["90%"], y = Inf, vjust = 1.5, hjust = -0.1, label = paste0("P90\n", scales::dollar(q["90%"])), colour = "#e05c2a", size = 3.2, fontface = "bold") + scale_x_continuous(labels = scales::dollar_format()) + labs( title = "Monte Carlo Simulation: Distribution of Simulated Annual Remediation Costs", subtitle = glue::glue("10,000 simulation runs | Lambda = {round(lambda,1)} incidents/year | P90 reserve shown in orange-red"), x = "Simulated Annual Remediation Cost (USD)", y = "Number of Simulation Runs" ) + theme_minimal(base_size = 12) + theme( plot.title = element_text(face = "bold"), panel.grid.minor = element_blank() ) # ── Step 7: Tornado chart — sensitivity analysis ────────────────── set.seed(42) run_fixed <- function(fix_qty = FALSE, fix_cost = FALSE, fix_lambda = FALSE) { replicate(N, { k <- if (fix_lambda) round(lambda) else rpois(1, lambda = lambda) if (k == 0) return(0) sp <- if (fix_qty) rep(exp(fit_qty$estimate["mean"]), k) else exp(rnorm(k, fit_qty$estimate["mean"], fit_qty$estimate["sd"])) co <- if (fix_cost) rep(fit_cost$estimate["shape"] / fit_cost$estimate["rate"], k) else rgamma(k, shape = fit_cost$estimate["shape"], rate = fit_cost$estimate["rate"]) sum(co) }) %>% quantile(0.90) } base_p90 <- q["90%"] p90_fix_qty <- run_fixed(fix_qty = TRUE) p90_fix_cost <- run_fixed(fix_cost = TRUE) p90_fix_lam <- run_fixed(fix_lambda = TRUE) tornado_df <- tibble( Input = c("Per-incident spill volume\n(Lognormal)", "Per-incident remediation cost\n(Gamma)", "Annual incident count\n(Poisson)"), P90_reduction = c( base_p90 - p90_fix_qty, base_p90 - p90_fix_cost, base_p90 - p90_fix_lam ) ) %>% arrange(desc(P90_reduction)) %>% mutate(Input = forcats::fct_reorder(Input, P90_reduction)) ggplot(tornado_df, aes(x = P90_reduction, y = Input, fill = P90_reduction > 0)) + geom_col(width = 0.5) + geom_text(aes(label = scales::dollar(P90_reduction, accuracy = 1)), hjust = -0.15, fontface = "bold", size = 3.5) + scale_fill_manual(values = c("TRUE" = "#e05c2a", "FALSE" = "#2c7fb8")) + scale_x_continuous(labels = scales::dollar_format(), expand = expansion(mult = c(0, 0.2))) + labs( title = "Tornado Chart: Reduction in P90 Cost When Each Input is Held Fixed", subtitle = "The largest bar identifies the input driving the most uncertainty in the P90 reserve estimate", x = "Reduction in P90 Annual Cost (USD)", y = NULL ) + theme_minimal(base_size = 12) + theme( legend.position = "none", plot.title = element_text(face = "bold"), panel.grid.major.y = element_blank() ) ``` ## Python ```{python tech2-py, fig.width=9, fig.height=6} #| cache: true import numpy as np import pandas as pd import matplotlib.pyplot as plt import matplotlib.ticker as mticker from scipy import stats inc = r.incidents_py qty = pd.to_numeric(inc["qty_spilled_bbl"], errors="coerce").dropna() qty = qty[qty > 0].values cost = pd.to_numeric(inc["cost_usd"], errors="coerce").dropna() cost = cost[cost > 0].values lam = inc.groupby("year").size().mean() print(f"Non-zero spill volumes: {len(qty):,}") print(f"Non-zero cost records: {len(cost):,}") print(f"Mean annual incidents (lambda): {lam:.1f}") # ── Step 1: Fit distributions ───────────────────────────────────── mu_log = np.log(qty).mean() sd_log = np.log(qty).std() shape, loc_g, scale_g = stats.gamma.fit(cost, floc=0) print(f"\nLognormal fit — mu: {mu_log:.3f}, sigma: {sd_log:.3f}") print(f"Gamma fit — shape: {shape:.3f}, scale: {scale_g:.3f}") # ── Step 2: Plot fitted distributions ──────────────────────────── fig, axes = plt.subplots(1, 2, figsize=(12, 4)) x_qty = np.linspace(qty.min(), np.percentile(qty, 99), 300) axes[0].hist(qty, bins=60, density=True, color="#a8c8e8", edgecolor="white", alpha=0.7, label="Observed") axes[0].plot(x_qty, stats.lognorm.pdf(x_qty, s=sd_log, scale=np.exp(mu_log)), color="#2c7fb8", linewidth=2, label="Lognormal fit") axes[0].set_title("Spill Volume: Lognormal Fit", fontweight="bold") axes[0].set_xlabel("Barrels") axes[0].legend() x_cost = np.linspace(cost.min(), np.percentile(cost, 99), 300) axes[1].hist(cost, bins=40, density=True, color="#a8c8e8", edgecolor="white", alpha=0.7, label="Observed") axes[1].plot(x_cost, stats.gamma.pdf(x_cost, a=shape, loc=0, scale=scale_g), color="#e05c2a", linewidth=2, label="Gamma fit") axes[1].set_title("Remediation Cost: Gamma Fit", fontweight="bold") axes[1].set_xlabel("USD") axes[1].legend() plt.suptitle("Fitted Probability Distributions for Monte Carlo Inputs", fontsize=12, fontweight="bold") plt.tight_layout() plt.show() # ── Step 3: Monte Carlo simulation — 10,000 runs ───────────────── rng = np.random.default_rng(42) N = 10_000 sim_cost = [] sim_spill = [] for _ in range(N): k = rng.poisson(lam) if k == 0: sim_cost.append(0) sim_spill.append(0) else: spills = stats.lognorm.rvs(s=sd_log, scale=np.exp(mu_log), size=k, random_state=rng) costs = stats.gamma.rvs(a=shape, loc=0, scale=scale_g, size=k, random_state=rng) sim_cost.append(costs.sum()) sim_spill.append(spills.sum()) sim_cost = pd.Series(sim_cost) sim_spill = pd.Series(sim_spill) # ── Step 4: P10 / P50 / P90 table ──────────────────────────────── p10, p50, p90 = sim_cost.quantile([0.10, 0.50, 0.90]) summary = pd.DataFrame({ "Percentile" : ["P10 (Optimistic)", "P50 (Most Likely)", "P90 (Reserve)"], "Annual Cost" : [f"${p10:,.0f}", f"${p50:,.0f}", f"${p90:,.0f}"] }) print("\nMonte Carlo P10 / P50 / P90 Results:") print(summary.to_string(index=False)) # ── Step 5: Histogram with percentile markers ───────────────────── fig, ax = plt.subplots(figsize=(10, 5)) ax.hist(sim_cost, bins=80, color="#2c7fb8", edgecolor="white", linewidth=0.2, label="Simulated annual cost") for val, label, col in zip([p10, p50, p90], ["P10", "P50", "P90"], ["#27ae60", "#e67e22", "#e05c2a"]): ax.axvline(val, linestyle="--", color=col, linewidth=1.5) ax.text(val, ax.get_ylim()[1] * 0.92, f"{label}\n${val:,.0f}", color=col, fontsize=8, fontweight="bold", ha="left") ax.xaxis.set_major_formatter(mticker.FuncFormatter( lambda x, _: f"${x:,.0f}" )) ax.set_title( "Monte Carlo Simulation: Simulated Annual Remediation Cost Distribution\n(10,000 runs, Python)", fontsize=11, fontweight="bold") ax.set_xlabel("Annual Remediation Cost (USD)") ax.set_ylabel("Simulation Runs") plt.tight_layout() plt.show() # ── Step 6: Tornado chart ───────────────────────────────────────── def run_p90_fixed(fix_qty=False, fix_cost=False, fix_lam=False, seed=99): rng2 = np.random.default_rng(seed) results = [] for _ in range(N): k = round(lam) if fix_lam else rng2.poisson(lam) if k == 0: results.append(0) continue sp = (np.full(k, np.exp(mu_log)) if fix_qty else stats.lognorm.rvs(s=sd_log, scale=np.exp(mu_log), size=k, random_state=rng2)) co = (np.full(k, shape * scale_g) if fix_cost else stats.gamma.rvs(a=shape, loc=0, scale=scale_g, size=k, random_state=rng2)) results.append(co.sum()) return pd.Series(results).quantile(0.90) p90_base = p90 p90_fix_qty = run_p90_fixed(fix_qty=True) p90_fix_cost = run_p90_fixed(fix_cost=True) p90_fix_lam = run_p90_fixed(fix_lam=True) tornado = pd.DataFrame({ "Input": ["Per-incident spill volume\n(Lognormal)", "Per-incident remediation cost\n(Gamma)", "Annual incident count\n(Poisson)"], "P90_reduction": [p90_base - p90_fix_qty, p90_base - p90_fix_cost, p90_base - p90_fix_lam] }).sort_values("P90_reduction", ascending=True) fig, ax = plt.subplots(figsize=(9, 4)) colors = ["#e05c2a" if v > 0 else "#2c7fb8" for v in tornado["P90_reduction"]] ax.barh(tornado["Input"], tornado["P90_reduction"], color=colors, height=0.5) for i, (val, _) in enumerate( zip(tornado["P90_reduction"], tornado["Input"])): ax.text(val + abs(tornado["P90_reduction"].max()) * 0.02, i, f"${val:,.0f}", va="center", fontsize=9, fontweight="bold") ax.set_title("Tornado Chart: Which Input Drives the Most P90 Uncertainty?", fontsize=11, fontweight="bold") ax.set_xlabel("Reduction in P90 Annual Cost (USD)") ax.xaxis.set_major_formatter( mticker.FuncFormatter(lambda x, _: f"${x:,.0f}")) plt.tight_layout() plt.show() ``` ::: --- ## 6.4 Management Interpretation **What the simulation is doing in plain terms** Think of the simulation as running 10,000 imaginary versions of the next year at OERNL. In each version, the number of incidents, the volume of each spill, and the cost of each cleanup are all allowed to vary randomly but within the bounds that history says is realistic. Some of those imaginary years are quiet, with few incidents and low costs. Others are expensive, with many incidents and at least one large spill. When all 10,000 years are lined up from cheapest to most expensive, the bill at any chosen probability level can be read directly from the distribution. **What the three numbers mean** The simulation produces three numbers that each serve a different planning purpose: - **P10 (Optimistic):** The cost in a genuinely good year where incidents are fewer than average and none are catastrophic. Costs fall below this level in only one in ten simulated years. It is not a sensible number to budget against because it assumes conditions better than what history typically delivers - **P50 (Most Likely):** The most honest central estimate. Half of all simulated years cost less than this and half cost more. If a single number is required for a planning document, this is the right one to use - **P90 (Conservative Reserve):** The recommended budget reserve. Holding this amount in the remediation contingency fund means that in nine out of every ten years the actual bill will be fully covered. Costs exceed this level only in genuinely extreme years, the kind driven by a large sabotage event coinciding with difficult weather or extended shutdown Presenting the P90 figure to the Board as the contingency reserve is a more defensible position than any single-point estimate, because it comes with an explicit and honest statement of the probability attached to it. **What the tornado chart tells us** The tornado chart identifies which of the three uncertain inputs is responsible for the most spread in the annual cost estimate: - The input with the longest bar is the one that, if better controlled or better measured, would most reduce uncertainty in the budget - If per-incident remediation cost is the dominant driver, the priority action is improving cost capture in JIV reports so future simulations are fitted on a broader and more representative cost sample - If annual incident count is the dominant driver, the priority action is operational controls that reduce incident frequency rather than data collection improvements This finding has a direct implication for where the department should invest its data quality effort ahead of the 2027 planning cycle. **Recommendation feeding into Section 10** The P90 figure from this simulation is the recommended remediation contingency reserve for inclusion in the 2026 annual operating plan. The tornado chart identifies the single input that most warrants tighter data collection or operational control heading into 2026. Both outputs are carried forward into the integrated recommendation in Section 10. # 7. Technique 3: Advanced Forecasting with Prophet ## 7.1 Technique Overview Time series analysis is a forecasting methodology that seeks to predict what will happen in the future based on what happened in the past, by identifying patterns in data collected at uniformly spaced time intervals (Marr, 2017). The patterns that time series methods seek to identify fall into two categories: - **Trends:** Systematic directional changes over time that do not repeat - **Seasonality:** Patterns that repeat at regular intervals such as monthly or annual cycles Prophet, developed by Taylor and Letham (2018) at Meta, is an advanced decomposable time series model built on an additive regression framework. It decomposes the observed series into three components: - `r icon(ic_wave)` **Trend component:** A piecewise linear or logistic growth curve that models the overall direction of the series over time, with automatic detection of changepoints where the trend rate shifts - `r icon(ic_calendar)` **Seasonality component:** Fourier series representations of periodic patterns at annual, weekly, and daily frequencies, fitted to the data without requiring the analyst to pre-specify the seasonal period - `r icon(ic_chartbar)` **Error component:** The residual variation not explained by trend or seasonality, assumed to be normally distributed with mean zero Prophet's key advantage over classical approaches such as Autoregressive Integrated Moving Average (ARIMA) is its robustness to missing data, outliers, and structural breaks in the trend. All of these are characteristic of pipeline incident data, where reporting gaps, operational shutdowns, and one-off crisis periods can distort the series. Prophet also requires no manual stationarity transformation or differencing, making it more accessible to non-specialist analysts while remaining statistically rigorous. Model reliability is assessed through walk-forward cross-validation, a procedure in which the model is repeatedly trained on an expanding window of historical data and evaluated on a held-out forecast horizon. The Root Mean Square Error (RMSE) across folds provides an honest estimate of forecast accuracy on unseen data, rather than in-sample fit. --- ## 7.2 Operational Justification Monthly pipeline incident counts at OERNL follow a time series structure: they are collected at uniformly spaced monthly intervals, they exhibit year-on-year trends driven by operational changes and security conditions, and they are likely to contain seasonal patterns linked to weather, community activity, and regulatory inspection cycles in the Niger Delta. Three operational decisions make a six-month forward forecast directly useful for the 2026 planning cycle: - `r icon(ic_wave)` **Emergency response capacity sizing:** The forecast provides a quantified expectation of monthly incident volumes over the first half of 2026, which determines the minimum manpower, containment boom inventory, and recovery vessel deployment that should be in place at the start of the year - `r icon(ic_calendar)` **Contractor mobilisation scheduling:** If the forecast identifies a peak period in Q2 or Q3 2026, contractor mobilisation can be advanced by the required lead time rather than being reactive to the actual incident spike - `r icon(ic_check)` **Budget phasing:** Monthly forecast confidence intervals allow the operations budget to be phased across the year in line with expected activity, rather than being allocated uniformly across twelve months regardless of seasonal variation This section addresses sub-question (c) of the central analytical question: what is the projected monthly incident volume over the next six months, and what does this imply for 2026 emergency response capacity? --- ## 7.3 Analysis and Outputs ```{r tech3-setup, echo=FALSE} #| message: false #| warning: false library(prophet) library(forecast) ``` ::: {.panel-tabset} ## R ```{r tech3-r, fig.width=10, fig.height=6} #| cache: true # ── Step 1: Aggregate to monthly time series ───────────────────── monthly <- incidents %>% filter(!is.na(year_month)) %>% group_by(year_month) %>% summarise( spill_bbl = sum(qty_spilled_bbl, na.rm = TRUE), n_incidents = n(), .groups = "drop" ) %>% arrange(year_month) tibble::tibble( Parameter = c( "Monthly periods available", "Date range start", "Date range end", "Mean monthly incidents", "Max monthly incidents" ), Value = c( nrow(monthly), format(min(monthly$year_month), "%b %Y"), format(max(monthly$year_month), "%b %Y"), round(mean(monthly$n_incidents), 1), max(monthly$n_incidents) ) ) %>% kableExtra::kable(caption = "Monthly Incident Series Summary") %>% kableExtra::kable_styling( bootstrap_options = c("striped", "condensed"), full_width = FALSE ) %>% kableExtra::column_spec(1, bold = TRUE) # ── Step 2: Plot raw monthly series ────────────────────────────── ggplot(monthly, aes(x = year_month, y = n_incidents)) + geom_line(colour = "#2c7fb8", linewidth = 0.8) + geom_point(colour = "#2c7fb8", size = 1.5) + geom_smooth(method = "loess", se = TRUE, colour = "#e05c2a", fill = "#f4c6b0", linewidth = 0.8, alpha = 0.3) + scale_x_date(date_breaks = "6 months", date_labels = "%b %Y") + labs( title = "Monthly Pipeline Incident Count (2019 to 2025)", subtitle = "Raw series with LOESS trend overlay — input to Prophet forecasting model", x = NULL, y = "Number of Incidents" ) + theme_minimal(base_size = 12) + theme( plot.title = element_text(face = "bold"), axis.text.x = element_text(angle = 45, hjust = 1), panel.grid.minor = element_blank() ) # ── Step 3: Fit Prophet model ───────────────────────────────────── df_p <- monthly %>% transmute(ds = as.POSIXct(year_month), y = n_incidents) mod <- prophet( df_p, yearly.seasonality = TRUE, weekly.seasonality = FALSE, daily.seasonality = FALSE, interval.width = 0.90 ) # ── Step 4: Generate 6-month forecast ──────────────────────────── fut <- make_future_dataframe(mod, periods = 6, freq = "month") fcst <- predict(mod, fut) p1 <- plot(mod, fcst) + ggtitle("Monthly Incident Count: Prophet Forecast (6-Month Horizon)") + labs(subtitle = "Black dots = observed | Blue line = fitted and forecast | Shaded band = 90% prediction interval") + theme_minimal(base_size = 11) + theme(plot.title = element_text(face = "bold")) print(p1) prophet_plot_components(mod, fcst) # ── Step 5: Extract and display forecast table ──────────────────── forecast_out <- fcst %>% filter(ds > max(df_p$ds)) %>% transmute( Month = format(as.Date(ds), "%B %Y"), Forecast = round(yhat, 1), Lower_90 = round(yhat_lower, 1), Upper_90 = round(yhat_upper, 1) ) kableExtra::kable(forecast_out, col.names = c("Month", "Forecast (incidents)", "Lower 90% CI", "Upper 90% CI"), caption = "Table FC1: Six-Month Forward Forecast of Monthly Incident Count") %>% kableExtra::kable_styling( bootstrap_options = c("striped", "hover", "condensed"), full_width = FALSE ) %>% kableExtra::column_spec(2, bold = TRUE, color = "#2c7fb8") # ── Step 6: Walk-forward cross-validation ──────────────────────── cv <- cross_validation( mod, initial = 365 * 3, period = 90, horizon = 180, units = "days" ) pm <- performance_metrics(cv) cv_summary <- tibble::tibble( Metric = c("Mean RMSE across CV folds", "Mean MAPE across CV folds", "Mean MAE across CV folds"), Value = c( paste0(round(mean(pm$rmse, na.rm = TRUE), 2), " incidents/month"), paste0(round(mean(pm$mape, na.rm = TRUE) * 100, 1), "%"), paste0(round(mean(pm$mae, na.rm = TRUE), 2), " incidents/month") ) ) kableExtra::kable(cv_summary, caption = "Walk-Forward Cross-Validation Performance Summary") %>% kableExtra::kable_styling( bootstrap_options = c("striped", "condensed"), full_width = FALSE ) %>% kableExtra::column_spec(1, bold = TRUE) %>% kableExtra::column_spec(2, bold = TRUE, color = "#2c7fb8") # ── Step 7: Plot RMSE over forecast horizon ─────────────────────── pm %>% mutate(horizon_days = as.numeric(horizon, units = "days")) %>% ggplot(aes(x = horizon_days, y = rmse)) + geom_line(colour = "#2c7fb8", linewidth = 0.9) + geom_point(colour = "#2c7fb8", size = 2) + geom_smooth(method = "loess", se = FALSE, colour = "#e05c2a", linewidth = 0.7, linetype = "dashed") + labs( title = "Walk-Forward Cross-Validation: RMSE by Forecast Horizon", subtitle = "RMSE increasing with horizon is expected — confirms model degrades gracefully rather than catastrophically", x = "Forecast Horizon (Days)", y = "RMSE (Incidents per Month)" ) + theme_minimal(base_size = 12) + theme( plot.title = element_text(face = "bold"), panel.grid.minor = element_blank() ) ``` ## Python ```{python tech3-py, fig.width=10, fig.height=6} #| cache: true import pandas as pd import numpy as np import matplotlib.pyplot as plt import matplotlib.ticker as mticker from prophet import Prophet from prophet.diagnostics import cross_validation, performance_metrics # ── Step 1: Aggregate to monthly series ────────────────────────── inc_py = r.incidents_py inc_py["year_month_dt"] = pd.to_datetime( inc_py["year_month_str"], errors="coerce" ) monthly_py = ( inc_py.dropna(subset=["year_month_dt"]) .groupby("year_month_dt") .agg( spill_bbl = ("qty_spilled_bbl", "sum"), n_incidents = ("ref_no", "count") ) .reset_index() .rename(columns={"year_month_dt": "ds", "n_incidents": "y"}) .sort_values("ds") ) print(f"Monthly periods available: {len(monthly_py)}") print(f"Date range: {monthly_py['ds'].min().strftime('%b %Y')} " f"to {monthly_py['ds'].max().strftime('%b %Y')}") print(f"Mean monthly incidents: {monthly_py['y'].mean():.1f}") # ── Step 2: Plot raw monthly series ────────────────────────────── fig, ax = plt.subplots(figsize=(11, 4)) ax.plot(monthly_py["ds"], monthly_py["y"], color="#2c7fb8", linewidth=0.9, marker="o", markersize=3) ax.set_title("Monthly Pipeline Incident Count (2019 to 2025)", fontsize=12, fontweight="bold") ax.set_xlabel(None) ax.set_ylabel("Number of Incidents") ax.xaxis.set_major_locator( plt.matplotlib.dates.MonthLocator(bymonth=[1, 7]) ) ax.xaxis.set_major_formatter( plt.matplotlib.dates.DateFormatter("%b %Y") ) plt.xticks(rotation=45) plt.tight_layout() plt.show() # ── Step 3: Fit Prophet model ───────────────────────────────────── m = Prophet( yearly_seasonality = True, weekly_seasonality = False, daily_seasonality = False, interval_width = 0.90 ) m.fit(monthly_py[["ds", "y"]]) # ── Step 4: Generate 6-month forecast ──────────────────────────── future = m.make_future_dataframe(periods=6, freq="MS") fcst_py = m.predict(future) fig1 = m.plot(fcst_py) fig1.suptitle( "Monthly Incident Count: Prophet Forecast (6-Month Horizon)\n" "Black dots = observed | Blue = fitted/forecast | Shaded = 90% CI", fontsize=10, fontweight="bold", y=1.02) plt.tight_layout() plt.show() fig2 = m.plot_components(fcst_py) fig2.suptitle("Prophet Model Components: Trend and Seasonality", fontsize=10, fontweight="bold", y=1.02) plt.tight_layout() plt.show() # ── Step 5: Forecast table ──────────────────────────────────────── fwd = fcst_py[fcst_py["ds"] > monthly_py["ds"].max()][ ["ds", "yhat", "yhat_lower", "yhat_upper"] ].copy() fwd["ds"] = fwd["ds"].dt.strftime("%B %Y") fwd.columns = ["Month", "Forecast", "Lower 90% CI", "Upper 90% CI"] fwd = fwd.round(1) print("\nSix-Month Forward Forecast:") print(fwd.to_string(index=False)) # ── Step 6: Walk-forward cross-validation ──────────────────────── cv_py = cross_validation( m, initial = "1095 days", period = "90 days", horizon = "180 days" ) pm_py = performance_metrics(cv_py) print("\nWalk-forward CV performance metrics (first rows):") print(pm_py[["horizon", "rmse", "mae", "mape"]] .head(8).to_string(index=False)) print(f"\nMean RMSE across folds: {pm_py['rmse'].mean():.2f} incidents/month") print(f"Mean MAPE across folds: {pm_py['mape'].mean()*100:.1f}%") # ── Step 7: RMSE over horizon plot ─────────────────────────────── pm_py["horizon_days"] = pm_py["horizon"].dt.days fig, ax = plt.subplots(figsize=(9, 4)) ax.plot(pm_py["horizon_days"], pm_py["rmse"], color="#2c7fb8", linewidth=1.2, marker="o", markersize=4) ax.set_title("Walk-Forward Cross-Validation: RMSE by Forecast Horizon", fontsize=11, fontweight="bold") ax.set_xlabel("Forecast Horizon (Days)") ax.set_ylabel("RMSE (Incidents per Month)") plt.tight_layout() plt.show() ``` ::: --- ## 7.4 Management Interpretation **What the forecast is and is not** The Prophet model has learned the historical pattern of monthly pipeline incidents at OERNL — how many tend to occur in a typical month, how the number has trended over seven years, and whether certain times of year consistently produce more or fewer incidents. It uses that learned pattern to project forward six months. It is important to be clear about what this forecast is and is not: - It is not a guarantee of what will happen. Pipeline sabotage in the Niger Delta is driven by social, economic, and security dynamics that no time series model can predict with certainty - It is an evidence-based expectation of what is most likely, combined with an honest statement of the uncertainty around that expectation expressed as a 90% confidence interval - It is most useful as a capacity planning tool rather than a precise prediction, because the upper bound of the confidence interval provides a more appropriate basis for resource sizing than the central estimate **How to read the forecast table** Each row in the forecast table gives three numbers for a future month: - **Central forecast:** The most likely outcome based on historical trend and seasonality patterns - **Lower 90% bound:** The floor that actual incident counts will exceed nine times out of ten - **Upper 90% bound:** The ceiling that actual counts will stay below nine times out of ten A wide band between the lower and upper bounds means the model is less certain about that month. This is honest, not a weakness. A narrow band means the pattern is stable and predictable. The upper bound is the number to use for capacity planning. **What the component plots reveal** The component decomposition plots provide two pieces of operational intelligence: - The **trend component** shows whether incident frequency has been rising, falling, or holding steady over seven years. An upward trend heading into 2026 is a direct signal that standing capacity from 2025 will be insufficient for the year ahead - The **seasonality component** shows which months historically produce elevated incident counts. If the peak months in the seasonality plot align with the highest forecast values in the table, that confirms the peak is structural and predictable. Contractor mobilisation should be timed to arrive before that peak, not after it **What the cross-validation result means** The RMSE from walk-forward cross-validation is the single most important diagnostic output in this section. It measures how wrong the model typically is when forecasting on historical data it has never seen before. A mean RMSE of 15.41 incidents per month means the model's monthly forecast is typically within 15 incidents of the actual count on unseen periods. This figure should be used directly in the capacity planning calculation. If the forecast for a given month is 20 incidents, the response capacity should be sized for approximately 35 incidents at the upper end of the confidence range, not exactly 20. Planning to the central estimate with no buffer is equivalent to planning for a median outcome while accepting a coin-flip chance of being under-resourced. **Recommendation feeding into Section 10** The six-month forecast directly determines the 2026 emergency response capacity recommendation in Section 10. The upper bound of the 90% confidence interval for the peak forecast month is the planning figure that should be used for manpower and equipment sizing. The cost of being under-resourced during a peak incident month exceeds the cost of holding excess capacity during a quiet one. # 8. Technique 4: Asset Survival Analysis ## 8.1 Technique Overview Survival analysis is a statistical method for modelling the time elapsed until an event of interest occurs. Originally developed in clinical and actuarial research, it has been widely adopted in engineering reliability analysis, customer churn analytics, and asset management — anywhere the central question is not simply whether an event will occur, but when (Kleinbaum and Klein, 2012). In the context of asset management, the technique is directly analogous to customer churn analytics: just as customer churn analytics assesses how long a customer stays before leaving (Marr, 2017), asset survival analysis assesses how long a facility remains incident-free between consecutive failures. This analysis applies the technique to inter-arrival times, specifically the number of days between consecutive incidents at the same facility, rather than time to first incident. This approach uses all 1,695 incident records rather than only the first event per facility, producing a richer and more statistically powerful dataset for modelling recurring failure patterns across the portfolio. Two complementary methods are applied: - `r icon(ic_listol)` **Kaplan-Meier (KM) estimator:** A non-parametric method that produces an empirical survival curve showing the probability that a facility remains incident-free beyond any given number of days after a previous incident. The KM estimator makes no assumptions about the underlying failure time distribution, which is appropriate given the diversity of facility types and operating conditions in the OERNL portfolio. Curves are stratified by facility category to reveal which asset classes exhibit the shortest inter-incident times - `r icon(ic_layers)` **Cox Proportional Hazards (Cox PH) model:** A semi-parametric regression model that estimates the effect of covariates — facility category and state — on the hazard rate, defined as the instantaneous risk of the next incident given that no incident has yet occurred since the last one. The model expresses results as hazard ratios: a hazard ratio above 1 for a covariate means that category or state is associated with a higher rate of recurrence relative to the reference group, controlling for all other variables. The proportional hazards assumption is verified using Schoenfeld residuals via the `cox.zph` test **Censoring** is handled correctly throughout. Facilities that have not experienced a subsequent incident by the end of the observation window contribute their observed inter-incident time as a right-censored observation, ensuring no data is discarded. --- ## 8.2 Operational Justification The current approach to pipeline integrity inspection prioritisation at OERNL relies on engineering judgement and periodic field reports, with no systematic basis for comparing recurrence risk across the 446 facilities in the portfolio. Inter-incident survival analysis addresses this gap in three specific ways: - `r icon(ic_listol)` **Recurrence-based prioritisation:** Facilities with the shortest median inter-incident time, those that fail most frequently relative to their own history, are flagged for priority inspection scheduling in the 2026 maintenance programme. This is more operationally meaningful than ranking by total incident count alone, because it identifies facilities where failure is accelerating rather than those that have simply been in service longest - `r icon(ic_layers)` **Covariate-adjusted hazard ranking:** The Cox model identifies which facility categories and states carry statistically significantly higher hazard rates after controlling for all other factors. This allows the maintenance budget to be allocated by evidence-based risk rather than by administrative area or operational familiarity - `r icon(ic_contract)` **Defensible business case:** Hazard ratios with confidence intervals provide statistically defensible evidence when presenting integrity intervention projects to Finance and the Board, replacing subjective risk assertions with quantified, model-derived estimates This section addresses sub-question (d) of the central analytical question: which facilities exhibit the shortest expected time to next incident, and which should be elevated to the top quartile of the 2026 integrity inspection schedule? --- ## 8.3 Analysis and Outputs ```{r tech4-setup, echo=FALSE} #| message: false #| warning: false library(survival) library(broom) # survminer intentionally not loaded — using ggplot2 directly # to avoid PNG device compatibility issues with ggsurvplot ``` ::: {.panel-tabset} ## R ```{r tech4-r, fig.width=10, fig.height=8} #| cache: true if ("survminer" %in% (.packages())) detach("package:survminer", unload = TRUE) # ── Step 1: Build inter-incident time dataset ───────────────────── tte <- incidents %>% filter(!is.na(facility), !is.na(date_incident)) %>% arrange(facility, date_incident) %>% group_by(facility) %>% mutate( tte_days = as.integer(date_incident - lag(date_incident)), event = 1L ) %>% ungroup() %>% filter(!is.na(tte_days), tte_days > 0) tibble::tibble( Metric = c("Inter-incident intervals available", "Unique facilities with repeat incidents", "Median inter-incident time (days)", "Mean inter-incident time (days)"), Value = c(nrow(tte), n_distinct(tte$facility), median(tte$tte_days, na.rm = TRUE), round(mean(tte$tte_days, na.rm = TRUE), 1)) ) %>% kableExtra::kable(caption = "Inter-Incident Time Dataset Summary") %>% kableExtra::kable_styling( bootstrap_options = c("striped", "condensed"), full_width = FALSE ) %>% kableExtra::column_spec(1, bold = TRUE) # ── Step 2: Distribution of inter-incident times ────────────────── ggplot(tte, aes(x = tte_days)) + geom_histogram(bins = 60, fill = "#2c7fb8", colour = "white", linewidth = 0.2) + geom_vline(xintercept = median(tte$tte_days, na.rm = TRUE), colour = "#e05c2a", linetype = "dashed", linewidth = 0.9) + annotate("text", x = median(tte$tte_days, na.rm = TRUE) + 5, y = Inf, vjust = 1.5, hjust = 0, label = paste0("Median = ", median(tte$tte_days, na.rm = TRUE), " days"), colour = "#e05c2a", fontface = "italic", size = 3.5) + scale_x_continuous(limits = c(0, 365), breaks = seq(0, 365, by = 30)) + labs( title = "Distribution of Inter-Incident Times (Days Between Consecutive Incidents)", subtitle = "Truncated at 365 days for readability", x = "Days Between Consecutive Incidents at Same Facility", y = "Number of Intervals" ) + theme_minimal(base_size = 12) + theme( plot.title = element_text(face = "bold"), panel.grid.minor = element_blank() ) # ── Step 3: KM by facility category ────────────────────────────── top_cats <- tte %>% filter(!is.na(facility_cat)) %>% count(facility_cat, sort = TRUE) %>% slice_head(n = 5) %>% pull(facility_cat) tte_cats <- tte %>% filter(facility_cat %in% top_cats) km_cat <- survfit(Surv(tte_days, event) ~ facility_cat, data = tte_cats) km_cat_tidy <- broom::tidy(km_cat) %>% mutate(strata = stringr::str_remove(strata, "facility_cat=")) ggplot(km_cat_tidy, aes(x = time, y = estimate, colour = strata, fill = strata)) + geom_step(linewidth = 0.9) + geom_ribbon(aes(ymin = conf.low, ymax = conf.high), alpha = 0.08, colour = NA) + scale_colour_brewer(palette = "Set1") + scale_fill_brewer(palette = "Set1") + coord_cartesian(xlim = c(0, 500)) + labs( title = "Kaplan-Meier: Time Between Incidents by Facility Category", subtitle = "Steeper curve = shorter inter-incident time = higher recurrence risk", x = "Days Since Previous Incident", y = "Probability of Remaining Incident-Free", colour = "Facility Category", fill = "Facility Category" ) + theme_minimal(base_size = 11) + theme( plot.title = element_text(face = "bold"), legend.position = "bottom" ) # ── Step 4: KM by operational area ─────────────────────────────── tte_area <- tte %>% filter(area %in% c("LAR", "SAR")) km_area <- survfit(Surv(tte_days, event) ~ area, data = tte_area) km_area_tidy <- broom::tidy(km_area) %>% mutate(strata = stringr::str_remove(strata, "area=")) lr_test <- survdiff(Surv(tte_days, event) ~ area, data = tte_area) lr_pval <- round(1 - pchisq(lr_test$chisq, df = 1), 4) ggplot(km_area_tidy, aes(x = time, y = estimate, colour = strata, fill = strata)) + geom_step(linewidth = 0.9) + geom_ribbon(aes(ymin = conf.low, ymax = conf.high), alpha = 0.10, colour = NA) + scale_colour_manual(values = c("LAR" = "#2c7fb8", "SAR" = "#e05c2a")) + scale_fill_manual(values = c("LAR" = "#2c7fb8", "SAR" = "#e05c2a")) + coord_cartesian(xlim = c(0, 500)) + annotate("text", x = 400, y = 0.95, label = paste0("Log-rank p = ", lr_pval), size = 3.5, fontface = "italic", colour = "#333333") + labs( title = "Kaplan-Meier: Time Between Incidents by Operational Area", subtitle = "Log-rank test p-value tests whether survival difference is statistically significant", x = "Days Since Previous Incident", y = "Probability of Remaining Incident-Free", colour = "Area", fill = "Area" ) + theme_minimal(base_size = 11) + theme( plot.title = element_text(face = "bold"), legend.position = "bottom" ) # ── Step 5: Cox Proportional Hazards model ──────────────────────── cox_data <- tte %>% filter(!is.na(facility_cat), !is.na(state), facility_cat %in% top_cats) cox <- coxph( Surv(tte_days, event) ~ facility_cat + state, data = cox_data ) # ── Step 6: Hazard ratio table ──────────────────────────────────── cox_tidy <- broom::tidy(cox, exponentiate = TRUE, conf.int = TRUE) %>% mutate( term = stringr::str_remove(term, "facility_cat|state"), signif = case_when( p.value < 0.001 ~ "***", p.value < 0.01 ~ "**", p.value < 0.05 ~ "*", p.value < 0.10 ~ ".", TRUE ~ "" ) ) %>% dplyr::select(term, estimate, conf.low, conf.high, p.value, signif) %>% mutate(across(c(estimate, conf.low, conf.high), ~ round(.x, 3)), p.value = round(p.value, 4)) kableExtra::kable(cox_tidy, col.names = c("Covariate", "Hazard Ratio", "Lower 95% CI", "Upper 95% CI", "p-value", "Significance"), caption = "Table SA1: Cox PH Model — Hazard Ratios by Facility Category and State") %>% kableExtra::kable_styling( bootstrap_options = c("striped", "hover", "condensed"), full_width = TRUE ) %>% kableExtra::column_spec(2, bold = TRUE, color = "#2c7fb8") %>% kableExtra::footnote( general = "Reference: lowest-risk facility type and state. HR > 1 = higher recurrence hazard. *** p<0.001, ** p<0.01, * p<0.05" ) # ── Step 7: Proportional hazards assumption check ──────────────── ph_test <- cox.zph(cox) ph_df_summary <- as.data.frame(ph_test$table) %>% tibble::rownames_to_column("Covariate") %>% dplyr::rename( `Chi-squared` = chisq, `Degrees of Freedom` = df, `p-value` = p ) %>% dplyr::mutate( `Chi-squared` = round(`Chi-squared`, 3), `p-value` = round(`p-value`, 5), Result = ifelse(`p-value` > 0.05, "Assumption holds", "Potential violation") ) kableExtra::kable(ph_df_summary, caption = "Schoenfeld Residuals: Proportional Hazards Assumption Test") %>% kableExtra::kable_styling( bootstrap_options = c("striped", "condensed"), full_width = FALSE ) %>% kableExtra::column_spec(1, bold = TRUE) %>% kableExtra::column_spec(5, color = ifelse(ph_df_summary$`p-value` > 0.05, "#27ae60", "#e05c2a")) ph_mat <- ph_test$y ph_time <- ph_test$time if (!is.null(ph_mat) && nrow(ph_mat) > 0 && length(ph_time) > 0) { ph_df <- as.data.frame(ph_mat) %>% mutate(time = ph_time) %>% tidyr::pivot_longer( cols = -time, names_to = "covariate", values_to = "residual" ) %>% filter(!is.na(residual), !is.na(time)) print( ggplot(ph_df, aes(x = time, y = residual)) + geom_point(alpha = 0.3, size = 0.8, colour = "#2c7fb8") + geom_smooth(method = "loess", se = TRUE, colour = "#e05c2a", fill = "#f4c6b0", linewidth = 0.8) + geom_hline(yintercept = 0, linetype = "dashed", colour = "grey50") + facet_wrap(~ covariate, scales = "free_y", ncol = 2) + labs( title = "Schoenfeld Residuals: Proportional Hazards Assumption Check", subtitle = "Flat LOESS line = assumption satisfied | Trending line = assumption may be violated", x = "Time", y = "Scaled Schoenfeld Residual" ) + theme_minimal(base_size = 10) + theme(plot.title = element_text(face = "bold")) ) } else { cat("Schoenfeld residual plot skipped — insufficient data.\n") } # ── Step 8: High-risk facility summary table ────────────────────── high_risk <- tte %>% group_by(facility, state, area, facility_cat) %>% summarise( n_intervals = n(), median_tte_days = median(tte_days, na.rm = TRUE), mean_tte_days = round(mean(tte_days, na.rm = TRUE), 1), .groups = "drop" ) %>% filter(n_intervals >= 3) %>% arrange(median_tte_days) %>% slice_head(n = 20) %>% rename( Facility = facility, State = state, Area = area, `Facility Type` = facility_cat, `Repeat Intervals` = n_intervals, `Median Days Between Incidents` = median_tte_days, `Mean Days Between Incidents` = mean_tte_days ) kableExtra::kable(high_risk, caption = "Table SA2: Top 20 Highest-Recurrence Facilities (Minimum 3 Inter-Incident Intervals)") %>% kableExtra::kable_styling( bootstrap_options = c("striped", "hover", "condensed"), full_width = TRUE ) %>% kableExtra::column_spec(6, bold = TRUE, color = "#e05c2a") %>% kableExtra::row_spec(1:5, background = "#fef3f0") ``` ## Python ```{python tech4-py, fig.width=10, fig.height=7} #| cache: true import pandas as pd import numpy as np import matplotlib.pyplot as plt from lifelines import KaplanMeierFitter, CoxPHFitter from lifelines.statistics import logrank_test inc_py = r.incidents_py inc_py["date_incident_dt"] = pd.to_datetime( inc_py["date_incident_str"], errors="coerce" ) tte_py = ( inc_py.dropna(subset=["facility", "date_incident_dt"]) .sort_values(["facility", "date_incident_dt"]) .copy() ) tte_py["prev_incident"] = tte_py.groupby("facility")[ "date_incident_dt"].shift(1) tte_py["tte_days"] = ( tte_py["date_incident_dt"] - tte_py["prev_incident"]).dt.days tte_py = tte_py.dropna(subset=["tte_days"]) tte_py = tte_py[tte_py["tte_days"] > 0].copy() tte_py["event"] = 1 print(f"Inter-incident intervals available: {len(tte_py):,}") print(f"Unique facilities with repeat incidents: {tte_py['facility'].nunique():,}") print(f"Median inter-incident time: {tte_py['tte_days'].median():.0f} days") print(f"Mean inter-incident time: {tte_py['tte_days'].mean():.1f} days") # ── Step 2: KM curves by facility category ──────────────────────── top_cats_py = ( tte_py.dropna(subset=["facility_cat"]) .groupby("facility_cat").size() .sort_values(ascending=False) .head(5).index.tolist() ) fig, ax = plt.subplots(figsize=(10, 5)) kmf_py = KaplanMeierFitter() palette = ["#2c7fb8", "#e05c2a", "#27ae60", "#9b59b6", "#e67e22"] for i, cat in enumerate(top_cats_py): sub = tte_py[tte_py["facility_cat"] == cat] if len(sub) >= 5: kmf_py.fit(sub["tte_days"], event_observed=sub["event"], label=str(cat)) kmf_py.plot_survival_function( ax=ax, ci_show=False, color=palette[i % len(palette)]) ax.set_xlim(0, 500) ax.set_title("Kaplan-Meier: Time Between Incidents by Facility Category", fontsize=12, fontweight="bold") ax.set_xlabel("Days Since Previous Incident") ax.set_ylabel("Probability of Remaining Incident-Free") ax.grid(True, alpha=0.3) ax.legend(title="Facility Category", fontsize=8) plt.tight_layout() plt.show() # ── Step 3: KM by operational area with log-rank test ──────────── lar = tte_py[tte_py["area"] == "LAR"] sar = tte_py[tte_py["area"] == "SAR"] lr = logrank_test( lar["tte_days"], sar["tte_days"], event_observed_A=lar["event"], event_observed_B=sar["event"] ) fig, ax = plt.subplots(figsize=(10, 5)) for area, data, col in [("LAR", lar, "#2c7fb8"), ("SAR", sar, "#e05c2a")]: kmf_a = KaplanMeierFitter() kmf_a.fit(data["tte_days"], event_observed=data["event"], label=area) kmf_a.plot_survival_function(ax=ax, ci_show=True, color=col, alpha=0.8) ax.set_xlim(0, 500) ax.set_title( "Kaplan-Meier: Time Between Incidents by Operational Area (LAR vs SAR)", fontsize=12, fontweight="bold") ax.set_xlabel("Days Since Previous Incident") ax.set_ylabel("Probability of Remaining Incident-Free") ax.text(0.98, 0.98, f"Log-rank p = {lr.p_value:.4f}", transform=ax.transAxes, ha="right", va="top", fontsize=10, bbox=dict(boxstyle="round,pad=0.3", facecolor="white", alpha=0.8)) ax.grid(True, alpha=0.3) ax.legend() plt.tight_layout() plt.show() print(f"\nLog-rank LAR vs SAR: statistic={lr.test_statistic:.3f}, " f"p={lr.p_value:.4f}") # ── Step 4: Cox Proportional Hazards model ──────────────────────── cox_data_py = ( tte_py[["tte_days", "event", "facility_cat", "state"]] .dropna() .query("facility_cat in @top_cats_py") .copy() ) cox_dummies = pd.get_dummies( cox_data_py, columns=["facility_cat", "state"], drop_first=True ) cph = CoxPHFitter() cph.fit(cox_dummies, duration_col="tte_days", event_col="event") print("\nCox PH Model Summary:") cph.print_summary() fig, ax = plt.subplots(figsize=(9, 5)) cph.plot(ax=ax) ax.set_title( "Cox PH Model: Hazard Ratios by Facility Category and State", fontsize=11, fontweight="bold") ax.axvline(0, linestyle="--", color="grey", linewidth=0.8) plt.tight_layout() plt.show() # ── Step 5: High-recurrence facility table ──────────────────────── high_risk_py = ( tte_py.groupby(["facility", "state", "area", "facility_cat"]) .agg( n_intervals = ("tte_days", "count"), median_tte_days = ("tte_days", "median"), mean_tte_days = ("tte_days", "mean") ) .reset_index() .query("n_intervals >= 3") .sort_values("median_tte_days") .head(20) .round(1) ) print("\nTop 20 Highest-Recurrence Facilities:") print(high_risk_py.to_string(index=False)) ``` ::: --- ## 8.4 Management Interpretation **What inter-incident time means and why it matters more than incident count** The total number of incidents at a facility tells you how much has gone wrong historically. The time between incidents tells you how quickly things go wrong again after a repair or intervention. Consider two facilities, each with ten incidents over seven years: - Facility A experiences incidents spread evenly at roughly ten-month intervals — a low but steady recurrence pattern suggesting manageable wear and predictable maintenance needs - Facility B experiences all ten incidents within eighteen months, followed by a long quiet period — a structural vulnerability that burst into recurring failure before being addressed Survival analysis separates these two profiles explicitly. It is Facility B that should command priority in the 2026 inspection programme, not simply the facility with the highest total count. **What the Kaplan-Meier curves reveal** Each curve shows the probability that a facility of a given type or in a given area will remain incident-free as time passes after a previous incident. Reading the curves operationally: - A curve that drops to 50% within 60 days means half of all incidents at that facility type are followed by another incident within two months. This is a structural recurrence pattern - A curve that remains above 80% for 200 or more days means the asset class is generally resilient and re-incidents are relatively infrequent - The gap between the LAR and SAR curves, combined with the log-rank p-value, determines whether one operational area carries a structurally higher recurrence risk than the other and should therefore receive a larger share of inspection resources **What the Cox model adds** The Kaplan-Meier curves show what is happening visually. The Cox model quantifies the effect of each covariate while holding all others constant. Three practical implications follow from the hazard ratio table: - A hazard ratio above 1.0 for a facility category means that asset class re-incidents faster than the reference category, all else being equal. The magnitude of the ratio indicates how much faster - A statistically significant ratio (p less than 0.05) means the difference is unlikely to be explained by chance and can be used to justify differentiated budget allocation - A non-significant ratio means the difference in recurrence rate between that category and the reference is not reliably distinguishable from random variation in the data **What the proportional hazards test confirms** The Schoenfeld residuals test checks whether the hazard ratios are stable over time. A non-significant p-value for each covariate confirms that the ratios hold across the full 2019 to 2025 observation window and can be applied to the 2026 planning horizon with confidence. If any covariate fails the test, its hazard ratio should be interpreted with caution and a time-varying coefficient model considered as an extension in future work. **Recommendation feeding into Section 10** Two concrete outputs from this section feed into the integrated recommendation: - The facilities in the bottom quartile of median inter-incident time from Table SA2 constitute the mandatory inclusion list for the 2026 Q1 integrity inspection programme - The facility category and state with the highest statistically significant hazard ratio from the Cox model should receive a proportionally larger share of the integrity inspection budget relative to lower-hazard asset classes # 9. Technique 5: Association Rules (Operational Co-occurrence Analysis) ## 9.1 Technique Overview Association rule mining is a data mining technique designed to discover interesting relationships, co-occurrence patterns, and dependencies between variables in large transactional datasets (Marr, 2017; Agrawal and Srikant, 1994). Originally developed for market basket analysis — identifying which products customers tend to purchase together — the technique has broad applicability wherever the goal is to find combinations of attributes that co-occur more frequently than chance would predict. In an operational context, it answers the question: given that an incident has a particular cause, location, and facility type, what else is reliably true about it? The Apriori algorithm, the standard implementation used here, works in three stages: 1. **Identify frequent itemsets:** All attribute combinations that meet a minimum frequency threshold (support) are retained. The minimum support of 0.02 used here ensures each rule covers at least 34 of the 1,695 incidents, providing a statistically reliable foundation 2. **Evaluate confidence:** For each frequent itemset, the algorithm assesses how reliably the antecedent attributes predict the consequent attribute. Only combinations above the minimum confidence threshold of 0.60 are retained as rules 3. **Rank by lift:** Rules are ranked by how much more likely the consequent is given the antecedent than it would be by chance alone. Lift is the primary operational metric because it identifies genuinely non-obvious co-occurrences rather than simply frequent ones Three metrics define every rule: - `r icon(ic_database)` **Support:** The proportion of all incidents in which the complete rule, both antecedent and consequent, appears together. A support of 0.05 means the combination appears in at least 5% of all 1,695 incidents. Low support rules may be statistically unreliable and are excluded by the minimum threshold - `r icon(ic_chartbar)` **Confidence:** The conditional probability that the consequent appears given the antecedent. A confidence of 0.80 for the rule {Facility: Pipeline (Oil), Area: LAR} → {Sabotage} means that in 80% of incidents where the facility is an oil pipeline in the Lagos Area, the incident is classified as Sabotage. High confidence identifies reliable operational patterns - `r icon(ic_shield)` **Lift:** The ratio of the observed confidence to the confidence that would be expected if the antecedent and consequent were statistically independent. A lift of 3.0 means the consequent is three times more likely given the antecedent than it would be by chance. A lift of 1.0 means the rule adds no predictive value beyond the base rate The minimum lift threshold of 1.2 and the focus on rules with Sabotage as the consequent are both operationally motivated. The goal is to identify attribute combinations that are meaningfully predictive of the incident type most damaging to production continuity. --- ## 9.2 Operational Justification Sabotage accounts for 74% of all incidents in the OERNL dataset, but not all sabotage incidents are equivalent in their operational impact. The structured fields — cause, location, facility category, area, state — are currently reviewed individually during incident investigations. They have never been analysed simultaneously to identify which combinations of attributes reliably co-occur with high-severity sabotage outcomes. Association rule mining addresses this gap in three ways: - `r icon(ic_brain)` **Security targeting:** Rules with high lift linking specific cause types and locations to Sabotage classification identify the geographic and operational contexts where third-party interference is most concentrated. These rules feed directly into the security patrol schedule and community engagement targeting strategy for 2026 - `r icon(ic_wrench)` **Project scope refinement:** Rules linking facility category and cause type to cleanup requirements or production deferment inform the scope of capital maintenance projects. If a high-lift rule identifies that hacksaw cuts at oil pipelines in Bayelsa reliably require cleanup, that combination should be explicitly scoped in the Kwale Flaredown and Brass remediation project budgets rather than treated as a contingency item - `r icon(ic_clipboard)` **Emergency response protocol design:** Frequent itemsets linking incident type combinations to response actions allow the department to pre-position resources and design differentiated response protocols rather than applying a single generic response to all sabotage incidents regardless of their specific attribute profile This section addresses sub-question (e) of the central analytical question: which combinations of cause type, location, and facility category are most strongly associated with Sabotage classification, and how should these combinations inform the 2026 security and community engagement targeting strategy? --- ## 9.3 Analysis and Outputs ```{r tech5-setup, echo=FALSE} #| message: false #| warning: false library(arules) library(arulesViz) library(visNetwork) ``` ::: {.panel-tabset} ## R ```{r tech5-r, fig.width=10, fig.height=8} #| cache: true #| message: false #| warning: false # ── Step 1: Build transaction dataset ──────────────────────────── tx <- incidents %>% dplyr::transmute( cause = stringr::word(stringr::str_to_title( stringr::str_squish(cause)), 1, 4), facility_cat = as.character(facility_cat), area = as.character(area), state = as.character(state), cleanup = as.character(cleanup_required), category = as.character(category) ) %>% mutate(across(everything(), ~ ifelse(is.na(.x) | .x == "NA", "Unknown", .x))) %>% mutate( cause = paste0("cause=", cause), facility_cat = paste0("fac=", facility_cat), area = paste0("area=", area), state = paste0("state=", state), cleanup = paste0("cleanup=", cleanup), category = paste0("category=", category) ) # ── Step 2: Convert to transactions object ──────────────────────── trans <- as(tx, "transactions") # ── Step 3: Inspect most frequent items ────────────────────────── item_freq <- itemFrequency(trans, type = "relative") top_items <- sort(item_freq, decreasing = TRUE)[1:20] clean_item_label <- function(x) { x %>% stringr::str_replace_all("facility_cat=fac=", "Facility: ") %>% stringr::str_replace_all("cause=cause=", "Cause: ") %>% stringr::str_replace_all("area=area=", "Area: ") %>% stringr::str_replace_all("state=state=", "State: ") %>% stringr::str_replace_all("cleanup=cleanup=", "Cleanup: ") %>% stringr::str_replace_all("category=category=", "Category: ") %>% stringr::str_replace_all("fac=", "Facility: ") %>% stringr::str_replace_all("cause=", "Cause: ") %>% stringr::str_replace_all("area=", "Area: ") %>% stringr::str_replace_all("state=", "State: ") %>% stringr::str_replace_all("cleanup=", "Cleanup: ") %>% stringr::str_replace_all("category=", "Category: ") } top_items_df <- tibble::tibble( item = clean_item_label(names(top_items)), frequency = as.numeric(top_items) ) ggplot(top_items_df, aes(x = frequency, y = forcats::fct_reorder(item, frequency))) + geom_col(fill = "#2c7fb8", width = 0.7) + geom_text(aes(label = scales::percent(frequency, accuracy = 0.1)), hjust = -0.1, size = 3, fontface = "bold") + scale_x_continuous(labels = scales::percent, expand = expansion(mult = c(0, 0.15))) + labs( title = "Top 20 Most Frequent Items Across All Incident Transactions", subtitle = "Item frequency as proportion of all 1,695 incidents", x = "Support (Proportion of Incidents)", y = NULL ) + theme_minimal(base_size = 11) + theme( plot.title = element_text(face = "bold"), panel.grid.major.y = element_blank() ) # ── Step 4: Mine association rules ──────────────────────────────── rules <- apriori( trans, parameter = list( supp = 0.02, conf = 0.60, minlen = 2, maxlen = 4 ), control = list(verbose = FALSE) ) # ── Step 5: Filter to rules with Sabotage as consequent ────────── sabotage_rules <- subset( rules, subset = rhs %pin% "category=Sabotage" & lift > 1.2 ) top_rules <- head(sort(sabotage_rules, by = "lift"), 20) # ── Step 6: Display top rules as clean table ────────────────────── clean_rule_text <- function(x) { x %>% stringr::str_replace_all("facility_cat=fac=", "Facility: ") %>% stringr::str_replace_all("cause=cause=", "Cause: ") %>% stringr::str_replace_all("area=area=", "Area: ") %>% stringr::str_replace_all("state=state=", "State: ") %>% stringr::str_replace_all("cleanup=cleanup=", "Cleanup: ") %>% stringr::str_replace_all("category=category=", "") %>% stringr::str_replace_all("fac=", "Facility: ") %>% stringr::str_replace_all("cause=", "Cause: ") %>% stringr::str_replace_all("area=", "Area: ") %>% stringr::str_replace_all("state=", "State: ") %>% stringr::str_replace_all("cleanup=", "Cleanup: ") %>% stringr::str_replace_all("category=", "") } rules_df <- as(top_rules, "data.frame") %>% dplyr::arrange(desc(lift)) %>% dplyr::mutate( Rule = clean_rule_text(as.character(rules)), Support = round(support, 4), Confidence = round(confidence, 3), Lift = round(lift, 3), Count = as.integer(count) ) %>% dplyr::select(Rule, Support, Confidence, Lift, Count) kableExtra::kable(rules_df, col.names = c("Rule", "Support", "Confidence", "Lift", "Count"), caption = "Table AR1: Top Association Rules — Sabotage as Consequent (Ranked by Lift)") %>% kableExtra::kable_styling( bootstrap_options = c("striped", "hover", "condensed"), full_width = TRUE, font_size = 11 ) %>% kableExtra::column_spec(1, width = "55%") %>% kableExtra::column_spec(4, bold = TRUE, color = "#e05c2a") %>% kableExtra::footnote( general = "Support = proportion of all 1,695 incidents containing the full rule. Confidence = probability of Sabotage given the antecedent conditions. Lift > 1.0 = association stronger than chance." ) # ── Step 7: All rules scatter plot ─────────────────────────────── all_rules_df <- as(rules, "data.frame") %>% dplyr::mutate( is_sabotage = stringr::str_detect(rules, "category=Sabotage") ) ggplot(all_rules_df, aes(x = support, y = confidence, colour = lift, size = lift, alpha = is_sabotage)) + geom_point() + scale_colour_gradient(low = "#a8c8e8", high = "#e05c2a", name = "Lift") + scale_size_continuous(range = c(0.5, 4), name = "Lift") + scale_alpha_manual(values = c("FALSE" = 0.2, "TRUE" = 0.8), name = "Sabotage Rule") + scale_x_continuous(labels = scales::percent) + scale_y_continuous(labels = scales::percent) + labs( title = "Association Rules: Support vs Confidence (Colour = Lift)", subtitle = "Orange-highlighted points = rules with Sabotage as consequent", x = "Support", y = "Confidence" ) + theme_minimal(base_size = 11) + theme(plot.title = element_text(face = "bold")) # ── Step 8: Interactive network graph via arulesViz ─────────────── grDevices::pdf(nullfile()) invisible(capture.output({ network_widget <- plot(top_rules, method = "graph", engine = "htmlwidget", control = list( main = "Association Rule Network: Top 20 Sabotage Rules by Lift", verbose = FALSE )) })) grDevices::dev.off() widget_path <- here::here("output", "ar_network.html") htmlwidgets::saveWidget(network_widget, widget_path, selfcontained = TRUE) htmltools::tags$iframe( src = widget_path, width = "100%", height = "550px", style = "border:none;" ) ``` ## Python ```{python tech5-py, fig.width=10, fig.height=8} #| cache: true #| message: false #| warning: false import pandas as pd import numpy as np import matplotlib.pyplot as plt import matplotlib.ticker as mticker from mlxtend.frequent_patterns import apriori, association_rules from mlxtend.preprocessing import TransactionEncoder inc_py = r.incidents_py cols = ["cause", "facility_cat", "area", "state", "cleanup_required", "category"] tx_py = inc_py[cols].fillna("Unknown").astype(str).copy() tx_py["cause"] = tx_py["cause"].str.split().str[:4].str.join(" ") prefix_map = { "cause": "cause=", "facility_cat": "fac=", "area": "area=", "state": "state=", "cleanup_required": "cleanup=", "category": "category=" } for col, prefix in prefix_map.items(): tx_py[col] = prefix + tx_py[col] tx_list = tx_py.values.tolist() te = TransactionEncoder() te.fit(tx_list) X = pd.DataFrame(te.transform(tx_list), columns=te.columns_) freq_items = apriori(X, min_support=0.02, use_colnames=True) freq_items["length"] = freq_items["itemsets"].apply(len) rules_py = association_rules( freq_items, metric = "lift", min_threshold = 1.2 ) rules_py["antecedents_str"] = rules_py["antecedents"].apply( lambda x: ", ".join(sorted(x)) ) rules_py["consequents_str"] = rules_py["consequents"].apply( lambda x: ", ".join(sorted(x)) ) sab_rules = rules_py[ rules_py["consequents_str"].str.contains("Sabotage", na=False) ].copy() top_sab = (sab_rules .sort_values("lift", ascending=False) .head(20) [["antecedents_str", "consequents_str", "support", "confidence", "lift"]] .round(4)) def clean_label_py(x): return (x.replace("facility_cat=fac=", "Facility: ") .replace("cause=cause=", "Cause: ") .replace("area=area=", "Area: ") .replace("state=state=", "State: ") .replace("cleanup=cleanup=", "Cleanup: ") .replace("category=category=", "") .replace("fac=", "Facility: ") .replace("cause=", "Cause: ") .replace("area=", "Area: ") .replace("state=", "State: ") .replace("cleanup=", "Cleanup: ") .replace("category=", "")) top_sab["antecedents_str"] = top_sab["antecedents_str"].apply( clean_label_py) top_sab["consequents_str"] = top_sab["consequents_str"].apply( clean_label_py) print("Top 20 Sabotage Rules by Lift:") print(top_sab.to_string(index=False)) fig, ax = plt.subplots(figsize=(10, 6)) non_sab = rules_py[ ~rules_py["consequents_str"].str.contains("Sabotage", na=False) ] sab_sub = rules_py[ rules_py["consequents_str"].str.contains("Sabotage", na=False) ] ax.scatter(non_sab["support"], non_sab["confidence"], c=non_sab["lift"], cmap="Blues", alpha=0.3, s=20, label="Other rules") sc2 = ax.scatter(sab_sub["support"], sab_sub["confidence"], c=sab_sub["lift"], cmap="Oranges", alpha=0.8, s=40, label="Sabotage rules") plt.colorbar(sc2, ax=ax, label="Lift") ax.set_title( "Association Rules: Support vs Confidence\n" "(Orange = Sabotage consequent, Colour intensity = Lift)", fontsize=11, fontweight="bold" ) ax.set_xlabel("Support") ax.set_ylabel("Confidence") ax.xaxis.set_major_formatter(mticker.PercentFormatter(xmax=1)) ax.yaxis.set_major_formatter(mticker.PercentFormatter(xmax=1)) ax.legend() ax.grid(True, alpha=0.3) plt.tight_layout() plt.show() top15 = top_sab.head(15).copy() top15["rule_label"] = ( top15["antecedents_str"].str[:45] + " => Sabotage" ) fig, ax = plt.subplots(figsize=(11, 6)) bars = ax.barh(range(len(top15)), top15["lift"], color="#e05c2a", alpha=0.85, height=0.6) ax.set_yticks(range(len(top15))) ax.set_yticklabels(top15["rule_label"], fontsize=8) ax.set_xlabel("Lift") ax.set_title( "Top 15 Association Rules with Sabotage as Consequent (Ranked by Lift)", fontsize=11, fontweight="bold" ) ax.axvline(1.0, linestyle="--", color="grey", linewidth=0.8) ax.invert_yaxis() for bar, conf in zip(bars, top15["confidence"]): ax.text(bar.get_width() + 0.005, bar.get_y() + bar.get_height() / 2, f"conf={conf:.2f}", va="center", fontsize=7) plt.tight_layout() plt.show() print("\nSummary statistics of all Sabotage rules:") print(sab_rules[["support", "confidence", "lift"]].describe().round(4)) ``` ::: --- ## 9.4 Management Interpretation **What association rules are and how to read them** Association rule mining works like an experienced investigator who has read all 1,695 incident reports and noticed that certain combinations of facts tend to appear together far more often than chance would predict. The algorithm formalises that intuition by scanning every possible combination of cause type, location, facility category, area, state, and cleanup requirement, and surfacing only those combinations that are both frequent enough to be reliable and strong enough to be surprising. Each rule comes with three numbers that tell a precise operational story: - **Support** answers: how often does this exact combination appear across all 1,695 incidents? A support of 0.03 means the combination appears in about 51 incidents — enough to be statistically reliable but specific enough to be operationally targeted - **Confidence** answers: when the antecedent conditions are present, how reliably does Sabotage follow? A confidence of 0.96 means sabotage classification follows in 96 out of every 100 incidents with that attribute combination - **Lift** answers: how much more predictive is this combination than the overall sabotage rate alone? A lift of 1.3 means the combination is 30% more predictive of sabotage than the base rate. This is the number that determines whether a rule is genuinely actionable intelligence or merely a restatement of what the base rate already tells us **What the top rules mean for 2026 planning** The rules with the highest lift values identify the most specific and most reliable combinations of incident attributes that co-occur with sabotage. These rules answer the question the security and community engagement teams most need heading into 2026: not simply where sabotage happens most often, but which precise combinations of location, facility type, and cause pattern are the strongest predictors of sabotage activity. The top-ranked rule, {Facility: Pipeline (Oil), Area: LAR, Cleanup: YES} → {Sabotage} with confidence 0.964 and lift 1.305, carries a direct scheduling implication. When a cleanup-requiring incident occurs at an oil pipeline in the Lagos Area, it is classified as sabotage in over 96% of cases. This is not a marginal finding. Resources concentrated at oil pipelines in LAR during periods of elevated third-party activity will intercept a disproportionate share of sabotage events relative to their cost. **What the network graph reveals** The network graph visualises all 20 top rules simultaneously as a connected map. Reading it: - Nodes represent individual incident attributes (facility type, area, state, cleanup status, cause) - Edges represent rules, with arrows pointing toward the Sabotage target node - Nodes with many edges connecting to them are the most diagnostically powerful attributes. They appear in the most high-confidence sabotage rules and anchor the security targeting brief - Hovering over any edge displays the exact lift, confidence, and support for that specific rule - Colour coding distinguishes attribute type: blue for facility type, green for area, purple for state, amber for cleanup status, and grey for cause **What this means for project scope** The rules linking specific cause and facility combinations to cleanup requirements have a direct capital implication. If a high-lift rule confirms that incidents at oil pipelines in SAR reliably require cleanup, that combination should be explicitly costed and scoped in the Brass Canal Remediation and Ebocha Water Reinjection project budgets. It belongs in the base estimate as a predictable and quantifiable cost element, not in the contingency reserve. **Recommendation feeding into Section 10** Two outputs from this section feed directly into the 2026 security strategy: - The top five rules by lift from Table AR1 define the specific operational contexts that warrant elevated security presence and proactive community liaison in 2026 - The antecedent combinations in those rules — the precise combinations of facility type, area, and cause — replace broad geographic targeting with evidence-based attribute targeting that is both more efficient and more defensible to management # 10. Integrated Findings ## 10.1 What the Five Analyses Collectively Show Five analytical techniques were applied to the same 1,695-incident dataset, each addressing a different dimension of the 2026 operational planning problem. Taken individually, each produces a useful output. Taken together, they converge on a single, coherent operational picture that no individual analysis could produce alone. **Text Analytics (Section 5)** The text analytics revealed that incident narrative descriptions cluster into five latent topics. Two are operationally diagnostic: - Topic 2, dominated by the terms *person*, *installed*, *valve*, *cut*, *theft*, and *flowline*, captures the language of deliberate human interference. An unknown person physically installing a bypass valve or cutting a flowline to extract oil - Topic 1, centred on *pipe*, *valve*, *nipple*, and *flange*, captures a second pattern of hardware manipulation at pipe connections Neither pattern maps cleanly onto the structured cause field. They emerge only from the unstructured narrative text. Together they confirm that the dominant failure mode across the portfolio is not mechanical degradation but organised, repeated physical interference at flowlines and valve points. **Survival Analysis (Section 8)** The Kaplan-Meier curves and Cox Proportional Hazards model show that the highest-recurrence facilities are concentrated in the SAR operational area: - The 18" Tebidaba/Brass Pipeline records a median of just 2.0 days between consecutive incidents across 131 recorded intervals - The 10" Clough Creek/Tebidaba Pipeline records a median of 3.0 days across 141 intervals - The Cox model identifies Pipeline (Oil) in SAR and Bayelsa as the asset class carrying the highest hazard rate, controlling for all other factors These are not facilities that occasionally fail. They are facilities under sustained, near-continuous attack. **Association Rules (Section 9)** The highest-lift rule in the dataset confirms the mechanism directly: - {Facility: Pipeline (Oil), Area: LAR, Cleanup: YES} → {Sabotage} with confidence 0.964 and lift 1.305 - When a cleanup-requiring incident occurs at an oil pipeline, sabotage classification follows in 96.4% of cases, at a rate 30% above what the overall sabotage base rate would predict - Across all top 20 rules, Pipeline (Oil) appears as the dominant antecedent facility type and cleanup requirement is the single strongest co-occurring operational attribute **Monte Carlo Simulation (Section 6)** The simulation quantifies the financial consequence of this pattern: - P50 annual remediation cost: **$17,136,316** - P90 conservative reserve: **$18,603,017** - The gap of approximately $1.5 million between P50 and P90 is the contingency required to cover costs in 9 out of 10 years - The tornado sensitivity analysis identifies per-incident remediation cost as the dominant driver of uncertainty. Better cost capture in Joint Investigation Visit (JIV) reports would directly narrow the P90 band in future planning cycles **Prophet Forecasting (Section 7)** The forecast provides the timing context: - Mean RMSE of 15.41 incidents per month across walk-forward cross-validation folds reflects the highly variable and externally driven nature of pipeline sabotage in the Niger Delta - The seasonal component confirms that incident frequency is not uniform across the year - The six-month forward projection provides the capacity planning baseline for emergency response resourcing in the first half of 2026 **Convergence visualisation** The chart below summarises how each technique contributes to the single integrated recommendation. ```{r convergence-chart, echo=FALSE, fig.cap="Five-technique convergence map — each analytical output pointing to the same 2026 operational priority", fig.width=11, fig.height=6} #| message: false #| warning: false library(ggplot2) conv_df <- data.frame( Technique = c( "Text Analytics\n(Section 5)", "Survival Analysis\n(Section 8)", "Association Rules\n(Section 9)", "Monte Carlo\n(Section 6)", "Prophet Forecast\n(Section 7)" ), Finding = c( "Topic 2: person, valve,\ncut, theft, flowline", "Tebidaba/Brass: 2.0 days\nClough Creek: 3.0 days", "Pipeline (Oil) + LAR +\nCleanup = Sabotage\n(conf 0.964, lift 1.305)", "P90 Reserve:\n$18,603,017", "RMSE: 15.41/month\nSeasonal peak confirmed" ), Implication = c( "Third-party interference\nis dominant failure mode", "SAR pipelines are\nhighest-hazard assets", "Oil pipelines in SAR/LAR\nare primary sabotage targets", "Hold $18.6M as\ncontingency reserve", "Size Q1 2026 emergency\nresponse capacity now" ), x = 1:5 ) arrow_df <- data.frame( x = 1:5, xend = rep(3, 5), y = rep(1.5, 5), yend = rep(0.7, 5) ) ggplot() + # Technique boxes geom_tile(data = conv_df, aes(x = x, y = 3, width = 0.88, height = 0.7), fill = "#1a5276", colour = "#2c7fb8", linewidth = 0.8) + geom_text(data = conv_df, aes(x = x, y = 3, label = Technique), colour = "white", size = 3.0, fontface = "bold", lineheight = 0.85) + # Finding boxes geom_tile(data = conv_df, aes(x = x, y = 1.8, width = 0.88, height = 0.9), fill = "#2c3e6a", colour = "#555577", linewidth = 0.6) + geom_text(data = conv_df, aes(x = x, y = 1.8, label = Finding), colour = "white", size = 2.5, lineheight = 0.82) + # Arrows from findings to recommendation annotate("segment", x = 1:5, xend = 1:5, y = 1.35, yend = 0.85, colour = "#e67e22", linewidth = 0.7, arrow = arrow(length = unit(0.15, "cm"), type = "closed")) + # Horizontal convergence line annotate("segment", x = 1, xend = 5, y = 0.85, yend = 0.85, colour = "#e67e22", linewidth = 1.0) + # Centre drop arrow annotate("segment", x = 3, xend = 3, y = 0.85, yend = 0.35, colour = "#e67e22", linewidth = 1.2, arrow = arrow(length = unit(0.18, "cm"), type = "closed")) + # Integrated recommendation box annotate("rect", xmin = 1.2, xmax = 4.8, ymin = 0.0, ymax = 0.35, fill = "#e05c2a", colour = "#c0392b", linewidth = 1.0) + annotate("text", x = 3, y = 0.175, label = "CONCENTRATE 2026 RESOURCES ON SAR-AREA OIL PIPELINES\nHold $18.6M contingency reserve | Prioritise Tebidaba & Clough Creek corridors", colour = "white", size = 3.0, fontface = "bold", lineheight = 0.9, hjust = 0.5, vjust = 0.5) + scale_x_continuous(limits = c(0.4, 5.6), expand = c(0, 0)) + scale_y_continuous(limits = c(-0.1, 3.45), expand = c(0, 0)) + labs( title = "Five-Technique Convergence: All Analytical Outputs Point to the Same Priority", subtitle = "Each technique addressed a different analytical question — all five converge on a single 2026 operational recommendation" ) + theme_void() + theme( plot.title = element_text(face = "bold", size = 12, hjust = 0.5, margin = margin(b = 4)), plot.subtitle = element_text(size = 9, hjust = 0.5, colour = "#555555", margin = margin(b = 10)), plot.background = element_rect(fill = "white", colour = NA), plot.margin = margin(10, 10, 10, 10) ) ``` --- ## 10.2 The Single Integrated Recommendation ::: {.callout-important icon=false} **2026 Priority Recommendation:** Concentrate integrity inspection, surveillance, and community engagement resources on oil pipelines in the SAR operational area, specifically the Tebidaba/Brass and Clough Creek/Tebidaba corridors, while holding a remediation contingency reserve of **$18.6 million** in the 2026 annual operating budget. ::: The convergence across all five techniques is striking in its consistency: - The LDA topic model identifies the language of physical interference at flowlines and valve points as the dominant narrative pattern - The survival analysis identifies the Tebidaba and Clough Creek oil pipelines in SAR as the highest-recurrence assets in the portfolio with median inter-incident times of 2 to 3 days - The association rules establish that oil pipeline incidents requiring cleanup are classified as sabotage in over 96% of cases - The Monte Carlo simulation puts the P90 annual cost of this pattern at $18.6 million - The Prophet forecast confirms the pattern is structural and ongoing rather than declining These five outputs, derived independently from different analytical frameworks, all point to the same underlying operational reality: a concentrated, organised pattern of third-party interference targeting oil transmission pipelines in the SAR area, particularly the Tebidaba and Clough Creek corridors, that is generating the majority of OERNL's incident volume, remediation cost exposure, and production deferral risk. The 2026 operational plan should reflect this convergence in four specific actions: ```{r integrated-table, echo=FALSE} #| message: false #| warning: false action_df <- data.frame( Priority = c("1", "2", "3", "4"), Action = c( "Elevate the 18\" Tebidaba/Brass Pipeline and 10\" Clough Creek/Tebidaba Pipeline to mandatory Q1 2026 integrity inspection. Both facilities record median inter-incident times below 3 days across 130-plus repeat intervals", "Hold $18,603,017 as the 2026 remediation contingency reserve in the annual operating budget submission to Finance and the Board. This is the P90 Monte Carlo output", "Direct security patrol and community engagement resources to SAR-area oil pipelines during peak incident months identified by the Prophet seasonal component. Do not distribute resources uniformly across the portfolio", "Expand JIV cost capture to all sabotage incidents, not only the 51 currently recorded. This directly narrows the Monte Carlo P90 band and improves the defensibility of future budget submissions" ), Analytical_Basis = c( "Survival analysis: median time to event = 2.0 days (Tebidaba/Brass, 131 intervals); 3.0 days (Clough Creek, 141 intervals). Cox hazard ratio confirms Pipeline (Oil) in SAR as highest-hazard asset class", "Monte Carlo simulation: P50 = $17,136,316; P90 = $18,603,017. Tornado chart identifies per-incident cost as dominant uncertainty driver", "Association rules: top rule confidence 0.964, lift 1.305 for Pipeline (Oil) and cleanup requirement leading to Sabotage. LDA Topic 2 confirms person-valve-cut-theft as dominant interference narrative", "Monte Carlo limitation: cost distribution fitted from only 51 cost records. Wider cost coverage reduces P90 uncertainty and strengthens the capital plan" ) ) kableExtra::kable(action_df, col.names = c("Priority", "2026 Action", "Analytical Basis"), caption = "Table IF1: Integrated 2026 Operational Recommendations — Five-Technique Convergence") %>% kableExtra::kable_styling( bootstrap_options = c("striped", "hover", "condensed"), full_width = TRUE ) %>% kableExtra::column_spec(1, bold = TRUE, width = "5%") %>% kableExtra::column_spec(2, width = "52%") %>% kableExtra::column_spec(3, width = "43%", italic = TRUE) %>% kableExtra::row_spec(1, background = "#fef3f0") %>% kableExtra::row_spec(2, background = "#f0f7fe") ``` The four actions above are not independent recommendations from four separate analyses. They are a single, integrated operational response to a single, concentrated problem that five different analytical lenses have independently identified and quantified. The Tebidaba and Clough Creek corridors are the highest-cost, highest-frequency, highest-sabotage-risk assets in the OERNL portfolio. The 2026 plan should treat them accordingly. # 11. Limitations and Further Work ## 11.1 Data Limitations The analyses in this submission were conducted on internally sourced operational data that carries several structural limitations. Each limitation is documented below with its specific impact on the analytical outputs and the remediation action that would address it. **Sparse cost coverage** Of the 1,695 incident records in the consolidated dataset, only 51 contain a populated `cost_usd` field, predominantly from sabotage records in 2019 and 2021. This means the Monte Carlo simulation in Section 6 was fitted from a Gamma distribution calibrated on fewer than 4% of all incidents. ::: {.callout-note icon=false} `r icon(ic_database)` **Example impact:** If the 51 recorded cost incidents systematically over-represent large, expensive sabotage events while smaller incidents go uncosted, the fitted Gamma distribution will have a heavier tail than reality — producing a P90 of $18,603,017 that may be biased upward. The true P90 with full cost coverage could be materially lower. ::: - The remaining 96% of incidents have no recorded remediation cost, creating a sparse and potentially unrepresentative fitting sample - The single most impactful data improvement available to the department is a standing requirement for JIV teams to record a cost estimate for every incident, not only major sabotage events - This one change would directly narrow the P90 band and improve the defensibility of all future budget submissions to Finance and the Board **Inconsistent facility categorisation** Table SA2 shows that several of the highest-recurrence facilities, including the top-ranked 24" Ogoda-Brass Pipeline and multiple Tebidaba corridor assets, carry no value in the `facility_cat` field. ::: {.callout-note icon=false} `r icon(ic_layers)` **Example impact:** The 24" Ogoda-Brass Pipeline ranks first in Table SA2 with a median inter-incident time of 1.0 day across 3 intervals, yet it was excluded from the Cox Proportional Hazards model entirely because its facility category is unpopulated. The model therefore understates the hazard for that asset class. ::: - Facilities without a facility category were excluded from Cox model stratification and the facility category layer of the association rules - The hazard ratios reported in Table SA1 are therefore biased toward the better-documented asset classes, understating the true differential between categorised and uncategorised facilities - A systematic audit of the facility category field across the incident register would immediately improve the precision of both outputs **Cause field granularity** The raw cause field contains free-text entries of highly variable length, from single words like "Corrosion" to full sentences describing the sequence of events. ::: {.callout-note icon=false} `r icon(ic_search)` **Example impact:** The cause entry "Unknown person installed a hacksaw cut on the 18 inch pipeline" was truncated to "Unknown Person Installed A" for the association rule transactions — losing the operationally critical terms "hacksaw cut" and "18 inch pipeline" that would have produced a more specific and more actionable rule. ::: - Truncating to four words for the association rule transactions reduced cardinality but lost meaningful distinctions between cause sub-types - A structured cause taxonomy with a controlled vocabulary of 15 to 20 canonical cause categories, applied consistently at point of entry, would make both the TF-IDF analysis and the association rules significantly more discriminating - This is a data governance change, not a technical one, and requires no additional tooling to implement **Deferred production data** Deferred production volume was available for only 51 records, the same sparse subset as cost data. ::: {.callout-note icon=false} `r icon(ic_chartbar)` **Example impact:** A major sabotage event on the 18" Tebidaba/Brass Pipeline that shuts down production for 14 days at 5,000 barrels per day represents approximately $3.5 million in lost revenue at $50 per barrel — more than twice the direct remediation cost of a typical large incident. This exposure is entirely absent from the current Monte Carlo P90 figure. ::: - The Monte Carlo simulation was constrained to remediation cost alone and could not model the full financial exposure including production deferral - Incorporating deferred production barrels multiplied by the prevailing oil price would give a materially more complete picture of the true P90 financial exposure - The Board-level contingency reserve figure should eventually reflect total financial exposure, not remediation cost alone **No external covariates for forecasting** The Prophet model in Section 7 captures average seasonal patterns in incident frequency but cannot distinguish between a seasonally elevated period driven by weather versus one driven by community unrest. ::: {.callout-note icon=false} `r icon(ic_wave)` **Example impact:** The Q3 2020 spike visible in the monthly incident series coincided with both the wet season and the post-COVID resumption of community activity in the Niger Delta. The model treats these as a single seasonal effect, but their operational implications are entirely different. Weather-driven spikes require additional containment equipment. Community-driven spikes require additional security presence and engagement resources. ::: - External covariates including rainfall data from NIMET, community tension indices, and DPR inspection schedules were not available for integration as Prophet regressors - Their absence means the seasonal component blends structurally distinct drivers into a single undifferentiated pattern - Separating these signals would improve both forecast accuracy and the operational specificity of the capacity planning recommendation --- ## 11.2 Analytical Limitations Beyond data quality, four analytical assumptions constrain the interpretation of the outputs. **The bag-of-words assumption in text analytics** The LDA topic model discards word order and grammatical structure entirely. ::: {.callout-note icon=false} `r icon(ic_brain)` **Example impact:** The sentences "unknown person cut the pipe" and "the pipe cut an unknown person" produce identical token counts under the bag-of-words model and would be assigned identical topic probabilities. The first describes deliberate sabotage. The second describes an accidental injury. The model cannot distinguish them. ::: - For incident narratives where the subject-verb-object structure carries the operational meaning, a dependency-parsed or transformer-based model such as a fine-tuned BERT variant would extract substantially richer causal intelligence - With access to a GPU environment, replacing the LDA pipeline with a BERTopic or Named Entity Recognition model trained on oil and gas incident text would be the natural analytical next step **The proportional hazards assumption in survival analysis** The Cox model was tested via Schoenfeld residuals, and the p-values reported in Section 8 indicate potential violation for some covariates. ::: {.callout-note icon=false} `r icon(ic_listol)` **Example impact:** The hazard ratio for Pipeline (Oil) relative to the reference facility type is reported as approximately 2.6 in Table SA1. If this ratio was substantially higher in 2019 to 2021 and lower in 2023 to 2025, the reported single figure understates the historical peak risk and overstates the current risk. A time-varying coefficient model would surface this change explicitly. ::: - Pipeline operating conditions in the Niger Delta changed materially between 2019 and 2025. The 2020 oil price shock, COVID-19 movement restrictions, and subsequent recovery each created structural breaks that a time-constant hazard ratio cannot fully capture - A time-varying coefficient Cox model, or a parametric accelerated failure time model with explicit changepoint detection, would more honestly represent the non-stationarity in the recurrence process **Association rules are descriptive, not causal** The highest-lift rule establishes a strong co-occurrence pattern but does not prove causation. ::: {.callout-note icon=false} `r icon(ic_clipboard)` **Example impact:** The rule {Facility: Pipeline (Oil), Area: LAR, Cleanup: YES} → {Sabotage} with lift 1.305 could reflect a reporting bias rather than a real pattern. LAR field engineers may simply be more diligent at recording cleanup requirements, which artificially inflates the co-occurrence with sabotage classification. The rule would then be a documentation artefact rather than an operational signal. ::: - Validating the rules against an independent holdout dataset, or against a separate year of incident records not used in model training, would distinguish genuine operational patterns from documentation artefacts - Until validated, the top rules should be treated as priority hypotheses for field investigation rather than confirmed operational facts **The Prophet forecast carries high uncertainty at monthly level** The mean RMSE of 15.41 incidents per month and MAPE of 114.7% reflect the fundamental unpredictability of sabotage-driven incident counts. ::: {.callout-note icon=false} `r icon(ic_calendar)` **Example impact:** A forecast of 20 incidents for March 2026 with a 90% confidence interval of 5 to 35 incidents means the actual count could be anywhere from a quiet month to the highest recorded monthly total in the dataset. Sizing emergency response capacity for exactly 20 incidents and actually experiencing 35 would leave the department 75% under-resourced for that month. ::: - Unlike demand forecasting in retail or manufacturing where Prophet was originally designed, pipeline sabotage in the Niger Delta is driven by social, economic, and security dynamics that no time series model can capture from incident frequency alone - The model is useful for annual-level capacity planning and seasonal pattern identification but should not be used to make precise monthly staffing decisions without an explicit uncertainty buffer --- ## 11.3 Further Work The table below summarises what would be done differently with more data, time, or computing power across each of the five techniques. ```{r further-work-table, echo=FALSE} #| message: false #| warning: false fw_df <- data.frame( Technique = c( "Text Analytics", "Monte Carlo Simulation", "Advanced Forecasting", "Survival Analysis", "Association Rules" ), Current_Limitation = c( "LDA bag-of-words loses word order and semantic context in short incident narratives", "Cost distribution fitted from only 51 records. 96% of incidents have no cost data", "MAPE of 114.7% reflects unpredictability of sabotage-driven counts without external covariates", "Cox model assumes time-constant hazard ratios across a 7-year window with structural breaks", "Rules are descriptive co-occurrences. Causal direction not validated against holdout data" ), Further_Work = c( "Replace LDA with BERTopic or fine-tuned BERT Named Entity Recognition on GPU. Extract structured cause entities directly from narrative text", "Mandate JIV cost capture for all incidents. Refit Gamma distribution on full portfolio. Extend simulation to include deferred production value", "Add external regressors to Prophet — NIMET rainfall, community tension index, crude price — to separate structural from externally triggered spikes", "Fit time-varying coefficient Cox model with explicit changepoint detection at 2020 and 2022 structural breaks", "Build gradient-boosted classifier using top rule antecedents as features. Moves from retrospective pattern detection to real-time sabotage risk flagging" ), Resource_Required = c( "GPU compute environment. Labelled training corpus of 500-plus incident narratives", "Operational policy change requiring mandatory cost field in incident register from 2026", "Data sharing agreement with NIMET. Access to Niger Delta Security Tracker", "No additional data required. Analytical extension of current survival dataset", "12 months of new incident records as holdout validation set" ) ) kableExtra::kable(fw_df, col.names = c("Technique", "Current Limitation", "Proposed Further Work", "Resource Required"), caption = "Table FW1: Further Work — What Would Be Done Differently") %>% kableExtra::kable_styling( bootstrap_options = c("striped", "hover", "condensed"), full_width = TRUE, font_size = 11 ) %>% kableExtra::column_spec(1, bold = TRUE, width = "14%") %>% kableExtra::column_spec(2, width = "25%") %>% kableExtra::column_spec(3, width = "36%") %>% kableExtra::column_spec(4, width = "25%", italic = TRUE) ``` # References ## Primary Data Source Olanubi, W. (2026). *OERNL pipeline incident register 2019–2025* [Dataset]. Compiled from Joint Investigation Visit reports and the HSE incident management system, Project Services Department, Oando Energy Resources Nigeria Limited, Lagos, Nigeria. Data available on request from the author. --- ## Course Textbook Adi, B. (2026). *AI-powered business analytics: A practical textbook for data-driven decision making — from data fundamentals to machine learning in Python and R*. Lagos Business School / markanalytics.online. https://markanalytics.online --- ## Academic and Professional References Agrawal, R., & Srikant, R. (1994). Fast algorithms for mining association rules in large databases. In *Proceedings of the 20th International Conference on Very Large Data Bases* (pp. 487–499). Morgan Kaufmann. Carsey, T. M., & Harden, J. J. (2013). *Monte Carlo simulation and resampling methods for social science* (1st ed.). SAGE Publications. Glasserman, P. (2003). *Monte Carlo methods in financial engineering* (Vol. 53). Springer. Kleinbaum, D. G., & Klein, M. (2012). *Survival analysis: A self-learning text* (3rd ed.). Springer. https://doi.org/10.1007/978-1-4419-6646-9 Marr, B. (2017). *Key business analytics: The 60+ business analysis tools every manager needs to know*. Pearson Education. Taylor, S. J., & Letham, B. (2018). Forecasting at scale. *The American Statistician*, *72*(1), 37–45. https://doi.org/10.1080/00031305.2017.1380080 Wickham, H., Averick, M., Bryan, J., Chang, W., McGowan, L., François, R., Grolemund, G., Hayes, A., Henry, L., Hester, J., Kuhn, M., Pedersen, T. L., Miller, E., Bache, S. M., Müller, K., Ooms, J., Robinson, D., Seidel, D. P., Spinu, V., … Yutani, H. (2019). Welcome to the tidyverse. *Journal of Open Source Software*, *4*(43), 1686. https://doi.org/10.21105/joss.01686 Wickham, H. (2016). *ggplot2: Elegant graphics for data analysis*. Springer. https://doi.org/10.1007/978-3-319-24277-4 --- ## R Software and Package Citations R Core Team. (2024). *R: A language and environment for statistical computing* (Version 4.4). R Foundation for Statistical Computing. https://www.R-project.org/ Allaire, J. J., Teague, C., Scheidegger, C., Xie, Y., & Dervieux, C. (2022). *Quarto* (Version 1.5) [Computer software]. https://doi.org/10.5281/zenodo.5960048 Bürkner, P. C., Gabry, J., Kay, M., & Vehtari, A. (2022). *posterior: Tools for working with posterior distributions* (R package). https://mc-stan.org/posterior/ Firke, S. (2023). *janitor: Simple tools for examining and cleaning dirty data* (R package version 2.2.0). https://CRAN.R-project.org/package=janitor Grolemund, G., & Wickham, H. (2011). Dates and times made easy with lubridate. *Journal of Statistical Software*, *40*(3), 1–25. https://doi.org/10.18637/jss.v040.i03 Grun, B., & Hornik, K. (2011). topicmodels: An R package for fitting topic models. *Journal of Statistical Software*, *40*(13), 1–30. https://doi.org/10.18637/jss.v040.i13 Hahsler, M., Grün, B., & Hornik, K. (2005). arules — A computational environment for mining association rules and frequent item sets. *Journal of Statistical Software*, *14*(15), 1–25. https://doi.org/10.18637/jss.v014.i15 Hahsler, M. (2017). arulesViz: Interactive visualization of association rules with R. *R Journal*, *9*(2), 163–175. https://doi.org/10.32614/RJ-2017-047 Kassambara, A., & Kosinski, M. (2021). *survminer: Drawing survival curves using ggplot2* (R package version 0.4.9). https://CRAN.R-project.org/package=survminer Müller, K., & Wickham, H. (2023). *tibble: Simple data frames* (R package version 3.2.1). https://CRAN.R-project.org/package=tibble Pedersen, T. L. (2022). *patchwork: The composer of plots* (R package version 1.1.2). https://CRAN.R-project.org/package=patchwork Sievert, C. (2020). *Interactive web-based data visualization with R, plotly, and shiny*. Chapman and Hall/CRC. https://plotly-r.com Silge, J., & Robinson, D. (2016). tidytext: Text mining and analysis using tidy data principles in R. *Journal of Open Source Software*, *1*(3), 37. https://doi.org/10.21105/joss.00037 Spinu, V., Grolemund, G., & Wickham, H. (2023). *lubridate: Make dealing with dates a little easier* (R package version 1.9.3). https://CRAN.R-project.org/package=lubridate Taylor, S. J., & Letham, B. (2021). *prophet: Automatic forecasting procedure* (R package version 1.0). https://CRAN.R-project.org/package=prophet Therneau, T. M. (2023). *A package for survival analysis in R* (R package version 3.5-7). https://CRAN.R-project.org/package=survival Therneau, T. M., & Grambsch, P. M. (2000). *Modeling survival data: Extending the Cox model*. Springer. https://doi.org/10.1007/978-1-4757-3294-8 Wickham, H. (2022). *stringr: Simple, consistent wrappers for common string operations* (R package version 1.5.0). https://CRAN.R-project.org/package=stringr Wickham, H., & Bryan, J. (2023). *readxl: Read Excel files* (R package version 1.4.3). https://CRAN.R-project.org/package=readxl Wickham, H., François, R., Henry, L., Müller, K., & Vaughan, D. (2023). *dplyr: A grammar of data manipulation* (R package version 1.1.4). https://CRAN.R-project.org/package=dplyr Wickham, H., & Henry, L. (2023). *purrr: Functional programming tools* (R package version 1.0.2). https://CRAN.R-project.org/package=purrr Wickham, H., Vaughan, D., & Girlich, M. (2023). *tidyr: Tidy messy data* (R package version 1.3.0). https://CRAN.R-project.org/package=tidyr Zeileis, A., & Grothendieck, G. (2005). zoo: S3 infrastructure for regular and irregular time series. *Journal of Statistical Software*, *14*(6), 1–27. https://doi.org/10.18637/jss.v014.i06 --- ## Python Package Citations Harris, C. R., Millman, K. J., van der Walt, S. J., Gommers, R., Virtanen, P., Cournapeau, D., Wieser, E., Taylor, J., Berg, S., Smith, N. J., Kern, R., Picus, M., Hoyer, S., van Kerkwijk, M. H., Brett, M., Haldane, A., del Río, J. F., Wiebe, M., Peterson, P., … Oliphant, T. E. (2020). Array programming with NumPy. *Nature*, *585*, 357–362. https://doi.org/10.1038/s41586-020-2649-2 Hunter, J. D. (2007). Matplotlib: A 2D graphics environment. *Computing in Science & Engineering*, *9*(3), 90–95. https://doi.org/10.1109/MCSE.2007.55 McKinney, W. (2010). Data structures for statistical computing in Python. In *Proceedings of the 9th Python in Science Conference* (pp. 56–61). https://doi.org/10.25080/Majora-92bf1922-00a Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., & Duchesnay, É. (2011). Scikit-learn: Machine learning in Python. *Journal of Machine Learning Research*, *12*, 2825–2830. Rasband, D. (2021). *vaderSentiment: VADER sentiment analysis* (Python package version 3.3.2). https://github.com/cjhutto/vaderSentiment Seabold, S., & Perktold, J. (2010). Statsmodels: Econometric and statistical modeling with Python. In *Proceedings of the 9th Python in Science Conference* (pp. 92–96). https://doi.org/10.25080/Majora-92bf1922-011 Taylor, S. J., & Letham, B. (2021). *Prophet: Forecasting at scale* (Python package version 1.1). https://facebook.github.io/prophet/ Van Rossum, G., & Drake, F. L. (2009). *Python 3 reference manual*. CreateSpace. Virtanen, P., Gommers, R., Oliphant, T. E., Haberland, M., Reddy, T., Cournapeau, D., Burovski, E., Peterson, P., Weckesser, W., Bright, J., van der Walt, S. J., Brett, M., Wilson, J., Millman, K. J., Mayorov, N., Nelson, A. R. J., Jones, E., Kern, R., Larson, E., … SciPy 1.0 Contributors. (2020). SciPy 1.0: Fundamental algorithms for scientific computing in Python. *Nature Methods*, *17*, 261–272. https://doi.org/10.1038/s41592-019-0686-2 Waskom, M. (2021). Seaborn: Statistical data visualization. *Journal of Open Source Software*, *6*(60), 3021. https://doi.org/10.21105/joss.03021 --- # Appendix: AI Usage Statement Claude (Anthropic, claude.ai) was used as an AI coding assistant throughout this submission. Specifically, Claude assisted with the generation of **R and Python code chunks** for data loading, cleaning, and transformation in the setup chunk; the construction of **ggplot2 and matplotlib visualisations**; the implementation of the **LDA topic model, Monte Carlo simulation, Prophet forecasting, Kaplan-Meier and Cox Proportional Hazards survival analysis, and Apriori association rule mining pipelines**; and the debugging of rendering errors encountered during Quarto document compilation. The **document structure, YAML configuration, and inline prose** were also developed with Claude's assistance as a drafting and formatting tool. Independent analytical judgement was exercised in the following areas: - **Case study selection:** Case Study 3 was selected as the most appropriate case for my professional context, given direct access to seven years of internal pipeline incident data from OERNL's HSE management system - **Central analytical question:** The framing of the research question and the mapping of each technique to a specific 2026 operational decision were determined independently, based on knowledge of the planning decisions facing the Project Services Department heading into 2026 - **Model output interpretation:** All model outputs were interpreted in the context of OERNL's operational reality, including the identification of the Tebidaba and Clough Creek corridors as the priority intervention assets and the translation of the P90 Monte Carlo figure into a specific Board-level budget recommendation - **LDA topic naming:** The five LDA topics were named and operationally interpreted based on domain knowledge of Niger Delta pipeline operations, not on algorithmic output alone. The algorithm surfaces word clusters; the operational meaning of those clusters was determined by the author - **Data quality identification:** The identification of sparse cost coverage, inconsistent facility categorisation, and cause field granularity as the primary data limitations was based on direct familiarity with the source system and its known recording practices - **Integrated recommendation:** The recommendation in Section 10 draws on seven years of direct professional experience in the Project Services Department to connect the analytical outputs to live planning decisions facing the organisation in 2026 All analytical decisions, business interpretations, and operational conclusions are my own.

	n_components n_components: int, default=10 Number of topics. .. versionchanged:: 0.19 ``n_topics`` was renamed to ``n_components``	5
	doc_topic_prior doc_topic_prior: float, default=None Prior of document topic distribution `theta`. If the value is None, defaults to `1 / n_components`. In [1]_, this is called `alpha`.	None
	topic_word_prior topic_word_prior: float, default=None Prior of topic word distribution `beta`. If the value is None, defaults to `1 / n_components`. In [1]_, this is called `eta`.	None
	learning_method learning_method: {'batch', 'online'}, default='batch' Method used to update `_component`. Only used in :meth:`fit` method. In general, if the data size is large, the online update will be much faster than the batch update. Valid options: - 'batch': Batch variational Bayes method. Use all training data in each EM update. Old `components_` will be overwritten in each iteration. - 'online': Online variational Bayes method. In each EM update, use mini-batch of training data to update the ``components_`` variable incrementally. The learning rate is controlled by the ``learning_decay`` and the ``learning_offset`` parameters. .. versionchanged:: 0.20 The default learning method is now ``"batch"``.	'batch'
	learning_decay learning_decay: float, default=0.7 It is a parameter that control learning rate in the online learning method. The value should be set between (0.5, 1.0] to guarantee asymptotic convergence. When the value is 0.0 and batch_size is ``n_samples``, the update method is same as batch learning. In the literature, this is called kappa.	0.7
	learning_offset learning_offset: float, default=10.0 A (positive) parameter that downweights early iterations in online learning. It should be greater than 1.0. In the literature, this is called tau_0.	10.0
	max_iter max_iter: int, default=10 The maximum number of passes over the training data (aka epochs). It only impacts the behavior in the :meth:`fit` method, and not the :meth:`partial_fit` method.	20
	batch_size batch_size: int, default=128 Number of documents to use in each EM iteration. Only used in online learning.	128
	evaluate_every evaluate_every: int, default=-1 How often to evaluate perplexity. Only used in `fit` method. set it to 0 or negative number to not evaluate perplexity in training at all. Evaluating perplexity can help you check convergence in training process, but it will also increase total training time. Evaluating perplexity in every iteration might increase training time up to two-fold.	-1
	total_samples total_samples: int, default=1e6 Total number of documents. Only used in the :meth:`partial_fit` method.	1000000.0
	perp_tol perp_tol: float, default=1e-1 Perplexity tolerance. Only used when ``evaluate_every`` is greater than 0.	0.1
	mean_change_tol mean_change_tol: float, default=1e-3 Stopping tolerance for updating document topic distribution in E-step.	0.001
	max_doc_update_iter max_doc_update_iter: int, default=100 Max number of iterations for updating document topic distribution in the E-step.	100
	n_jobs n_jobs: int, default=None The number of jobs to use in the E-step. ``None`` means 1 unless in a :obj:`joblib.parallel_backend` context. ``-1`` means using all processors. See :term:`Glossary ` for more details.	None
	verbose verbose: int, default=0 Verbosity level.	0
	random_state random_state: int, RandomState instance or None, default=None Pass an int for reproducible results across multiple function calls. See :term:`Glossary `.	42