Beyond the Numbers: Operational Analytics of HSE Incidents at IHS Towers Nigeria (2025-2026)

Author

Peace Ugoala — Lagos Business School EMBA

Published

May 19, 2026

IHS Towers

Beyond the Numbers: Operational Analytics of HSE Incidents at IHS Towers Nigeria

Incident Analytics Report  ·  2025 – 2026
Peace Ugoala
Lagos Business School  ·  EMBA Programme
Health, Safety & Environment  ·  Nigeria Operations

1 Executive Summary

This study analyses 1,246 HSE incident records from IHS Towers Nigeria — Africa’s largest independent telecommunications tower operator — spanning May 2025 to April 2026, extracted from the company’s VelocityEHS incident management system. Five operational analytics techniques are applied to extract decision-relevant intelligence: TF-IDF and Latent Dirichlet Allocation (LDA) topic modelling on incident narratives, Monte Carlo simulation of annual loss exposure, Prophet weekly forecasting of incident volume, Kaplan-Meier survival analysis on investigation closure times, and Apriori association rule mining on co-occurring incident types.

Three findings drive the recommendations. First, diesel and asset security incidents dominate the operational risk landscape, with topic modelling revealing five coherent themes that together account for over 70% of all reported events. Second, Monte Carlo simulation estimates annual loss exposure with a meaningful P90-P50 gap that defines the realistic worst-case for reserve setting. Third, median time-to-closure for investigations is long with a heavy right tail, suggesting workflow bottlenecks in the investigation-to-review handoff.

The single integrated recommendation is to prioritise three control interventions: (1) enhanced anti-theft hardening for AC/DC generation and battery assets, where association rules show elevated co-occurrence risk with environmental releases; (2) a structured 30-day investigation SLA to compress the survival curve’s tail; and (3) Prophet-forecast-driven proactive resourcing of HSE response capacity in high-risk weeks identified by the model.

2 Professional Disclosure

I am a Senior Manager, HSE at IHS Towers Nigeria Limited, an independent telecommunications tower infrastructure company operating over 16,000 towers across Nigeria and one of the largest tower operators globally. My day-to-day work involves providing strategic leadership, governance, compliance, incident management, environmental sustainability, and operational oversight for HSE across the company’s Nigerian footprint — while leading a regional team that translates those strategies into safe day-to-day field operations.

Text analytics and topic modelling is directly relevant to my work because HSE reporters across thousands of tower sites enter narrative descriptions in VelocityEHS that are rarely systematically mined. The free-text Description field contains operational intelligence — recurring failure modes, contractor behaviour patterns, environmental risk signals — that standard categorical analytics miss.

Monte Carlo simulation is relevant because IHS’s annual HSE budget and insurance reserving exercises require quantified estimates of loss exposure under uncertainty. Point estimates of “expected loss” hide the tail risk that drives reserving decisions. A simulation-based P10/P50/P90 framing supports both the safety case for risk control investment and the finance team’s reserve-setting discussions.

Advanced forecasting (Prophet) matters operationally because HSE response capacity (investigators, regional safety officers, on-call contractor mobilisation) is finite and must be deployed proactively. A weekly forecast with confidence intervals supports rostering and contractor availability discussions one quarter ahead.

Kaplan-Meier survival analysis is directly relevant to my workflow because incident closure cycle time is a regulatory and internal compliance metric. Survival estimation with proper handling of in-progress cases gives a defensible view of true median closure time.

Association rule mining (Apriori) addresses a real operational question: which incident types co-occur in the same investigation? Identifying co-occurrences supports bundled control interventions rather than treating each incident type in isolation.

3 Data Collection & Sampling

Source. Primary data was extracted from IHS Towers Nigeria’s VelocityEHS HSE management system — the company’s authoritative incident reporting platform where all field-reported HSE events are logged by HSE officers, site engineers, and contractors.

Period. The extraction covers incidents with incident_date from 2025-05-01 to 2026-04-12, yielding 1,246 records. Records from earlier periods were excluded because monthly reporting volume before May 2025 was anomalously low (<20 per month versus 100+ from May 2025 onward), reflecting a reporting-culture change rather than a real risk change.

Sampling. For techniques requiring manual severity classification, a stratified random sample of 308 incidents was drawn from the 1,246-record population. Strata were defined by primary_type. Within each stratum, simple random sampling without replacement was applied using numpy.random.default_rng(seed=42) for reproducibility.

Severity scoring and AI disclosure. Severity scores (1–5) were applied using a rule-based algorithm specified a priori by the author and applied mechanically by Anthropic Claude. The author manually reviewed a 10% random subsample (30 rows). Cost estimates use a (primary_type × severity) lookup matrix with ±20% random variation. Closure dates were populated only for the 63 incidents with Status = "Closed"; the remaining 245 are treated as right-censored.

Ethics and anonymisation. Permission obtained from the IHS Towers HSE management team. Personally identifiable fields removed. No customer data present.

Citation: Peace Ugoala. (2026). IHS Towers Nigeria HSE incident register, May 2025 – April 2026 [Dataset]. VelocityEHS export.

4 Data Description

Dataset Dimensions
Dataset Rows Columns
Full dataset 1246 13
Scored sample 308 17
Incident Status Distribution — Scored Sample (n = 308)
Status Count Share (%)
Investigation 174 56.5
Review 67 21.8
Closed 63 20.5
Pending Closure 4 1.3
Severity Score Distribution — Scored Sample
Severity Label Count Share (%)
1 1 – Negligible 11 3.6
2 2 – Minor 52 16.9
3 3 – Moderate 190 61.7
4 4 – Major 42 13.6
5 5 – Critical 13 4.2
Top 6 Incident Types — Full Dataset (n = 1,246)
Incident Type Incidents Share (%)
Environmental Release 260 20.9
Unsafe Condition 241 19.3
Property Damage 220 17.7
Security 173 13.9
Equipment Failure 153 12.3
Fire 132 10.6
Key Numeric Summaries — Scored Sample
Metric Value
Severity scores available 308
Mean severity score 2.98
Median incident cost ₦449,912
Mean incident cost ₦1,535,080
Median days to closure 39
Mean days to report 2.4

5 Analysis 1 — Text Analytics & Topic Modelling

5.1 Theory recap

TF-IDF identifies words distinctive to particular documents — frequent within a document but rare across the corpus. Latent Dirichlet Allocation (LDA) treats each document as a mixture of topics, where each topic is a distribution over words.

5.2 Business justification

The Description field is the richest but least-mined data source in VelocityEHS. Topic modelling lets the data speak — surfacing recurring patterns that the formal taxonomy misses.

5.3 TF-IDF — distinctive vocabulary per incident type

Code
library(tidytext)
library(stopwords)

ihs_stopwords <- c("ihs","site","tower","incident","report","reported","noted",
                   "observed","see","image","attached","fyi","kindly","please",
                   "find","also","new","day","date","time","one","two")

docs <- full_df |>
  filter(!is.na(description)) |>
  mutate(doc_id = row_number(),
         text   = paste(title, description, sep = ". ")) |>
  select(doc_id, primary_type, text)

tokens <- docs |>
  unnest_tokens(word, text) |>
  filter(!word %in% c(stopwords("en"), ihs_stopwords)) |>
  filter(nchar(word) > 2, !str_detect(word, "^\\d+$"))

tfidf <- tokens |>
  count(primary_type, word) |>
  bind_tf_idf(word, primary_type, n)

top_types <- full_df |>
  count(primary_type, sort = TRUE) |>
  slice_head(n = 6) |>
  pull(primary_type)

tfidf_plot_data <- tfidf |>
  filter(primary_type %in% top_types) |>
  group_by(primary_type) |>
  slice_max(tf_idf, n = 5, with_ties = FALSE) |>
  ungroup() |>
  mutate(
    word  = reorder_within(word, tf_idf, primary_type),
    label = str_replace(word, "___.*", "")
  )

p_tfidf <- tfidf_plot_data |>
  ggplot(aes(
    x      = tf_idf,
    y      = word,
    fill   = primary_type,
    text   = paste0(
      "<b>Word:</b> ", label, "<br>",
      "<b>Type:</b> ", primary_type, "<br>",
      "<b>TF-IDF:</b> ", round(tf_idf, 4)
    )
  )) +
  geom_col(show.legend = FALSE) +
  scale_y_reordered() +
  facet_wrap(~primary_type, scales = "free_y", ncol = 2) +
  scale_fill_brewer(palette = "Set2") +
  labs(
    title    = "Distinctive vocabulary per incident type (TF-IDF)",
    subtitle = "Top 5 most distinctive words for each of the 6 most common types",
    x        = "TF-IDF score",
    y        = NULL
  )

ggplotly(p_tfidf, tooltip = "text", height = 800) |>
  layout(
    margin  = list(l = 160, r = 30, t = 80, b = 60),
    font    = list(size = 11),
    xaxis   = list(fixedrange = FALSE),
    # push subplot panels apart so strip labels don't collide with axes
    grid    = list(ygap = 0.18, xgap = 0.12)
  )

5.4 LDA topic model (K = 5)

Code
library(topicmodels)
library(tm)
library(slam)

dtm <- tokens |>
  count(doc_id, word) |>
  filter(n > 0) |>
  cast_dtm(doc_id, word, n)

dtm <- dtm[slam::row_sums(dtm) > 2, ]

set.seed(42)
K         <- 5
lda_model <- LDA(dtm, k = K, control = list(seed = 42, alpha = 0.5))

topics_tidy <- tidy(lda_model, matrix = "beta")

lda_plot_data <- topics_tidy |>
  group_by(topic) |>
  slice_max(beta, n = 10, with_ties = FALSE) |>
  ungroup() |>
  mutate(
    topic_label = paste("Topic", topic),
    term        = reorder_within(term, beta, topic_label)
  )

p_lda <- lda_plot_data |>
  ggplot(aes(
    x    = beta,
    y    = term,
    fill = topic_label,
    text = paste0(
      "<b>Term:</b> ", str_replace(term, "___.*", ""), "<br>",
      "<b>Topic:</b> ", topic_label, "<br>",
      "<b>β:</b> ", round(beta, 4)
    )
  )) +
  geom_col(show.legend = FALSE) +
  scale_y_reordered() +
  facet_wrap(~topic_label, scales = "free_y", ncol = 3) +
  scale_fill_brewer(palette = "Set2") +
  labs(
    title = paste("LDA topics — top 10 terms per topic (K =", K, ")"),
    x     = "Word probability within topic (β)",
    y     = NULL
  )

ggplotly(p_lda, tooltip = "text", height = 850) |>
  layout(
    margin = list(l = 160, r = 30, t = 80, b = 60),
    font   = list(size = 11),
    grid   = list(ygap = 0.18, xgap = 0.12)
  )

5.5 Sentiment trend (AFINN lexicon)

Code
library(syuzhet)

sent_df <- full_df |>
  filter(!is.na(description)) |>
  mutate(
    text     = paste(coalesce(title, ""), coalesce(description, ""), sep = ". "),
    compound = get_sentiment(text, method = "afinn")
  )

monthly_sent <- sent_df |>
  mutate(month = floor_date(incident_date, "month")) |>
  group_by(month) |>
  summarise(
    mean_sent = mean(compound, na.rm = TRUE),
    n         = n(),
    .groups   = "drop"
  )

p_sent <- monthly_sent |>
  ggplot(aes(
    x    = month,
    y    = mean_sent,
    text = paste0(
      "<b>Month:</b> ", format(month, "%b %Y"), "<br>",
      "<b>Mean sentiment:</b> ", round(mean_sent, 3), "<br>",
      "<b>Incidents:</b> ", n
    )
  )) +
  geom_hline(yintercept = 0, linetype = "dashed", colour = "grey50") +
  geom_line(colour = "#00421E", linewidth = 1) +
  geom_point(colour = "#00421E", size = 2) +
  scale_x_date(date_labels = "%b %Y", date_breaks = "2 months") +
  labs(
    title    = "Monthly sentiment trend in HSE incident narratives (AFINN)",
    subtitle = "Mean AFINN sentiment score per month (negative = bad)",
    y        = "Mean sentiment",
    x        = NULL
  )

ggplotly(p_sent, tooltip = "text") |>
  layout(
    xaxis  = list(tickangle = -45, tickfont = list(size = 10)),
    margin = list(l = 60, r = 20, t = 60, b = 90)
  )
Code
cat("\nOverall mean sentiment (AFINN):",
    round(mean(sent_df$compound, na.rm = TRUE), 3), "\n")

Overall mean sentiment (AFINN): -1.949 
Code
cat("% negative narratives:",
    round(mean(sent_df$compound < 0, na.rm = TRUE) * 100, 1), "%\n")
% negative narratives: 69.3 %

5.6 Interpretation

The TF-IDF analysis identifies linguistically distinctive vocabulary per incident type — for example, terms like “diesel,” “spillage,” “tank,” and “litres” dominate Environmental Release, while “battery,” “cable,” “stolen,” and “vandal” dominate Security.

The LDA model with K=5 yields five interpretable topics. Examining the top terms suggests labels like: Asset Theft, Fuel & Environmental, Equipment Faults, Site Access & Security, and Civil & Structural.

For a non-technical manager: “Our incident narratives describe consistently negative events month-over-month. The language is not getting more positive over time, suggesting we have not yet seen the cultural shift that would accompany reduced incident severity.”

6 Analysis 2 — Monte Carlo Simulation

6.1 Theory recap

Monte Carlo simulation estimates the probability distribution of an outcome by repeated random sampling from input distributions. It answers “What is the 90th percentile annual loss?” and “Which input drives most of the output variance?” — questions point estimates cannot answer.

6.2 Business justification

IHS’s annual HSE budget and self-insurance reserving exercise require quantified loss exposure with proper uncertainty bands. Monte Carlo gives Finance the P90 number they need for reserve-setting.

6.3 Simulation and percentile table

Code
set.seed(42)

monthly_counts <- full_df |>
  mutate(ym = floor_date(incident_date, "month")) |>
  count(ym) |>
  pull(n)
lambda_annual <- mean(monthly_counts) * 12

sev_probs <- sample_df |>
  filter(!is.na(severity)) |>
  count(severity) |>
  mutate(p = n / sum(n)) |>
  arrange(severity)

cost_params <- sample_df |>
  filter(!is.na(cost_ngn), cost_ngn > 0) |>
  group_by(severity) |>
  summarise(
    meanlog = mean(log(cost_ngn)),
    sdlog   = sd(log(cost_ngn)),
    .groups = "drop"
  )

# Vectorised simulation: sample all incidents at once, then vectorised rlnorm
simulate_year <- function(lam_mult = 1, sd_mult = 1, sev_p = sev_probs$p) {
  n_inc <- rpois(1, lambda_annual * lam_mult)
  if (n_inc == 0L) return(0)
  sevs <- sample(sev_probs$severity, n_inc, replace = TRUE, prob = sev_p)
  # Vectorised lookup then vectorised rlnorm (no sapply loop)
  idx      <- match(sevs, cost_params$severity)
  ml       <- ifelse(is.na(idx), log(100000), cost_params$meanlog[idx])
  sl       <- ifelse(is.na(idx), 1,           cost_params$sdlog[idx]) * sd_mult
  sl[is.na(sl)] <- 1
  sum(rlnorm(n_inc, ml, sl))
}

N_SIM        <- 10000
sim_results  <- replicate(N_SIM, simulate_year())

p50 <- quantile(sim_results, 0.50)
p90 <- quantile(sim_results, 0.90)

pct_tbl <- tibble(
  Percentile        = c("P10", "P50 (median)", "Mean", "P90", "P95", "P99"),
  `Annual Loss (NGN)` = c(
    quantile(sim_results, 0.10),
    quantile(sim_results, 0.50),
    mean(sim_results),
    quantile(sim_results, 0.90),
    quantile(sim_results, 0.95),
    quantile(sim_results, 0.99)
  )
) |>
  mutate(`Annual Loss (NGN)` = scales::comma(`Annual Loss (NGN)`, prefix = "₦"))

DT::datatable(
  pct_tbl,
  caption  = "Simulated annual HSE loss exposure (10,000 Monte Carlo trials)",
  options  = list(dom = "t", pageLength = 10),
  rownames = FALSE
)
Code
# Interactive histogram with P50 / P90 reference lines
plot_ly(
  x    = sim_results / 1e9,
  type = "histogram",
  nbinsx = 60,
  marker = list(color = "#00421E", line = list(color = "white", width = 0.5)),
  hovertemplate = "Loss: ₦%{x:.2f}bn<br>Count: %{y}<extra></extra>"
) |>
  add_segments(
    x = p50 / 1e9, xend = p50 / 1e9,
    y = 0,         yend = N_SIM * 0.12,
    line = list(color = "white", width = 2, dash = "dash"),
    name = paste0("P50: ₦", scales::comma(round(p50 / 1e9, 2)), "bn")
  ) |>
  add_segments(
    x = p90 / 1e9, xend = p90 / 1e9,
    y = 0,         yend = N_SIM * 0.12,
    line = list(color = "#E07B00", width = 2, dash = "dash"),
    name = paste0("P90: ₦", scales::comma(round(p90 / 1e9, 2)), "bn")
  ) |>
  layout(
    title  = list(text = "Distribution of simulated annual HSE loss exposure"),
    xaxis  = list(title = "Annual loss (NGN billions)"),
    yaxis  = list(title = "Frequency"),
    legend = list(orientation = "h", y = -0.15),
    plot_bgcolor  = "white",
    paper_bgcolor = "white"
  )

6.4 Tornado sensitivity

Code
set.seed(42)
N_TORNADO <- 1000

base_p50 <- median(sim_results) / 1e9

shock_p50 <- function(lam_mult = 1, sd_mult = 1) {
  median(replicate(N_TORNADO, simulate_year(lam_mult, sd_mult))) / 1e9
}

scenarios <- tibble(
  scenario = c("Incident rate +20%", "Incident rate -20%",
               "Cost variance +20%", "Cost variance -20%"),
  p50      = c(shock_p50(1.2, 1.0), shock_p50(0.8, 1.0),
               shock_p50(1.0, 1.2), shock_p50(1.0, 0.8))
) |>
  mutate(
    delta     = p50 - base_p50,
    direction = if_else(delta > 0, "Worse", "Better"),
    scenario  = fct_reorder(scenario, abs(delta))
  )

p_tornado <- scenarios |>
  ggplot(aes(
    x    = delta,
    y    = scenario,
    fill = direction,
    text = paste0(
      "<b>Scenario:</b> ", scenario, "<br>",
      "<b>Direction:</b> ", direction, "<br>",
      "<b>Δ P50:</b> ₦", round(delta, 3), "bn"
    )
  )) +
  geom_col() +
  geom_vline(xintercept = 0, colour = "black", linewidth = 0.4) +
  scale_fill_manual(values = c("Worse" = "#E07B00", "Better" = "#00421E")) +
  labs(
    title    = "Tornado sensitivity — drivers of annual loss variance",
    subtitle = paste0("Baseline P50: ₦", scales::comma(base_p50, accuracy = 0.01), "bn"),
    x        = "Change in P50 annual loss (NGN billions)",
    y        = NULL,
    fill     = NULL
  )

ggplotly(p_tornado, tooltip = "text") |>
  layout(
    margin = list(l = 220, r = 30, t = 70, b = 60),
    yaxis  = list(tickfont = list(size = 10)),
    legend = list(orientation = "h", y = -0.12)
  )

6.5 Interpretation

The simulation produces a right-skewed annual loss distribution characteristic of insurance-relevant risk. The median outcome (P50 = ₦1.91bn) represents the expected loss in a typical year, while the 90th-percentile outcome (P90 = ₦2.10bn) defines the realistic bad-year scenario. The P90–P50 gap of ₦0.19bn (+10%) is the stretch-case exposure that capital allocation and reserve-setting discussions should address.

For IHS’s annual reserving discussion: “On a typical year we should expect approximately ₦1.91bn in direct HSE-related losses, but our reserve should be set at ₦2.10bn to absorb realistic bad years. The ₦0.19bn gap between P50 and P90 defines the stretch case that capital allocation discussions should address.”

The tornado analysis identifies incident rate as the largest driver of P50 variance — actionable because prevention investment has more leverage on the loss distribution than reducing cost variance.

7 Analysis 3 — Prophet Weekly Forecasting

7.1 Theory recap

Prophet decomposes a time series into trend, weekly seasonality, yearly seasonality, and event effects. It is robust to missing data and outliers and produces explicit uncertainty intervals.

7.2 Business justification

IHS HSE response capacity is finite. A weekly forecast supports proactive resourcing: scheduling investigators, briefing regional safety officers, and pre-positioning incident-response contractors for high-risk weeks.

7.3 Prophet model and 13-week forecast

Code
library(prophet)

weekly <- full_df |>
  mutate(week = floor_date(incident_date, "week", week_start = 1)) |>
  count(week) |>
  rename(ds = week, y = n)

m <- prophet(
  weekly,
  weekly.seasonality  = FALSE,   # data is weekly-aggregated; no sub-weekly info to fit
  yearly.seasonality  = FALSE,
  daily.seasonality   = FALSE,
  interval.width      = 0.80
)

future <- make_future_dataframe(m, periods = 13, freq = "week")
fcst   <- predict(m, future)

# Separate historical and forecast periods for plotting
hist_end   <- max(weekly$ds)
fcst_only  <- fcst |> filter(ds > hist_end)
fcst_hist  <- fcst |> filter(ds <= hist_end)

plot_ly() |>
  # 80% prediction band (forecast only)
  add_ribbons(
    data       = fcst_only,
    x          = ~ds,
    ymin       = ~yhat_lower,
    ymax       = ~yhat_upper,
    fillcolor  = "rgba(31,119,180,0.15)",
    line       = list(color = "transparent"),
    name       = "80% interval",
    hoverinfo  = "skip"
  ) |>
  # Fitted line (historical)
  add_lines(
    data = fcst_hist,
    x    = ~ds,
    y    = ~yhat,
    line = list(color = "#1f77b4", width = 1.5, dash = "dot"),
    name = "Fitted"
  ) |>
  # Forecast line
  add_lines(
    data = fcst_only,
    x    = ~ds,
    y    = ~yhat,
    line = list(color = "#1f77b4", width = 2),
    name = "Forecast",
    hovertemplate = paste0(
      "<b>Week:</b> %{x|%d %b %Y}<br>",
      "<b>Forecast:</b> %{y:.0f} incidents<extra></extra>"
    )
  ) |>
  # Observed points
  add_markers(
    data            = weekly,
    x               = ~ds,
    y               = ~y,
    marker          = list(color = "black", size = 4),
    name            = "Observed",
    hovertemplate   = paste0(
      "<b>Week:</b> %{x|%d %b %Y}<br>",
      "<b>Incidents:</b> %{y}<extra></extra>"
    )
  ) |>
  layout(
    title  = list(text = "Weekly HSE incident forecast (Prophet, 13-week horizon)"),
    xaxis  = list(title = "Week"),
    yaxis  = list(title = "Incidents per week"),
    legend = list(orientation = "h", y = -0.15),
    plot_bgcolor  = "white",
    paper_bgcolor = "white"
  )
Code
# Trend component only — weekly seasonality was disabled (data is weekly-aggregated)
trend_df <- fcst |>
  select(ds, trend, trend_lower, trend_upper)

plot_ly(trend_df, x = ~ds) |>
  add_ribbons(
    ymin      = ~trend_lower,
    ymax      = ~trend_upper,
    fillcolor = "rgba(0,66,30,0.15)",
    line      = list(color = "transparent"),
    name      = "80% CI",
    hoverinfo = "skip"
  ) |>
  add_lines(
    y    = ~trend,
    line = list(color = "#00421E", width = 2),
    name = "Trend",
    hovertemplate = "<b>Date:</b> %{x|%d %b %Y}<br><b>Trend:</b> %{y:.1f}<extra></extra>"
  ) |>
  layout(
    title         = "Prophet trend component",
    xaxis         = list(title = ""),
    yaxis         = list(title = "Trend (incidents / week)"),
    plot_bgcolor  = "white",
    paper_bgcolor = "white",
    showlegend    = FALSE
  )
Code
fcst_summary <- fcst |>
  filter(ds > max(weekly$ds)) |>
  select(ds, yhat, yhat_lower, yhat_upper) |>
  mutate(
    ds         = format(ds, "%d %b %Y"),
    across(c(yhat, yhat_lower, yhat_upper), \(x) round(x, 0))
  ) |>
  rename(
    `Week`        = ds,
    `Forecast`    = yhat,
    `Lower (80%)` = yhat_lower,
    `Upper (80%)` = yhat_upper
  )

DT::datatable(
  head(fcst_summary, 8),
  caption  = "Forecast values for next 8 weeks",
  options  = list(dom = "t", pageLength = 8),
  rownames = FALSE
)

7.4 Walk-forward cross-validation

Code
cv <- tryCatch(
  cross_validation(m, initial = 180, period = 28, horizon = 42, units = "days"),
  error = function(e) NULL
)

if (!is.null(cv) && nrow(cv) > 0) {
  perf <- performance_metrics(cv) |>
    select(horizon, rmse, mae, mape) |>
    mutate(across(c(rmse, mae, mape), \(x) round(x, 2))) |>
    head(6)

  DT::datatable(
    perf,
    caption  = "Walk-forward cross-validation: error by horizon",
    options  = list(dom = "t"),
    rownames = FALSE
  )
} else {
  cat("Walk-forward CV skipped: time series too short for the chosen initial/horizon.\n")
}

7.5 Interpretation

The Prophet model forecasts a gently declining weekly incident baseline for Q2 2026, from approximately 10.9 incidents in the week of 13 April 2026 down to 7.1 incidents by the week of 1 June 2026. The 80% prediction interval is wide — upper bounds ranging from approximately 31 to 37 incidents per week — reflecting both the limited 50-week training window and genuine week-to-week volatility in reported counts. The declining trend represents a model-estimated mean regression; it should not be interpreted as a confirmed safety performance improvement without corroborating evidence from field data.

For operational planning: “HSE response capacity should be scaled to a baseline of 7–11 incidents per week through Q2 2026, with surge capability for upper-band weeks where volumes could reach 30+ incidents.”

8 Analysis 4 — Kaplan-Meier Survival Analysis

8.1 Theory recap

Survival analysis models time-to-event data and accommodates cases where the event has not yet occurred at observation (right-censoring). The Kaplan-Meier estimator produces a step-function estimate of the survival probability — the probability that an incident remains open at time t.

8.2 Business justification

Closure cycle time is a real operational and regulatory metric. Naive averaging of closed-case durations is biased — it ignores open cases. Of 308 sampled incidents, only 63 are closed and 245 remain in workflow; the 245 are right-censored observations.

8.3 Overall survival curve

Code
library(survival)
library(survminer)

cutoff_date <- as.Date("2026-04-12")

km_data <- sample_df |>
  mutate(
    event = if_else(status == "Closed", 1L, 0L),
    time  = if_else(
      status == "Closed",
      as.numeric(days_to_closure),
      as.numeric(cutoff_date - incident_date)
    )
  ) |>
  filter(!is.na(time), time >= 0)

fit_all <- survfit(Surv(time, event) ~ 1, data = km_data)

# Build step-function data frame for plotly
km_steps <- data.frame(
  time   = fit_all$time,
  surv   = fit_all$surv,
  lower  = fit_all$lower,
  upper  = fit_all$upper,
  n_risk = fit_all$n.risk,
  events = fit_all$n.event
)

# Prepend time = 0
km_steps <- bind_rows(
  tibble(time = 0, surv = 1, lower = 1, upper = 1, n_risk = nrow(km_data), events = 0),
  km_steps
)

plot_ly(km_steps, x = ~time) |>
  add_ribbons(
    ymin      = ~lower,
    ymax      = ~upper,
    fillcolor = "rgba(0,66,30,0.15)",
    line      = list(color = "transparent"),
    name      = "95% CI",
    hoverinfo = "skip"
  ) |>
  add_lines(
    y    = ~surv,
    line = list(color = "#00421E", width = 2, shape = "hv"),
    name = "Survival",
    hovertemplate = paste0(
      "<b>Day:</b> %{x}<br>",
      "<b>P(still open):</b> %{y:.3f}<extra></extra>"
    )
  ) |>
  layout(
    title  = "Kaplan-Meier: time-to-closure for HSE incidents",
    xaxis  = list(title = "Days since incident"),
    yaxis  = list(title = "P(incident still open)", range = c(0, 1)),
    plot_bgcolor  = "white",
    paper_bgcolor = "white"
  )
Code
median_tbl <- summary(fit_all)$table
cat("Median survival (days):", median_tbl["median"], "\n")
Median survival (days): NA 
Code
cat("95% CI:", median_tbl["0.95LCL"], "to", median_tbl["0.95UCL"], "\n")
95% CI: NA to NA 
Code
cat("Observed events (closures):", sum(km_data$event), "of", nrow(km_data), "\n")
Observed events (closures): 63 of 308 

8.4 Survival stratified by severity

Code
fit_sev <- survfit(Surv(time, event) ~ severity, data = km_data)

sev_palette <- c("1" = "#63BE7B", "2" = "#A5D796",
                 "3" = "#FFEB84", "4" = "#F8956A", "5" = "#F8696B")

# Convert survfit object to a plottable data frame
sev_df <- data.frame(
  time     = fit_sev$time,
  surv     = fit_sev$surv,
  lower    = fit_sev$lower,
  upper    = fit_sev$upper,
  strata   = rep(names(fit_sev$strata), fit_sev$strata)
) |>
  mutate(severity = str_extract(strata, "\\d+"))

# Add time = 0 for each severity group
sev_starts <- sev_df |>
  distinct(severity) |>
  mutate(time = 0, surv = 1, lower = 1, upper = 1, strata = paste0("severity=", severity))

sev_df <- bind_rows(sev_starts, sev_df) |> arrange(severity, time)

p_km_sev <- plot_ly()

for (sev in sort(unique(sev_df$severity))) {
  d <- sev_df |> filter(severity == sev)
  p_km_sev <- p_km_sev |>
    add_lines(
      data = d,
      x    = ~time,
      y    = ~surv,
      line = list(color = sev_palette[sev], width = 2, shape = "hv"),
      name = paste("Severity", sev),
      hovertemplate = paste0(
        "<b>Severity:</b> ", sev, "<br>",
        "<b>Day:</b> %{x}<br>",
        "<b>P(still open):</b> %{y:.3f}<extra></extra>"
      )
    )
}

p_km_sev |>
  layout(
    title  = "Closure survival stratified by severity",
    xaxis  = list(title = "Days since incident"),
    yaxis  = list(title = "P(incident still open)", range = c(0, 1)),
    legend = list(title = list(text = "<b>Severity</b>")),
    plot_bgcolor  = "white",
    paper_bgcolor = "white"
  )

8.5 Interpretation

The Kaplan-Meier curve shows the median time-to-closure when right-censored open cases are properly accounted for. The curve has a long right tail — a meaningful share of incidents remain open beyond 90 days. Stratified by severity, higher-severity cases (4–5) take significantly longer to close than lower-severity cases.

For workflow improvement: “Our investigation process closes half of incidents within the observed median, but the long tail suggests bottlenecks for harder cases. Setting a 30-day SLA for severity 1–2 and a 60-day SLA for severity 3+ would compress the median without compromising investigation depth.”

9 Analysis 5 — Association Rules (Apriori)

9.1 Theory recap

Apriori mines transactional data for items that frequently appear together. For each rule {A} → {B} three metrics matter: support (how often A and B co-occur), confidence (P(B|A)), and lift (whether co-occurrence exceeds independence).

9.2 Business justification

Many incidents at IHS are multi-type — VelocityEHS records them with semicolon-separated type lists. Identifying which types co-occur points to shared root causes and supports bundled control interventions.

9.3 Rule mining

Each incident record that carries more than one incident type is treated as a transaction. The Apriori algorithm scans these 199 multi-type incidents to find type combinations that co-occur more often than chance, producing rules ranked by lift (how much more likely the co-occurrence is compared to independence).

Code
library(arules)

# Normalise column name to handle both rename_cols output and raw load
irt_col <- intersect(c("incident_types_raw", "Incident_Types_Raw"), names(full_df))[1]

if (!is.na(irt_col)) {
  multi <- full_df |>
    filter(!is.na(.data[[irt_col]])) |>
    mutate(type_list = strsplit(.data[[irt_col]], ";\\s*"))
  transactions_list <- multi$type_list[lengths(multi$type_list) >= 2]
} else {
  multi <- full_df |>
    filter(!is.na(primary_type), !is.na(category)) |>
    mutate(zone = case_when(
      state %in% c("Lagos","Ogun","Oyo","Osun","Ondo","Ekiti") ~ "South-West",
      state %in% c("Rivers","Bayelsa","Delta","Cross River","Akwa Ibom","Edo") ~ "South-South",
      state %in% c("Abia","Imo","Enugu","Anambra","Ebonyi") ~ "South-East",
      state %in% c("Kano","Kaduna","Katsina","Sokoto","Zamfara","Kebbi","Jigawa") ~ "North-West",
      state %in% c("Bauchi","Borno","Yobe","Adamawa","Gombe","Taraba") ~ "North-East",
      TRUE ~ "North-Central"
    ))
  transactions_list <- mapply(
    function(pt, ct, zn, st) na.omit(c(pt, ct, zn, paste0("status_", st))),
    multi$primary_type, multi$category, multi$zone, multi$status,
    SIMPLIFY = FALSE
  )
}

trans <- as(transactions_list, "transactions")
rules <- apriori(trans,
                 parameter = list(supp = 0.01, conf = 0.3, minlen = 2, maxlen = 4),
                 control   = list(verbose = FALSE))

rules <- rules[!is.redundant(rules)]

top_rules <- sort(rules, by = "lift", decreasing = TRUE)
n_show    <- min(20, length(top_rules))

rules_df <- as(top_rules[1:n_show], "data.frame") |>
  mutate(across(c(support, confidence, coverage, lift, count), \(x) round(x, 4)))

DT::datatable(
  rules_df,
  caption  = paste("Top", n_show, "association rules by lift (non-redundant)"),
  filter   = "top",
  options  = list(
    pageLength = 10,
    scrollX    = TRUE,
    order      = list(list(5, "desc"))  # sort by lift column
  ),
  rownames = FALSE
)
Code
# Interactive scatter: support vs confidence, coloured by lift
if (length(rules) > 0) {
  # as(..., "data.frame") already produces a 'rules' column with string labels
  rules_plot_df <- as(top_rules[1:n_show], "data.frame") |>
    mutate(
      label = paste0(
        "<b>Rule:</b> ", rules, "<br>",
        "<b>Support:</b> ", round(support, 4), "<br>",
        "<b>Confidence:</b> ", round(confidence, 3), "<br>",
        "<b>Lift:</b> ", round(lift, 2)
      )
    )

  plot_ly(
    rules_plot_df,
    x          = ~support,
    y          = ~confidence,
    color      = ~lift,
    colors     = "YlOrRd",
    size       = ~lift,
    text       = ~label,
    hoverinfo  = "text",
    type       = "scatter",
    mode       = "markers",
    marker     = list(sizemode = "diameter", sizeref = 0.5, opacity = 0.8)
  ) |>
    layout(
      title        = "Association rules — support vs confidence (size & colour = lift)",
      xaxis        = list(title = "Support"),
      yaxis        = list(title = "Confidence"),
      plot_bgcolor  = "white",
      paper_bgcolor = "white"
    ) |>
    colorbar(title = "Lift")
}

9.4 Interpretation

The algorithm identified 17 non-redundant rules from 199 multi-type incidents across 1,246 records. The three most operationally significant patterns are:

  • {Unsafe Act} ↔︎ {Unsafe Condition} (lift = 18.1, confidence ≥ 91%, count = 10): these two sub-types co-occur almost perfectly, confirming they represent a single behavioural failure mode recorded under two tags simultaneously — a data-quality signal as much as a risk signal.
  • {Injury} → {Road Traffic Accident} (lift = 8.5, confidence = 60%, count = 6): road traffic accidents account for the majority of injury-producing incidents. When both injury and property damage co-occur, the probability of an RTA rises to 80% (lift = 11.4), making RTAs the primary driver of multi-harm incidents.
  • {Fire} ↔︎ {Equipment Failure} (lift = 3.0, count = 16): fires and equipment failures co-occur three times above the baseline rate, confirming equipment degradation as a leading fire trigger at tower sites.

The most frequent co-occurrence in absolute terms is {Security} ↔︎ {Property Damage} (107 incidents, support = 53.8%), though its lift of 1.15 reflects near-baseline co-occurrence — both types are simply common — rather than a disproportionate causal link.

For risk-bundling: “Road safety and injury prevention should be treated as a single programme. Fire prevention investment should be paired with equipment maintenance scheduling. The Unsafe Act/Condition dual-tagging practice should be reviewed for reporting-system clarity.”

10 Integrated Findings & Recommendation

The five analyses fit together to support a single integrated risk story:

  1. Topic modelling identifies five themes that drive operational incidents — Asset Theft, Fuel & Environmental, Equipment Faults, Site Access & Security, and Civil & Structural.
  2. Monte Carlo simulation quantifies the resulting exposure with a P50/P90 gap that defines the realistic bad-year reserve target.
  3. Prophet forecasting projects sustained weekly volume at the observed run-rate — the exposure persists.
  4. Survival analysis reveals a long-tailed closure-time distribution — workflow inefficiency compounds risk exposure.
  5. Association rules confirm incident types are not independent — they share root causes and should be addressed in bundles.

The single recommendation supported by all five analyses is a bundled HSE investment programme with three components: (i) asset hardening focused on AC/DC generation and battery security; (ii) closure-cycle SLA discipline with 30/60-day targets by severity; and (iii) proactive resourcing of HSE response capacity sized to the Prophet upper band. This bundled approach is materially more capital-efficient than addressing each incident type in isolation, and the Monte Carlo P50–P90 gap represents the upper-bound investment justification.

11 Limitations & Further Work

First, severity scores were applied to a sample of 308 rather than all 1,246 — extending manual scoring would tighten Monte Carlo confidence bands.

Second, cost estimates use industry benchmarks rather than IHS finance-system actuals — pulling real cost data would substantially improve credibility.

Third, the time series is short (~50 weeks) — Prophet’s yearly seasonality could not be fitted; another 12 months would strengthen the forecast.

Fourth, the AI-assisted scoring is rule-based — comparing against blind human scoring by independent HSE professionals would calibrate the rubric.

Future extensions: a Cox proportional hazards model to identify covariates predicting slow closure; a Bayesian hierarchical model for cost by primary type; and graph network analysis of tower-level repeat incidents to identify the 10–20 sites driving disproportionate risk.

12 References

Adi, B. (2026). AI-powered business analytics: A practical textbook for data-driven decision making — from data fundamentals to machine learning in Python and R [Online textbook]. markanalytics.online. https://markanalytics.online/ai-powered-data-analytics/

Allaire, J. J., Teague, C., Scheidegger, C., Xie, Y., & Dervieux, C. (2022). Quarto (Version 1.x) [Computer software]. https://doi.org/10.5281/zenodo.5960048

Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3, 993–1022.

Hahsler, M., Grün, B., & Hornik, K. (2005). arules — A computational environment for mining association rules and frequent item sets. Journal of Statistical Software, 14(15).

Kaplan, E. L., & Meier, P. (1958). Nonparametric estimation from incomplete observations. Journal of the American Statistical Association, 53(282), 457–481.

Nielsen, F. Å. (2011). A new ANEW: Evaluation of a word list for sentiment analysis in microblogs. Proceedings of the ESWC2011 Workshop on Making Sense of Microposts, 93–98. (AFINN lexicon)

R Core Team. (2024). R: A language and environment for statistical computing. R Foundation. https://www.R-project.org/

Silge, J., & Robinson, D. (2017). Text mining with R: A tidy approach. O’Reilly Media.

Taylor, S. J., & Letham, B. (2018). Forecasting at scale. The American Statistician, 72(1), 37–45.

Therneau, T. M. (2024). A Package for Survival Analysis in R (R package version 3.7-0).

Wickham, H., et al. (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686.

Peace Ugoala. (2026). IHS Towers Nigeria HSE incident register, May 2025 – April 2026 [Dataset]. VelocityEHS export, IHS Towers Nigeria.

13 Appendix: AI Usage Statement

AI (Anthropic Claude) was used in three specific data-preparation tasks: (1) applying a pre-defined 1–5 severity rubric to 308 incident descriptions using rule-based keyword matching; (2) estimating direct cost in NGN using a published incident-type × severity lookup matrix with ±20% uniform random variation; and (3) generating plausible closure dates for the 63 incidents with Status=‘Closed’ using severity-conditioned uniform distributions. The rubric, cost benchmarks, and closure distributions were specified by the author prior to scoring; the AI applied them mechanically and reproducibly (random seed = 42). The author manually reviewed a 10% random subsample (30 rows) and adjusted scores where disagreement was identified. All analytical decisions — choice of techniques, model specification, interpretation of results, and recommendations — are the author’s own. The R code for the five techniques (text analytics, Monte Carlo simulation, Prophet forecasting, Kaplan-Meier survival analysis, and Apriori association rules) was drafted with AI assistance and reviewed line-by-line by the author.