Legal Risk Optimisation at Petro Nigeria Limited

Five Advanced Analytics Techniques Applied to the Active Litigation Portfolio

Author

Head of Litigation, Petro Nigeria Limited (PNL)

Published

June 9, 2026

1 Executive Summary

Petro Nigeria Limited (PNL) faces an active litigation portfolio of 237 cases spanning multiple courts across Nigeria, with 61 cases (26%) currently lacking assigned outside counsel. The legal team’s core challenge is fivefold: understanding what drives litigation outcomes (text and pattern analysis), estimating how much financial exposure the portfolio represents (simulation), anticipating when new disputes will arise (forecasting), determining who should handle each case (people analytics), and ensuring workload is allocated optimally across the panel of approved firms (optimisation).

This report applies five advanced analytics techniques to PNL’s litigation register. Text analytics on closed-case remarks identifies outcome-predictive language and dispute patterns. A three-stage Monte Carlo simulation estimates financial exposure: the portfolio carries a median annual exposure of approximately ₦0.11 billion, rising to about ₦4.3 billion at the 95th percentile and ₦24.3 billion at the 99th (Table 4), figures that should inform provisioning decisions. An ARIMA(0,1,1) time-series model forecasts roughly one new case per month for 2025. Counsel workload analysis reveals concerning concentration, with Henry Yekovie & Co. carrying the single largest active caseload (32 cases). Finally, a linear-programming model assigns all 61 unassigned cases across panel firms while respecting capacity constraints.

The integrated recommendation is to activate a triage-and-assign protocol immediately, prioritising high-exposure cases for senior panel firms before the next financial reporting period.

2 Professional Disclosure

Job Title: Head of Litigation, Petro Nigeria Limited (PNL)

Organisation Type / Sector: Oil and Gas — In-house legal department of a Nigerian upstream oil and gas company operating under licences granted by the Nigerian Upstream Petroleum Regulatory Commission (NUPRC).

Operational relevance of each technique:

Text Analytics: Case remarks and pleadings contain unstructured narrative that is never systematically mined. Applying TF-IDF analysis to the remark field of closed cases surfaces recurring language patterns (e.g. “struck out”, “dismissed”, “community”) that correlate with specific dispute categories and court outcomes. This directly supports early-case assessment and settlement strategy.
Monte Carlo Simulation: Nigerian litigation claims range from a few million to several billion naira, and outcomes are highly uncertain. A probabilistic simulation that incorporates claim-filing rates, loss probabilities, and settlement discounts converts this uncertainty into a risk-quantified exposure distribution, which is essential for IAS 37 provisioning and annual budgeting.
Advanced Forecasting: Legal team headcount, outside counsel budget, and court registry filings all require forward planning. A statistically rigorous time-series model of monthly case intake gives the legal department defensible projections when negotiating budgets with the CFO.
People Analytics: Outside counsel are professional relationships and scarce resources. Understanding each firm’s current caseload, historical performance by dispute category, and concentration risk enables informed briefing decisions rather than default re-briefing of familiar names.
Optimisation: With 61 unassigned cases and a finite panel of firms operating under capacity constraints, manual assignment is error-prone and potentially biased. Linear programming maximises a total quality score subject to per-firm capacity ceilings (which also serve as the anti-concentration constraint), replacing guesswork with a principled allocation.

3 Data Collection and Sampling

Source: Internal litigation register maintained by PNL’s legal department in Microsoft Excel format (Litigation.xlsx).

Sheets and structure:

Closed Cases: 196 resolved cases spanning 2018–2024 (after removal of section-header rows). Variables include case name, suit number, narrative remark, date closed, date received, outside counsel, claimed amount, and counsel fee.
New Cases: 237 active cases. Variables include case name, suit number, date received, and assigned outside counsel.

Collection method: Administrative records captured by in-house paralegal staff as cases are opened and resolved. Dates are stored as Excel serial numbers.
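
Because the register stores dates as Excel serial numbers, they have to be converted before any date arithmetic. The following is a minimal base-R sketch, assuming the workbook uses Excel's default 1900 date system (for which R's origin is 1899-12-30). The helper name `excel_serial_to_date`, the plausibility window, and the sample serials are illustrative assumptions, not the report's own import code.

```r
# Convert Excel serial numbers to Date, assuming the 1900 date system.
# Hypothetical helper: not part of the original analysis script.
excel_serial_to_date <- function(serial,
                                 min_date = as.Date("2017-01-01"),
                                 max_date = as.Date("2024-12-31")) {
  d <- as.Date(as.numeric(serial), origin = "1899-12-30")
  # Dates outside the plausible window are set to missing, mirroring how
  # the forecasting section treats suspected data-entry errors.
  d[!is.na(d) & (d < min_date | d > max_date)] <- NA
  d
}

serials <- c(42795, 45000, NA, 47027)   # made-up register values
excel_serial_to_date(serials)
```

The window bounds are arguments rather than constants so that the same helper can be reused if the analysis period changes.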

Sampling frame: The register is a census (not a sample) of all matters in which PNL is a party, though completeness cannot be independently verified. Of the 237 active cases, 28 (about 12%) lack a date-received entry and 61 (about 26%) have no counsel assigned.

Time period: March 2017 to October 2028 in the active set (Table 1). Some dates appear to be data-entry errors; these are treated as missing in the forecasting model, which uses only the 2017–2024 window.

Ethical considerations: All data relates to corporate litigation and contains no personal health or financial data attributable to private individuals beyond what appears on public court records. No informed-consent requirement arises. Case names and suit numbers are matters of public record in Nigerian courts. The dataset has been handled in a password-protected corporate environment consistent with PNL’s data governance policy.

4 Data Description

Code

tibble(
  Metric = c("Total active cases","Closed cases (2018–2024)",
             "Active cases with no counsel assigned",
             "Active cases with no date received",
             "Distinct outside-counsel firms (active)",
             "Earliest date in active set",
             "Latest date in active set"),
  Value = c(nrow(new_cases),
            nrow(closed),
            sum(new_cases$counsel == "Unassigned"),
            sum(is.na(new_cases$date_received)),
            n_distinct(new_cases$counsel[new_cases$counsel != "Unassigned"]),
            format(min(new_cases$date_received, na.rm = TRUE), "%b %Y"),
            format(max(new_cases$date_received, na.rm = TRUE), "%b %Y"))
) |>
  kbl(caption = "Table 1: Portfolio overview") |>
  kable_styling(bootstrap_options = c("striped","hover"), full_width = FALSE)

Table 1: Portfolio overview
Metric	Value
Total active cases	237
Closed cases (2018–2024)	196
Active cases with no counsel assigned	61
Active cases with no date received	28
Distinct outside-counsel firms (active)	40
Earliest date in active set	Mar 2017
Latest date in active set	Oct 2028

Code

new_cases |>
  count(dispute_cat, name = "n") |>
  arrange(desc(n)) |>
  mutate(share = percent(n / sum(n), 0.1)) |>
  rename(`Dispute category` = dispute_cat, Active = n, Share = share) |>
  kbl(caption = "Table 2: Active cases by dispute category") |>
  kable_styling(bootstrap_options = c("striped","hover"), full_width = FALSE)

Table 2: Active cases by dispute category
Dispute category	Active	Share
Declaratory / Procedural	190	80.2%
Community / Chieftaincy	39	16.5%
Tax / Regulatory	6	2.5%
Enforcement / Garnishee	1	0.4%
Land / Property	1	0.4%

Code

new_cases |>
  count(court_type) |>
  mutate(court_type = fct_reorder(court_type, n)) |>
  ggplot(aes(n, court_type)) +
  geom_col(fill = "#2c7fb8") +
  geom_text(aes(label = n), hjust = -0.2, fontface = "bold") +
  scale_x_continuous(expand = expansion(c(0, 0.12))) +
  labs(title = "Active Cases by Court Type", x = "Number of cases", y = NULL) +
  theme_minimal(base_size = 12)

Figure 1: Active cases by court type

5 Text Analytics

5.1 Theory

Text analytics uses computational linguistics to extract meaning from unstructured text. TF-IDF (Term Frequency–Inverse Document Frequency) weights a word by how often it appears in a document, discounted by how many documents in the collection contain it, thereby surfacing terms that are distinctive to a particular group rather than merely common. In this analysis each court type is treated as one document, so the method identifies vocabulary that characterises specific court types (Silge & Robinson, 2017).
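
To make the weighting concrete, here is a small base-R sketch that computes TF-IDF by hand on a toy table of term counts. The counts, the court abbreviations, and the helper name `tf_idf_table` are made up for illustration. The natural-log definition of IDF used here is the one I understand `bind_tf_idf()` to apply, but that should be checked against the tidytext documentation.

```r
# Toy term counts: one row per (document, term). A "document" here is a
# court type. All values are hypothetical.
counts <- data.frame(
  doc  = c("FHC", "FHC", "FHC", "SHC", "SHC", "SHC"),
  term = c("tax", "dismissed", "levy", "community", "dismissed", "land"),
  n    = c(4, 3, 1, 5, 3, 2),
  stringsAsFactors = FALSE
)

tf_idf_table <- function(counts) {
  n_docs    <- length(unique(counts$doc))
  doc_total <- tapply(counts$n, counts$doc, sum)        # tokens per document
  doc_freq  <- tapply(counts$doc, counts$term,
                      function(d) length(unique(d)))    # documents containing each term
  counts$tf     <- counts$n / as.numeric(doc_total[as.character(counts$doc)])
  counts$idf    <- log(n_docs / as.numeric(doc_freq[as.character(counts$term)]))
  counts$tf_idf <- counts$tf * counts$idf
  counts
}

tf_idf_table(counts)
```

A term that occurs in every document (here "dismissed") receives an IDF of zero and drops out, which is why TF-IDF highlights distinctive rather than frequent vocabulary.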

5.2 Business Justification

PNL’s remark field contains a rich narrative of procedural history for each closed case but has never been mined systematically. Identifying which terms correlate with favourable outcomes (e.g. “struck out”, “dismissed”) versus prolonged litigation (“adjourned”, “community”) supports early-case classification, improving settlement timing and resource prioritisation.

5.3 Analysis

Code

# Custom legal stop-words
legal_sw <- tibble(word = c(
  "the","of","and","in","to","a","is","was","on","for","this","that","by",
  "be","with","matter","court","case","plaintiff","defendant","parties",
  "pnl","petro","nigeria","limited","judgement","judgment","honourable",
  "justice","learned","counsel","suit","action","v","ors","anor",
  "january","february","march","april","may","june","july","august",
  "september","october","november","december",
  "2018","2019","2020","2021","2022","2023","2024","2025","2017","2016","2015",
  "trial","hearing","date","next","its","it","from","at","are","an","as",
  "has","had","been","have","which","their","his","her","they","were",
  "also","above","order","ordered","further"
))

remark_tokens <- closed |>
  filter(!is.na(remark), nchar(remark) > 10) |>
  select(case_name, court_type, remark) |>
  unnest_tokens(word, remark) |>
  anti_join(legal_sw, by = "word") |>
  anti_join(stop_words, by = "word") |>
  filter(str_detect(word, "^[a-z]{3,}$"))

Code

word_freq <- remark_tokens |> count(word, sort = TRUE)

word_freq |>
  head(20) |>
  mutate(word = fct_reorder(word, n)) |>
  ggplot(aes(x = n, y = word)) +
  geom_col(fill = "#4dac26") +
  labs(title = "Top 20 Terms in Closed-Case Remarks",
       subtitle = "After removal of legal boilerplate stop-words",
       x = "Frequency", y = NULL) +
  theme_minimal(base_size = 13)

Figure 2: Top 20 terms in closed-case remarks (after stop-word removal)

Code

tfidf_court <- remark_tokens |>
  count(court_type, word) |>
  bind_tf_idf(word, court_type, n) |>
  arrange(court_type, desc(tf_idf))

tfidf_court |>
  group_by(court_type) |>
  slice_max(tf_idf, n = 5, with_ties = FALSE) |>
  ungroup() |>
  mutate(word = reorder_within(word, tf_idf, court_type)) |>
  ggplot(aes(x = tf_idf, y = word, fill = court_type)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ court_type, scales = "free_y", ncol = 2) +
  scale_y_reordered() +
  scale_fill_brewer(palette = "Set2") +
  labs(title = "Distinctive Vocabulary by Court Type",
       subtitle = "Highest TF-IDF terms within each court type",
       x = "TF-IDF score", y = NULL) +
  theme_minimal(base_size = 12)

Figure 3: Top distinctive (TF-IDF) terms by court type

5.4 Interpretation

The frequency chart identifies the most common vocabulary in closed-case remarks: procedural verbs (“struck”, “dismissed”, “settled”), geographically loaded nouns (“community”, “land”), and trial mechanics (“adjourned”, “ruling”). The faceted TF-IDF chart (Figure 3) sharpens this picture by isolating language that is distinctive to each court type. Federal High Court remarks lean toward regulatory and tax vocabulary; State High Court remarks toward community and land-tenure disputes; Court of Appeal remarks toward procedural language. This vocabulary map is the input to a proactive triage protocol: a new case can be roughly classified from the language of its pleadings before a full review.

6 Monte Carlo Simulation

6.1 Theory

Monte Carlo simulation estimates the probability distribution of an uncertain quantity by repeatedly drawing random samples from assumed input distributions and recording the aggregate outcome (Vose, 2008). Here, three sources of uncertainty compound: (1) whether a given active case will carry a quantified financial claim; (2) whether PNL will lose or settle that case; and (3) the actual monetary quantum paid. Combining 10,000 simulation runs generates a full exposure distribution from which Value-at-Risk (VaR) at the 95th and 99th percentiles can be extracted.
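
The three stages can be written as a single loss model. The notation below (the symbols $L$, $C_k$, $W_k$, $X_k$ and $\delta$) is my own shorthand for the quantities described above, and it assumes the three draws are independent of one another and across cases, as they are in the simulation code in Section 6.3.

```latex
\[
L \;=\; \delta \sum_{k=1}^{n} C_k \, W_k \, X_k ,
\qquad
C_k \sim \mathrm{Bernoulli}(p_{\text{claim}}), \quad
W_k \sim \mathrm{Bernoulli}(p_{\text{loss}}), \quad
X_k \sim \mathrm{Lognormal}(\mu, \sigma^{2})
\]
% n       : number of active cases
% C_k     : 1 if case k carries a quantified claim (stage 1)
% W_k     : 1 if PNL loses or settles case k (stage 2)
% X_k     : claimed amount for case k (stage 3)
% \delta  : share of the claimed amount actually paid
```

Each simulation run draws one value of $L$; the percentiles of the 10,000 simulated values give the median and VaR figures.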

6.2 Business Justification

IAS 37 (Provisions, Contingent Liabilities and Contingent Assets) requires a company to recognise a provision when it has a present obligation arising from a past event, an outflow of resources is probable (more likely than not), and the amount can be reliably estimated. A Monte Carlo model translates PNL’s litigation portfolio into a probabilistic loss distribution, providing both the central estimate (for provision) and tail estimates (for sensitivity disclosure). The three-stage model structure explicitly reflects Nigerian litigation patterns: many active cases never carry a formal monetary claim, and of those that do, PNL historically settles at a discount.

6.3 Analysis

Code

# ── Stage parameters from closed-case history ──────────────────────────
non_zero_claims <- closed |>
  filter(claim_ngn > 0, !is.na(claim_ngn)) |>
  pull(claim_ngn)

# Trim the top 2.5% of claims to reduce single-case outlier dominance
claims_trim <- non_zero_claims[
  non_zero_claims <= quantile(non_zero_claims, 0.975, na.rm = TRUE)
]

log_mean_t      <- mean(log(claims_trim), na.rm = TRUE)
log_sd_t        <- sd(log(claims_trim),   na.rm = TRUE)
prob_claim_yn   <- length(non_zero_claims) / nrow(closed)   # share of cases carrying a claim
prob_loss       <- 0.40                                       # conservative loss / settle rate (hard-coded, not estimated in this script)
settlement_disc <- 0.20                                       # share of the claimed amount actually paid (not the discount itself)
n_active        <- nrow(new_cases)

tibble(Parameter = c("Probability a case has a quantified claim",
                     "Probability PNL loses / settles a case (conservative)",
                     "Share of claim paid on lost / settled cases",
                     "log-mean of claim distribution (trimmed)",
                     "log-sd of claim distribution (trimmed)",
                     "Active cases simulated"),
       Value = c(round(prob_claim_yn, 3),
                 prob_loss,
                 settlement_disc,
                 round(log_mean_t, 2),
                 round(log_sd_t, 2),
                 n_active)) |>
  kbl(caption = "Table 3: Monte Carlo input parameters") |>
  kable_styling(bootstrap_options = c("striped","hover"), full_width = FALSE)

Table 3: Monte Carlo input parameters
Parameter	Value
Probability a case has a quantified claim	0.046
Probability PNL loses / settles a case (conservative)	0.400
Share of claim paid on lost / settled cases	0.200
log-mean of claim distribution (trimmed)	16.600
log-sd of claim distribution (trimmed)	3.180
Active cases simulated	237.000

Code

# ── Three-stage simulation ────────────────────────────────────────────
set.seed(8321)
sim_totals <- replicate(10000, {
  has_claim <- rbinom(n_active, 1, prob_claim_yn)                # Stage 1
  loses     <- rbinom(n_active, 1, prob_loss)                    # Stage 2
  amounts   <- rlnorm(n_active, log_mean_t, log_sd_t)            # Stage 3
  sum(has_claim * loses * amounts * settlement_disc)
})

var_95 <- quantile(sim_totals, 0.95)
var_99 <- quantile(sim_totals, 0.99)
med_v2 <- median(sim_totals)

tibble(
  Statistic = c("Median exposure", "90th percentile",
                "95th percentile (VaR 95)", "99th percentile (VaR 99)"),
  `NGN Billions` = round(
    c(med_v2, quantile(sim_totals, 0.90), var_95, var_99) / 1e9, 2)
) |>
  kbl(caption = "Table 4: Monte Carlo exposure distribution (10,000 simulations)") |>
  kable_styling(bootstrap_options = c("striped","hover"), full_width = FALSE)

Table 4: Monte Carlo exposure distribution (10,000 simulations)
Statistic	NGN Billions
Median exposure	0.11
90th percentile	1.88
95th percentile (VaR 95)	4.33
99th percentile (VaR 99)	24.30

Code

tibble(total_bn = sim_totals / 1e9) |>
  ggplot(aes(x = total_bn)) +
  geom_histogram(bins = 60, fill = "#d7191c", alpha = 0.7, colour = "white") +
  geom_vline(xintercept = med_v2 / 1e9, linetype = "dashed",
             colour = "#2c7bb6", linewidth = 0.9) +
  geom_vline(xintercept = var_95 / 1e9, linetype = "dotdash",
             colour = "#fdae61", linewidth = 0.9) +
  geom_vline(xintercept = var_99 / 1e9, linetype = "solid",
             colour = "#1a1a1a", linewidth = 0.9) +
  annotate("text", x = med_v2 / 1e9, y = Inf, vjust = 2,
           label = paste0("Median  ~NGN ", round(med_v2 / 1e9, 1), "B"),
           colour = "#2c7bb6", fontface = "bold", hjust = -0.05) +
  annotate("text", x = var_95 / 1e9, y = Inf, vjust = 4,
           label = paste0("VaR 95  ~NGN ", round(var_95 / 1e9, 1), "B"),
           colour = "#b45f06", fontface = "bold", hjust = -0.05) +
  labs(title = "Annual Litigation Exposure — Simulated Distribution",
       subtitle = "10,000 Monte Carlo runs across the active portfolio",
       x = "Total annual exposure (NGN billions)",
       y = "Frequency") +
  coord_cartesian(xlim = c(0, quantile(sim_totals / 1e9, 0.995))) +
  theme_minimal(base_size = 12)

Figure 4: Simulated annual portfolio exposure with VaR markers

6.4 Interpretation

The simulated distribution is heavily right-skewed: the median annual exposure sits at about ₦0.1 billion, but the 95th percentile climbs to ₦4.3 billion and the 99th to ₦24.3 billion. This shape is the financial signature of a portfolio in which a small number of very large claims sit alongside a long run of routine matters. The Board should provision against the central estimate but disclose the VaR figures separately as sensitivity. The implied IAS 37 provision sensitivity is the gap between the median and VaR 95, about ₦4.2 billion, which is large enough that even a modest reduction in the assumed loss rate (through earlier settlement) translates into a material balance-sheet benefit.
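
One way to see where the skew comes from is to compare the mean and the median of the assumed claim-size distribution. The sketch below is a rough analytic cross-check, not part of the original simulation: it plugs the rounded inputs from Table 3 into the closed-form lognormal mean and assumes the three stages are independent, as the simulation code does.

```r
# Inputs copied from Table 3 (rounded, so results are approximate).
n_active        <- 237
prob_claim_yn   <- 0.046
prob_loss       <- 0.40
settlement_disc <- 0.20    # share of the claim actually paid
log_mean_t      <- 16.60
log_sd_t        <- 3.18

# Lognormal facts: median = exp(mu), mean = exp(mu + sigma^2 / 2)
median_claim <- exp(log_mean_t)
mean_claim   <- exp(log_mean_t + log_sd_t^2 / 2)

# Expected total payout = cases x P(claim) x P(loss) x share paid x mean claim
expected_total <- n_active * prob_claim_yn * prob_loss *
  settlement_disc * mean_claim

cat("Mean / median ratio of a single claim:",
    round(mean_claim / median_claim), "\n")
cat("Analytic expected total (NGN bn):",
    round(expected_total / 1e9, 2), "\n")
```

With these inputs the analytic mean lies well above the simulated median in Table 4, so the choice of central estimate (median or mean) matters for the provision and is worth stating explicitly.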

7 Advanced Forecasting

7.1 Theory

Autoregressive Integrated Moving Average (ARIMA) models combine three components: autoregressive terms, differencing (the “integrated” part, which removes trend so that the series becomes stationary), and moving-average terms. The fitted model yields point forecasts together with prediction intervals (Box, Jenkins, & Reinsel, 2015). auto.arima() from the forecast package chooses the differencing order d using unit-root tests and then selects p and q by minimising an information criterion (AICc by default).
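
For reference, the general ARIMA(p, d, q) model can be written with the backshift operator $B$ (where $B y_t = y_{t-1}$). The form below follows the sign convention that R's `arima()` uses for the moving-average terms; the symbols are standard textbook notation rather than anything defined elsewhere in this report.

```latex
\[
\phi(B)\,(1 - B)^{d}\, y_t \;=\; c + \theta(B)\,\varepsilon_t ,
\qquad
\phi(B) = 1 - \phi_1 B - \dots - \phi_p B^{p}, \quad
\theta(B) = 1 + \theta_1 B + \dots + \theta_q B^{q}
\]
% Special case p = 0, d = 1, q = 1 with no constant:
\[
y_t \;=\; y_{t-1} + \varepsilon_t + \theta_1\, \varepsilon_{t-1}
\]
```

In the special case, each month's intake equals the previous month's plus a new shock and a fraction of the previous shock.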

7.2 Business Justification

Forecasting monthly case intake enables PNL’s legal department to: (a) plan outside counsel retainer budgets before year-end; (b) request additional headcount in advance of peak filing periods; and (c) signal to the CFO whether litigation activity is structurally declining or simply reflecting temporary lulls. A credible statistical forecast is more defensible in budget negotiations than a simple year-on-year comparison.

7.3 Analysis

Code

# ── Build monthly time-series (2017–2024, both datasets) ──────────────
all_dates <- bind_rows(
  new_cases |>
    filter(!is.na(date_received),
           date_received >= as.Date("2017-01-01"),
           date_received <= as.Date("2024-12-31")) |>
    select(date_received),
  closed |>
    filter(!is.na(date_received),
           date_received >= as.Date("2017-01-01"),
           date_received <= as.Date("2024-12-31")) |>
    select(date_received)
) |>
  mutate(ym = floor_date(date_received, "month")) |>
  count(ym, name = "n_cases")

full_grid <- tibble(ym = seq(min(all_dates$ym), max(all_dates$ym), by = "month"))
monthly_ts_df <- full_grid |>
  left_join(all_dates, by = "ym") |>
  replace_na(list(n_cases = 0L))

ts_monthly <- ts(monthly_ts_df$n_cases,
                 start = c(year(min(monthly_ts_df$ym)),
                           month(min(monthly_ts_df$ym))),
                 frequency = 12)

cat("Series:", length(ts_monthly), "months |",
    format(min(monthly_ts_df$ym)), "to",
    format(max(monthly_ts_df$ym)), "\n")

Series: 94 months | 2017-03-01 to 2024-12-01

Code

fit_arima <- auto.arima(ts_monthly, stepwise = FALSE, approximation = FALSE)
fc2       <- forecast(fit_arima, h = 12)

cat("Selected model:", fc2$method, "\n")

Selected model: ARIMA(0,1,1)

Code

cat("AIC:",  round(fit_arima$aic, 2), "\n")

AIC: 452.12

Code

print(summary(fit_arima))

Series: ts_monthly 
ARIMA(0,1,1) 

Coefficients:
          ma1
      -0.7272
s.e.   0.0699

sigma^2 = 7.267:  log likelihood = -224.06
AIC=452.12   AICc=452.26   BIC=457.19

Training set error measures:
                     ME     RMSE      MAE  MPE MAPE      MASE        ACF1
Training set 0.01784861 2.666966 1.952812 -Inf  Inf 0.6814066 -0.01808414

Code

autoplot(fc2) +
  labs(title    = "Monthly Case Intake Forecast (ARIMA)",
       subtitle = paste0("Model: ", fc2$method,
                          " | 12-month horizon | 80% and 95% prediction intervals"),
       x = "Year", y = "New cases per month") +
  theme_minimal(base_size = 12) +
  theme(legend.position = "bottom")

Figure 5: Monthly case-intake forecast (12-month horizon)

Code

as_tibble(fc2) |>
  mutate(Month = format(seq(as.Date("2025-01-01"), by = "month", length.out = 12),
                        "%b %Y")) |>
  select(Month,
         `Point forecast` = `Point Forecast`,
         `80% lower`      = `Lo 80`,
         `80% upper`      = `Hi 80`,
         `95% lower`      = `Lo 95`,
         `95% upper`      = `Hi 95`) |>
  mutate(across(where(is.numeric), ~ round(.x, 1))) |>
  kbl(caption = "Table 5: 12-month ahead forecast — monthly case intake") |>
  kable_styling(bootstrap_options = c("striped","hover"), full_width = FALSE)

Table 5: 12-month ahead forecast — monthly case intake
Month	Point forecast	80% lower	80% upper	95% lower	95% upper
Jan 2025	1.1	-2.4	4.5	-4.2	6.4
Feb 2025	1.1	-2.5	4.6	-4.4	6.5
Mar 2025	1.1	-2.6	4.8	-4.6	6.7
Apr 2025	1.1	-2.8	4.9	-4.8	6.9
May 2025	1.1	-2.9	5.0	-5.0	7.1
Jun 2025	1.1	-3.0	5.1	-5.1	7.3
Jul 2025	1.1	-3.1	5.2	-5.3	7.4
Aug 2025	1.1	-3.2	5.3	-5.4	7.6
Sept 2025	1.1	-3.3	5.4	-5.6	7.7
Oct 2025	1.1	-3.4	5.5	-5.8	7.9
Nov 2025	1.1	-3.5	5.6	-5.9	8.0
Dec 2025	1.1	-3.6	5.7	-6.1	8.2

7.4 Interpretation

auto.arima() selects ARIMA(0,1,1), a first-order moving-average model on a once-differenced series. This is equivalent to simple exponential smoothing: each forecast is a weighted average of recent intake, and the forecast path is flat. Point forecasts sit at about 1.1 cases per month (roughly 13 per year). The prediction intervals are wide, and their negative lower bounds are an artefact of the Gaussian error assumption; since case counts cannot be negative, the intervals should be read as truncated at zero, giving a plausible range from zero to about six to eight cases in any month. For budgeting purposes the practical floor is 10–15 new cases per year; the wide intervals are themselves the message that legal-budget contingency should be set generously rather than tightly. Once 24+ additional months of consistently captured data are available, a SARIMA or Prophet model could test for court-term seasonality.
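
The flat forecast path is easier to read once the model is viewed as simple exponential smoothing with smoothing weight alpha = 1 + ma1. The base-R sketch below illustrates that recursion. The helper `ses_level` and the `recent` intake values are hypothetical, and the result matches an ARIMA(0,1,1) forecast only up to how the starting level is initialised.

```r
# Simple exponential smoothing:
#   level_t = alpha * y_t + (1 - alpha) * level_{t-1}
# Hypothetical helper for illustration; not the forecast package's routine.
ses_level <- function(y, alpha, level0 = y[1]) {
  level <- level0
  for (obs in y) {
    level <- alpha * obs + (1 - alpha) * level
  }
  level   # the forecast for every future month is this final level
}

ma1   <- -0.7272     # MA(1) coefficient reported by summary(fit_arima)
alpha <- 1 + ma1     # equivalent smoothing weight

# Made-up recent monthly intake counts, for illustration only
recent <- c(3, 0, 2, 1, 0, 1, 2, 0, 1, 1, 0, 2)
ses_level(recent, alpha)
```

A small alpha means the forecast reacts slowly to any single month, which fits a series where most months record zero to three new cases.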

8 People Analytics (Counsel Workload and Concentration)

8.1 Theory

People analytics applies human-resources and organisational-behaviour methods to workforce data. In a legal operations context, the “workforce” comprises outside counsel. Key metrics include caseload distribution (how many active cases each firm carries), the Herfindahl-Hirschman Index (HHI) for concentration risk, and historical win-rate proxies by firm and dispute category (Marr, 2018).
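
As a quick illustration of the HHI, the base-R sketch below computes the index from a vector of case counts per firm. The function name `hhi` and the example counts are illustrative. The second call shows the lowest value the index can take when work is spread evenly over 40 firms, the number of distinct firms reported in Table 1.

```r
# Herfindahl-Hirschman Index on a 0-10,000 scale:
# sum of squared percentage shares.
hhi <- function(case_counts) {
  shares_pct <- 100 * case_counts / sum(case_counts)
  sum(shares_pct^2)
}

hhi(c(5, 3, 2))    # three firms with 50%, 30% and 20% of the cases
hhi(rep(1, 40))    # 40 firms with equal shares: the minimum for 40 firms
```

Comparing an observed HHI against this equal-share minimum gives a sense of how far the panel is from an even spread, in addition to the conventional 1,500 and 2,500 thresholds.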

8.2 Business Justification

Concentrating too many cases in a single firm creates operational risk: if that firm has a conflict of interest, loses a key partner, or under-performs, PNL faces sudden exposure across multiple simultaneous matters. Conversely, spreading cases across too many firms raises supervision costs and dilutes institutional knowledge. People analytics quantifies these trade-offs and identifies which firms are approaching overload.

8.3 Analysis

Code

counsel_active <- new_cases |>
  count(counsel, name = "active_cases") |>
  arrange(desc(active_cases))

current_load <- counsel_active |>
  filter(counsel != "Unassigned") |>
  arrange(desc(active_cases))

current_load |>
  head(15) |>
  kbl(caption = "Table 6: Active caseload by outside-counsel firm (top 15)") |>
  kable_styling(bootstrap_options = c("striped","hover"), full_width = FALSE)

Table 6: Active caseload by outside-counsel firm (top 15)
counsel	active_cases
Henry Yekovie & Co.	32
J.A. Omose & Associates	21
Ama Ekereke & Co.	19
Salat & Salaat	16
Gary Hawkins Solicitors	13
Obilor Akudihor & Associates	11
V.E. Anigma & Co.	9
The Principles Law Partnership	8
Thompson Okpoko & Partners	6
Albert Akinmade, SAN & Partners	3
Consolex Legal Practitioners	3
L.A. Lawrence Associates	3
Mia Madonna Essien, SAN	3
Albert Akinmade	2
Gweke Obi	2

Code

# ── Herfindahl-Hirschman Index ────────────────────────────────────────
assigned_only  <- new_cases |> filter(counsel != "Unassigned")
total_assigned <- nrow(assigned_only)
market_shares  <- assigned_only |>
  count(counsel, name = "n") |>
  mutate(share = n / total_assigned)
hhi_val <- sum((market_shares$share * 100)^2)

cat(sprintf(
  "HHI (assigned cases) = %.0f\n  (<1500 = unconcentrated; 1500–2500 = moderate; >2500 = concentrated)\n",
  hhi_val))

HHI (assigned cases) = 846
  (<1500 = unconcentrated; 1500–2500 = moderate; >2500 = concentrated)

Code

current_load |>
  head(12) |>
  mutate(counsel = fct_reorder(counsel, active_cases)) |>
  ggplot(aes(x = active_cases, y = counsel)) +
  geom_col(fill = "#756bb1") +
  geom_vline(xintercept = 20, linetype = "dashed",
             colour = "red", linewidth = 0.8) +
  annotate("text", x = 21, y = 1.5,
           label = "Overload threshold",
           colour = "red", hjust = 0, fontface = "bold") +
  geom_text(aes(label = active_cases), hjust = -0.2, fontface = "bold") +
  scale_x_continuous(expand = expansion(c(0, 0.18))) +
  labs(title = "Active Caseload by Outside Counsel Firm",
       x = "Active cases", y = NULL) +
  theme_minimal(base_size = 12)

Figure 6: Top firms by active caseload (red line = overload threshold)

Code

heat_df <- new_cases |>
  filter(counsel != "Unassigned") |>
  count(counsel, dispute_cat) |>
  group_by(counsel) |>
  mutate(tot = sum(n)) |>
  ungroup() |>
  filter(tot >= 4) |>           # firms with ≥4 active cases
  mutate(counsel = fct_reorder(counsel, tot))

ggplot(heat_df, aes(dispute_cat, counsel, fill = n)) +
  geom_tile(colour = "white") +
  geom_text(aes(label = n), colour = "white", fontface = "bold") +
  scale_fill_distiller(palette = "BuPu", direction = 1) +
  labs(title = "Counsel × Dispute Category — Active Caseload",
       x = NULL, y = NULL, fill = "Cases") +
  theme_minimal(base_size = 11) +
  theme(axis.text.x = element_text(angle = 20, hjust = 1))

Figure 7: Counsel × dispute-category exposure

8.4 Interpretation

Workload is sharply skewed: Henry Yekovie & Co. alone carries 32 of the 176 assigned cases (about 18%), while the long tail consists of single-engagement firms and data-quality artefacts (for example, “Albert Akinmade” and “Albert Akinmade, SAN & Partners” appear as separate entries in Table 6 and may be the same firm recorded under two names). The HHI of 846 sits in the unconcentrated range, but that masks the visible concentration at the very top: removing the lead firm from the panel would create immediate exposure across several dispute categories simultaneously. The category heat-map shows that community matters are not concentrated with any designated specialist firm, a finding that directly informs the LP allocation in the next section.

9 Optimisation (Linear Programme for Counsel Assignment)

9.1 Theory

Linear programming (LP) optimises a linear objective function subject to linear inequality and equality constraints (Hillier & Lieberman, 2015). Here, the decision variables are binary assignments of unassigned cases to panel firms. The objective function maximises the total quality score of the allocation. In the implementation below, the score for a firm-case pair is a fixed firm-tier score plus a bonus for the seniority of the court; it is a stand-in for each firm’s track record rather than a measured win rate. Constraints require every case to be assigned to exactly one firm and cap each firm’s total caseload so that already-busy firms are not overloaded further.
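
Written out, the assignment problem takes the following form. The symbols are my own shorthand: $s_{ij}$ is the quality score of firm $i$ for case $j$, $c_i$ is firm $i$'s existing caseload, and $C$ is the per-firm cap. The structure mirrors the constraint matrices built in Section 9.3, where $F = 12$ firms, $N = 61$ cases and $C = 45$.

```latex
\[
\begin{aligned}
\max_{x}\quad & \sum_{i=1}^{F} \sum_{j=1}^{N} s_{ij}\, x_{ij} \\
\text{subject to}\quad
  & \sum_{i=1}^{F} x_{ij} = 1
      && \text{for every case } j \quad \text{(each case gets exactly one firm)} \\
  & \sum_{j=1}^{N} x_{ij} \le \max(0,\; C - c_i)
      && \text{for every firm } i \quad \text{(capacity ceiling)} \\
  & x_{ij} \in \{0, 1\}
      && \text{for all } i, j
\end{aligned}
\]
```

Strictly speaking the binary restriction makes this an integer programme, which `lp()` handles through its `all.bin = TRUE` argument.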

9.2 Business Justification

Of the 237 active cases, 61 (26%) currently have no assigned outside counsel. Each day without counsel assignment is a day without a litigation strategy, potentially leading to default judgments, missed interlocutory deadlines, and increased exposure. LP provides an objective, auditable allocation that management can defend to the Board.

9.3 Analysis

Code

# ── Quality-score function (court fit + firm tier) ────────────────────
panel_12 <- current_load |> head(12) |> pull(counsel)

quality_score <- function(firm, court) {
  court_bonus <- case_when(
    court %in% c("Court of Appeal","Supreme Court") ~ 0.15,
    court == "Federal High Court"                   ~ 0.10,
    court == "State High Court"                     ~ 0.05,
    TRUE                                             ~ 0.00
  )
  firm_tier <- case_when(
    firm %in% c("Henry Yekovie & Co.","J.A. Omose & Associates",
                "Albert Akpomudje SAN & Partners","Solola & Akpana") ~ 0.30,
    firm %in% c("Garnet & Hawthorns Solicitors",
                "Obilor Akudihor & Associates",
                "The Principles Law Partnership",
                "Gary Hawkins Solicitors")                            ~ 0.25,
    TRUE                                                              ~ 0.20
  )
  firm_tier + court_bonus
}

unassigned_cases <- new_cases |>
  filter(counsel == "Unassigned") |>
  select(sn, case_name, court_type, dispute_cat)

n_ua    <- nrow(unassigned_cases)
n_firms <- length(panel_12)

score_mat <- outer(panel_12, unassigned_cases$court_type, quality_score)

cat("Unassigned cases:", n_ua, " | Panel firms:", n_firms, "\n")

Unassigned cases: 61  | Panel firms: 12

Code

# ── LP formulation: x[i,j] = 1 if case j is assigned to firm i ───────
n_vars  <- n_firms * n_ua
obj_vec <- as.vector(t(score_mat))

# Constraint 1: each case assigned to exactly one firm
A_case <- matrix(0, nrow = n_ua, ncol = n_vars)
for (j in seq_len(n_ua)) {
  idx_cols <- seq(j, n_vars, by = n_ua)
  A_case[j, idx_cols] <- 1
}

# Constraint 2: per-firm capacity ceiling (existing + new ≤ cap)
capacity_cap <- 45L
A_firm <- matrix(0, nrow = n_firms, ncol = n_vars)
for (i in seq_len(n_firms)) {
  A_firm[i, ((i - 1) * n_ua + 1):(i * n_ua)] <- 1
}
b_firm_max <- pmax(0L, capacity_cap - current_load$active_cases[1:n_firms])

A_all   <- rbind(A_case, A_firm)
b_all   <- c(rep(1, n_ua), b_firm_max)
dir_all <- c(rep("=", n_ua), rep("<=", n_firms))

lp_result <- lp(direction    = "max",
                objective.in = obj_vec,
                const.mat    = A_all,
                const.rhs    = b_all,
                const.dir    = dir_all,
                all.bin      = TRUE)

cat("LP status:",
    ifelse(lp_result$status == 0, "Optimal solution found", "No solution"), "\n")

LP status: Optimal solution found

Code

cat("Objective value:", round(lp_result$objval, 3), "\n")

Objective value: 19.65

Code

# ── Extract assignments ──────────────────────────────────────────────
sol_mat <- matrix(round(lp_result$solution), nrow = n_firms, byrow = TRUE)
assignment <- unassigned_cases |>
  mutate(assigned_firm = panel_12[apply(sol_mat, 2, which.max)])

assignment_summary <- assignment |>
  count(assigned_firm, name = "new_cases_assigned") |>
  left_join(current_load, by = c("assigned_firm" = "counsel")) |>
  mutate(active_cases = replace_na(active_cases, 0L),
         total_after  = active_cases + new_cases_assigned) |>
  arrange(desc(new_cases_assigned)) |>
  rename(Firm         = assigned_firm,
         `Existing`   = active_cases,
         `Newly assigned` = new_cases_assigned,
         `Total after assignment` = total_after)

assignment_summary |>
  kbl(caption = "Table 7: LP-optimal assignment of previously unassigned cases") |>
  kable_styling(bootstrap_options = c("striped","hover"), full_width = FALSE)

Table 7: LP-optimal assignment of previously unassigned cases
Firm	Newly assigned	Existing	Total after assignment
Gary Hawkins Solicitors	24	13	37
J.A. Omose & Associates	24	21	45
Henry Yekovie & Co.	13	32	45

Code

assignment_summary |>
  pivot_longer(c(Existing, `Newly assigned`),
               names_to = "Status", values_to = "Cases") |>
  mutate(Firm = fct_reorder(Firm, Cases, sum)) |>
  ggplot(aes(Cases, Firm, fill = Status)) +
  geom_col() +
  scale_fill_manual(values = c("Existing" = "#756bb1",
                                "Newly assigned" = "#fdae61")) +
  labs(title = "Caseload Before and After LP Assignment",
       x = "Active cases", y = NULL, fill = NULL) +
  theme_minimal(base_size = 12)

Figure 8: Caseload before and after LP assignment

9.4 Interpretation

The LP places all 61 previously unassigned cases with a panel firm while respecting the 45-case ceiling. The objective value (19.65) is the total quality score the allocation achieves; higher is better, and re-running the LP with adjusted weights (say, raising the score for community-case specialists) immediately produces an alternative plan that can be compared on the same scale. Two practical points for the Board: (i) every unassigned case has an immediate panel-firm home, closing the 26% strategy gap; and (ii) the allocation is concentrated in three firms, with Henry Yekovie & Co. and J.A. Omose & Associates both filled to the 45-case cap and Gary Hawkins Solicitors rising from 13 to 37 cases. Because the cap is well above the 20-case overload threshold drawn in Figure 6, a lower cap or a dispute-category term in the score would be needed if the aim is to shift work towards firms that currently sit below that line.
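
The score in Section 9.3 is the sum of a firm-tier term and a court bonus, and the court bonus does not depend on the firm, so it adds the same amount whichever firm takes a case. The allocation can therefore be cross-checked by filling the highest-tier firms up to their remaining capacity. The base-R sketch below does this under stated assumptions: existing caseloads come from Table 6, tiers follow the firm names matched in the `quality_score` function, only the five panel firms with a tier above 0.20 are listed, and ties between equal-tier firms are broken by row order, which is arbitrary. The helper `greedy_fill` is hypothetical.

```r
firms <- data.frame(
  firm     = c("Henry Yekovie & Co.", "J.A. Omose & Associates",
               "Gary Hawkins Solicitors", "Obilor Akudihor & Associates",
               "The Principles Law Partnership"),
  existing = c(32, 21, 13, 11, 8),             # active cases, Table 6
  tier     = c(0.30, 0.30, 0.25, 0.25, 0.25),  # from quality_score()
  stringsAsFactors = FALSE
)

# Fill firms in descending tier order until all cases are placed.
greedy_fill <- function(firms, n_cases, cap = 45) {
  firms  <- firms[order(-firms$tier), ]          # highest tier first
  room   <- pmax(0, cap - firms$existing)        # remaining capacity
  before <- c(0, head(cumsum(room), -1))         # capacity used by earlier firms
  firms$assigned <- pmin(room, pmax(0, n_cases - before))
  firms
}

out <- greedy_fill(firms, n_cases = 61)
out[out$assigned > 0, c("firm", "existing", "assigned")]
sum(out$assigned * out$tier)   # firm-tier part of the objective value
```

Under these assumptions the greedy fill reproduces the counts in Table 7, which suggests the LP result is driven by the tier scores and the cap rather than by court type or dispute category.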

10 Integrated Findings

The five analyses converge on a single strategic message: PNL’s litigation portfolio is under-managed relative to its financial scale, and the costs of inaction compound across multiple risk dimensions simultaneously.

Text analytics reveals that the dominant case type is declaratory / procedural, with community grievances surfacing repeatedly. This pattern suggests a proactive community-engagement programme would address disputes earlier, at lower cost, than court-based resolution.
Monte Carlo simulation quantifies the stakes: the median annual payout is around ₦1.2 billion, but the 95th-percentile tail reaches ~₦25 billion. The tail risk is driven by a small number of high-value community and regulatory claims — the exact case types text analytics identifies as recurring.
Forecasting shows monthly case intake has declined sharply since 2022, but the ARIMA model’s wide prediction intervals and the likely administrative under-recording of recent dates mean this trend should be treated with caution. Budget planning should use 10–15 new cases per year as a floor.
People analytics uncovers a workload imbalance: one firm carries the largest block of active cases while 26% of cases have no counsel at all. The HHI masks this structural gap because it is computed over assigned cases only. The category heatmap shows no firm has been explicitly assigned community / environmental specialist status, even though those cases carry the largest claims.
Optimisation resolves the assignment gap immediately: all unassigned cases are distributed across panel firms using a principled, auditable rule that respects capacity and maximises court-type fit.

Single integrated recommendation: Implement a three-track triage protocol.

Track 1 — High-value claims (≥ ₦500 m): assign only to Tier 1 firms (Henry Yekovie, J.A. Omose, Albert Akpomudje SAN) and initiate settlement assessment within 30 days.
Track 2 — Community / environmental matters: brief Garnet & Hawthorns and Obilor Akudihor as designated specialists and mandate early community dialogue.
Track 3 — Routine declaratory matters: use the LP assignment output to distribute to under-loaded firms and monitor monthly.

Review the allocation model quarterly using updated caseload figures.
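As a minimal sketch, the three tracks can be expressed as a rule applied to each row of the register. The column names (`claim_ngn`, `dispute_cat`) and the example rows are hypothetical, and two choices are assumptions not stated in the protocol: Track 1 takes precedence when a high-value claim is also a community matter, and every remaining case falls to Track 3.

```r
# Assign a triage track from claim value (naira) and dispute category.
# ASSUMPTION: Track 1 outranks Track 2; anything else defaults to Track 3.
triage_track <- function(claim_ngn, dispute_cat) {
  high_value <- !is.na(claim_ngn) & claim_ngn >= 500e6          # >= NGN 500 m
  community  <- grepl("community|environment", dispute_cat, ignore.case = TRUE)
  ifelse(high_value, "Track 1",
         ifelse(community, "Track 2", "Track 3"))
}

# Hypothetical example rows, not taken from the register.
register <- data.frame(
  case_id     = c("A", "B", "C", "D"),
  claim_ngn   = c(750e6, NA, 20e6, NA),
  dispute_cat = c("Community / Chieftaincy", "Community / Chieftaincy",
                  "Commercial / Contract", "Declaratory / Procedural")
)
register$track <- triage_track(register$claim_ngn, register$dispute_cat)
register[, c("case_id", "track")]
```

Cases with a missing claim value cannot qualify for Track 1 under this rule, which is one more reason to collect claim values from the pleadings.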

11 Limitations and Further Work

Data quality: A significant fraction of active cases lacks date-received entries and almost all cases lack financial claim values. Imputing dates from context (e.g., suit-number year prefixes) and collecting claim values from court pleadings would substantially improve the Monte Carlo model’s precision.
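One way to approach the date imputation is to pull a four-digit year out of the suit number and use it as a coarse filing year. The sketch below assumes the year appears as a standalone `20xx` token somewhere in the suit number; the example suit numbers are invented for illustration, and the real register may follow a different convention.

```r
# Extract a filing year from a suit number, or NA when no year is present.
# ASSUMPTION: the year is a standalone four-digit token starting with "20".
extract_suit_year <- function(suit_no) {
  suit_no <- as.character(suit_no)
  suit_no[is.na(suit_no)] <- ""                      # treat missing as "no year"
  hit <- regexpr("\\b20[0-9]{2}\\b", suit_no, perl = TRUE)
  yr  <- rep(NA_integer_, length(suit_no))
  # regmatches() returns only the matched strings, aligned with hit > 0
  yr[hit > 0] <- as.integer(regmatches(suit_no, hit))
  yr
}

# Invented examples
extract_suit_year(c("FHC/PH/CS/45/2019", "W/12/2021", "no year here", NA))
```

An imputed year is enough to place a case in an annual intake count, but not in the monthly series used by the ARIMA model, so imputed cases would need to be flagged.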

Model assumptions: The Monte Carlo’s 40% loss rate and 20% settlement discount (applied in the simulation as the fraction of a lost claim that is actually paid) are conservative assumptions, not empirical estimates derived from PNL’s own closed-case history. With more resolved cases that include explicit outcome classifications (“PNL won”, “settled at ₦X”), these parameters could be estimated by logistic regression against case characteristics (court type, dispute category, counsel, claim value).
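The logistic-regression idea can be sketched as follows. The data are entirely synthetic, and the outcome column (`lost`) and predictors are hypothetical stand-ins for the outcome classifications the register does not yet record. The sketch shows the mechanics only and says nothing about PNL's actual loss rate.

```r
set.seed(1)
n <- 200

# SYNTHETIC closed-case history (not PNL data)
closed_outcomes <- data.frame(
  court_type = factor(sample(c("Federal High Court", "State High Court"),
                             n, replace = TRUE)),
  log_claim  = rnorm(n, mean = 18, sd = 2)   # log of claim value in naira
)
# Synthetic outcome: 1 = PNL lost or settled, 0 = PNL won
p_true <- plogis(-0.4 +
                 0.8 * (closed_outcomes$court_type == "State High Court") +
                 0.2 * (closed_outcomes$log_claim - 18))
closed_outcomes$lost <- rbinom(n, 1, p_true)

# Loss probability as a function of case characteristics
fit   <- glm(lost ~ court_type + log_claim, family = binomial,
             data = closed_outcomes)
p_hat <- predict(fit, type = "response")

summary(fit)$coefficients
mean(p_hat)   # portfolio-average loss probability implied by the fit
```

In the simulation, the per-case `p_hat` values could replace the flat loss probability passed to `rbinom()`, so that high-risk case types contribute more to the tail.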

Forecasting: The ARIMA model’s prediction intervals are very wide (standard deviation around 2.7 cases per month), reflecting both genuine variability and the short time series. With five more years of consistently recorded data, a seasonal ARIMA (SARIMA) or Prophet model would capture any quarterly court-term seasonality.
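A seasonality check of this kind can be sketched with base R's `arima()`. The series below is synthetic with an assumed 12-month cycle, and the seasonal MA(1) specification is one illustrative choice rather than a recommendation. Both models use the same differencing so that their AIC values are comparable.

```r
set.seed(42)
months <- 1:96

# SYNTHETIC monthly intake with an assumed annual cycle (not PNL data)
intake <- ts(rpois(96, lambda = 3 + 1.5 * sin(2 * pi * months / 12)),
             start = c(2017, 1), frequency = 12)

# Non-seasonal ARIMA(0,1,1) versus ARIMA(0,1,1)(0,0,1)[12]
fit_plain    <- arima(intake, order = c(0, 1, 1))
fit_seasonal <- arima(intake, order = c(0, 1, 1),
                      seasonal = list(order = c(0, 0, 1), period = 12))

c(plain = AIC(fit_plain), seasonal = AIC(fit_seasonal))   # lower is better

# 12-month forecast with approximate 95% prediction intervals
fc <- predict(fit_seasonal, n.ahead = 12)
data.frame(point = as.numeric(fc$pred),
           lo95  = as.numeric(fc$pred - 1.96 * fc$se),
           hi95  = as.numeric(fc$pred + 1.96 * fc$se))
```

The standard errors in `fc$se` grow with the horizon, which is the mechanism behind the wide intervals noted above.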

Text analytics: The remark field is written in informal legal prose with inconsistent punctuation. A more sophisticated pipeline, using named-entity recognition and sentence-level sentiment classification, would extract more nuanced outcome signals.

People analytics: The Herfindahl-Hirschman Index treats all cases as equivalent. A weighted HHI (by claim value or strategic importance) would more accurately reflect concentration risk.
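The difference between the two indices can be shown in a few lines of base R. The firms and claim values below are hypothetical; the point is only that weighting by claim value can move the index substantially when one firm holds the largest claim.

```r
# HHI on the 0-10,000 scale: sum of squared percentage shares
hhi <- function(x) {
  share <- x / sum(x)
  sum((100 * share)^2)
}

# HYPOTHETICAL active cases (not PNL data)
cases <- data.frame(
  counsel   = c("Firm A", "Firm A", "Firm B", "Firm C"),
  claim_ngn = c(100e6, 100e6, 700e6, 100e6)
)

# Unweighted: every case counts once
hhi_count <- hhi(as.vector(table(cases$counsel)))

# Weighted: shares are based on total claim value per firm
hhi_value <- hhi(as.vector(tapply(cases$claim_ngn, cases$counsel, sum)))

c(by_count = hhi_count, by_value = hhi_value)   # 3750 and 5400
```

Here Firm A holds half the cases but Firm B holds 70% of the claim value, so the value-weighted index is the higher of the two.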

Optimisation: The LP currently uses a heuristic quality-score function based on firm tier and court type. Calibrating these weights against actual historical outcome data (win rates by firm × court × dispute category) would transform the model from a triage tool into a predictive assignment system.

12 References

Box, G. E. P., Jenkins, G. M., & Reinsel, G. C. (2015). Time series analysis: Forecasting and control (5th ed.). Wiley.
Hyndman, R. J., & Athanasopoulos, G. (2021). Forecasting: Principles and practice (3rd ed.). OTexts. https://otexts.com/fpp3/
Hyndman, R. J., et al. (2023). forecast: Forecasting functions for time series and linear models (R package version 8.21.1). https://pkg.robjhyndman.com/forecast/
Hillier, F. S., & Lieberman, G. J. (2015). Introduction to operations research (10th ed.). McGraw-Hill.
Marr, B. (2018). Data-driven HR: How to use analytics and metrics to drive performance. Kogan Page.
Silge, J., & Robinson, D. (2017). Text mining with R: A tidy approach. O’Reilly. https://www.tidytextmining.com/
Symanzik, J., & Friendly, M. (2023). lpSolve: Interface to Lp_solve v. 5.5 to solve linear/integer programs (R package). https://CRAN.R-project.org/package=lpSolve
Vose, D. (2008). Risk analysis: A quantitative guide (3rd ed.). Wiley.
Wickham, H., et al. (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686. https://doi.org/10.21105/joss.01686

13 Appendix: AI Usage Statement

Posit Assistant (an AI coding assistant integrated into RStudio) was used to help structure the Quarto document template, debug R code for reading and cleaning the multi-header Excel file, and suggest the three-stage Monte Carlo formulation. All analytical decisions were made independently by the analyst, drawing on professional experience in Nigerian upstream oil-and-gas litigation and the assigned course materials: the choice of techniques, the interpretation of outputs, the model parameters (40% loss rate, 20% settlement discount, 45-case capacity ceiling), the TF-IDF stop-word list, and the LP quality-score function. The AI did not have access to confidential case files or to legally privileged material. All code was reviewed, tested, and executed locally in the analyst’s RStudio environment. The integrated recommendation in Section 10 reflects the analyst’s professional judgement, not automated output.

--- title: "Legal Risk Optimisation at Petro Nigeria Limited" subtitle: "Five Advanced Analytics Techniques Applied to the Active Litigation Portfolio" author: "Head of Litigation, Petro Nigeria Limited (PNL)" date: today format: html: theme: cosmo toc: true toc-depth: 3 number-sections: true code-fold: true code-tools: true code-link: true embed-resources: true fig-width: 8 fig-height: 5 smooth-scroll: true execute: warning: false message: false echo: true --- ```{r setup} #| include: false # ---- Install missing packages on first run (uncomment if needed) ----------- # pkgs <- c("tidyverse","readxl","lubridate","tidytext","forecast","lpSolve", # "scales","kableExtra","RColorBrewer","janitor","SnowballC") # install.packages(setdiff(pkgs, rownames(installed.packages()))) suppressPackageStartupMessages({ library(tidyverse); library(readxl); library(lubridate) library(tidytext); library(forecast); library(lpSolve) library(scales); library(kableExtra); library(RColorBrewer) library(janitor) }) set.seed(8321) knitr::opts_chunk$set(fig.width = 8, fig.height = 5, dpi = 150) # ── Helper: parse mixed date column (Excel serial number or text) ────────── parse_date_mixed <- function(x) { if (inherits(x, "Date") || inherits(x, "POSIXt")) return(as.Date(x)) num <- suppressWarnings(as.numeric(x)) d <- as.Date(num, origin = "1899-12-30") # If still NA, try text-based parsing na_idx <- is.na(d) & !is.na(x) & x != "" if (any(na_idx)) { parsed <- suppressWarnings(parse_date_time( x[na_idx], orders = c("ymd","dmy","mdy","Ymd HMS","dmy HMS"))) d[na_idx] <- as.Date(parsed) } # Reject implausible dates (data-entry errors) d[!is.na(d) & (d < as.Date("2010-01-01") | d > as.Date("2028-12-31"))] <- NA d } # ── Helper: standardise counsel firm names ──────────────────────────────── standardise_counsel <- function(x) { x <- str_trim(as.character(x)) case_when( is.na(x) | x == "" | x == "NaN" ~ "Unassigned", str_detect(x, regex("solola", ignore_case = TRUE)) ~ "Solola & Akpana", str_detect(x, 
regex("henry|yekovie|uwensuyi", ignore_case = TRUE)) ~ "Henry Yekovie & Co.", str_detect(x, regex("thompson|okpoko", ignore_case = TRUE)) ~ "Thompson Okpoko & Partners", str_detect(x, regex("consolex", ignore_case = TRUE)) ~ "Consolex Legal Practitioners", str_detect(x, regex("princip", ignore_case = TRUE)) ~ "The Principles Law Partnership", str_detect(x, regex("garnet|hawthorn", ignore_case = TRUE)) ~ "Garnet & Hawthorns Solicitors", str_detect(x, regex("obilor|akudihor", ignore_case = TRUE)) ~ "Obilor Akudihor & Associates", str_detect(x, regex("ama ekereke", ignore_case = TRUE)) ~ "Ama Ekereke & Co.", str_detect(x, regex("j\\.?\\s*a\\.?\\s*omose|omonoseh|joseph.*omose", ignore_case = TRUE)) ~ "J.A. Omose & Associates", str_detect(x, regex("akpomudje", ignore_case = TRUE)) ~ "Albert Akpomudje SAN & Partners", str_detect(x, regex("gary hawkins", ignore_case = TRUE)) ~ "Gary Hawkins Solicitors", str_detect(x, regex("lawrence", ignore_case = TRUE)) ~ "L.A. Lawrence Associates", str_detect(x, regex("salat", ignore_case = TRUE)) ~ "Salat & Salaat", str_detect(x, regex("anigma", ignore_case = TRUE)) ~ "V.E. 
Anigma & Co.", str_detect(x, regex("essien", ignore_case = TRUE)) ~ "Mia Madonna Essien, SAN", str_detect(x, regex("orbih", ignore_case = TRUE)) ~ "Ferd Orbih SAN", TRUE ~ x ) } # ── Helper: classify court type from suit number prefix ─────────────────── classify_court <- function(suit_no) { s <- str_to_upper(as.character(suit_no)) case_when( str_detect(s, "^SC/|/SC/") ~ "Supreme Court", str_detect(s, "^CA/|/CA/") ~ "Court of Appeal", str_detect(s, "^FHC|/FHC") ~ "Federal High Court", str_detect(s, "^NIC|/NIC") ~ "National Industrial Court", str_detect(s, "EHC|/HC|/HCT|^W/|^ORC") ~ "State High Court", TRUE ~ "Other / Unclassified" ) } # ── Helper: classify dispute category from case name ────────────────────── classify_dispute <- function(case_name) { s <- str_to_lower(as.character(case_name)) case_when( str_detect(s, "community|youth|chief|hrh|king|paramount") ~ "Community / Chieftaincy", str_detect(s, "tax|revenue|firs|customs|nupr") ~ "Tax / Regulatory", str_detect(s, "labour|employment|union|nuc") ~ "Employment / Labour", str_detect(s, "garnishee|judgment") ~ "Enforcement / Garnishee", str_detect(s, "land|estate|boundary") ~ "Land / Property", str_detect(s, "contract|breach|debt|sum") ~ "Commercial / Contract", TRUE ~ "Declaratory / Procedural" ) } # ── Helper: parse claim amount (handles strings, "nil", commas, "m"/"bn") ── parse_claim <- function(x) { s <- str_to_lower(str_trim(as.character(x))) s[s %in% c("", "nil", "na", "n/a", "-")] <- NA_character_ # Strip currency / spacing s <- str_remove_all(s, "[₦$€,]") # Handle "m" / "bn" / "b" shorthand mult <- rep(1, length(s)) mult[str_detect(s, "bn?$|billion")] <- 1e9 mult[str_detect(s, "m$|million")] <- 1e6 mult[str_detect(s, "k$|thousand")] <- 1e3 s <- str_remove(s, "[a-z]+$") s <- str_trim(s) v <- suppressWarnings(as.numeric(s)) v * mult } ``` ```{r load-data} #| include: false DATA_PATH <- "Litigation.xlsx" # ── Load Closed Cases (multi-header: title in row 1, headers in row 3) ──── closed_raw <- 
read_excel(DATA_PATH, sheet = "Closed Cases", skip = 3, col_types = "text") |> janitor::clean_names() |> filter(!is.na(case_name), str_detect(case_name, "[A-Za-z]")) # Make the column names predictable regardless of exact source casing nm <- names(closed_raw) get_col <- function(pattern, default = NA_character_) { m <- nm[str_detect(nm, regex(pattern, ignore_case = TRUE))] if (length(m) == 0) return(default) m[1] } closed <- closed_raw |> transmute( sn = as.integer(.data[[get_col("^s_?n$|^sn|serial")]]), case_name = .data[[get_col("case_name|case name")]], suit_no = .data[[get_col("suit")]], remark = .data[[get_col("remark")]], date_closed = parse_date_mixed(.data[[get_col("closed")]]), date_received = parse_date_mixed(.data[[get_col("received")]]), counsel_raw = .data[[get_col("counsel")]], claim_raw = .data[[get_col("claim")]], fee_raw = .data[[get_col("fee")]] ) |> mutate( counsel = standardise_counsel(counsel_raw), court_type = classify_court(suit_no), dispute_cat = classify_dispute(case_name), claim_ngn = parse_claim(claim_raw), fee_ngn = parse_claim(fee_raw) ) |> filter(!is.na(case_name)) # ── Load New (active) Cases ─────────────────────────────────────────────── new_raw <- read_excel(DATA_PATH, sheet = "New Cases", skip = 0, col_types = "text") |> janitor::clean_names() |> filter(!is.na(case_name) | !is.na(suit_no)) nm2 <- names(new_raw) get_col2 <- function(pattern, default = NA_character_) { m <- nm2[str_detect(nm2, regex(pattern, ignore_case = TRUE))] if (length(m) == 0) return(default) m[1] } new_cases <- new_raw |> transmute( sn = as.integer(.data[[get_col2("^s_?n$|^sn|serial")]]), case_name = .data[[get_col2("case_name|case name")]], suit_no = .data[[get_col2("suit")]], date_received = parse_date_mixed(.data[[get_col2("received|date")]]), counsel_raw = .data[[get_col2("counsel")]] ) |> mutate( counsel = standardise_counsel(counsel_raw), court_type = classify_court(suit_no), dispute_cat = classify_dispute(case_name) ) |> filter(!is.na(case_name) | 
!is.na(suit_no)) ``` # Executive Summary Petro Nigeria Limited (PNL) faces an active litigation portfolio of `r nrow(new_cases)` cases spanning multiple courts across Nigeria, with `r sum(new_cases$counsel == "Unassigned")` cases (`r round(mean(new_cases$counsel == "Unassigned")*100)`%) currently lacking assigned outside counsel. The legal team's core challenge is fivefold: understanding what drives litigation outcomes (text and pattern analysis), estimating how much financial exposure the portfolio represents (simulation), anticipating when new disputes will arise (forecasting), determining who should handle each case (people analytics), and ensuring workload is allocated optimally across the panel of approved firms (optimisation). This report applies five advanced analytics techniques to PNL's litigation register. **Text analytics** on closed-case remarks identifies outcome-predictive language and dispute patterns. A three-stage **Monte Carlo simulation** estimates financial exposure: the portfolio carries a median annual risk of approximately ₦1.2 billion, rising to \~₦25 billion at the 95th percentile — a figure that should inform provisioning decisions. An **ARIMA(0,1,1) time-series model** forecasts roughly one new case per month for 2025. **Counsel workload analysis** reveals concerning concentration, with Henry Yekovie & Co. carrying the single largest active caseload. Finally, a **linear-programming model** assigns all unassigned cases across panel firms while respecting capacity constraints. The integrated recommendation is to **activate a triage-and-assign protocol immediately**, prioritising high-exposure cases for senior panel firms before the next financial reporting period. 
# Professional Disclosure **Job Title:** Head of Litigation, Petro Nigeria Limited (PNL) **Organisation Type / Sector:** Oil and Gas — In-house legal department of a Nigerian upstream oil and gas company operating under licences granted by the Nigerian Upstream Petroleum Regulatory Commission (NUPRC). **Operational relevance of each technique:** - **Text Analytics:** Case remarks and pleadings contain unstructured narrative that is never systematically mined. Applying TF-IDF analysis to the remark field of closed cases surfaces recurring language patterns (e.g. "struck out", "dismissed", "community") that correlate with specific dispute categories and court outcomes. This directly supports early-case assessment and settlement strategy. - **Monte Carlo Simulation:** Nigerian litigation claims range from a few million to several billion naira, and outcomes are highly uncertain. A probabilistic simulation that incorporates claim-filing rates, loss probabilities, and settlement discounts converts this uncertainty into a risk-quantified exposure distribution — essential for IFRS 37 provisioning and annual budgeting. - **Advanced Forecasting:** Legal team headcount, outside counsel budget, and court registry filings all require forward planning. A statistically rigorous time-series model of monthly case intake gives the legal department defensible projections when negotiating budgets with the CFO. - **People Analytics:** Outside counsel are professional relationships and scarce resources. Understanding each firm's current caseload, historical performance by dispute category, and concentration risk enables informed briefing decisions rather than default re-briefing of familiar names. - **Optimisation:** With dozens of unassigned cases and a finite panel of firms operating under capacity constraints, manual assignment is error-prone and potentially biased. 
Linear programming maximises portfolio-weighted quality scores subject to firm capacity and anti-concentration constraints, replacing guesswork with a principled allocation. # Data Collection and Sampling **Source:** Internal litigation register maintained by PNL's legal department in Microsoft Excel format (`Litigation.xlsx`). **Sheets and structure:** - **Closed Cases:** `r nrow(closed)` resolved cases spanning 2018–2024 (after removal of section-header rows). Variables include case name, suit number, narrative remark, date closed, date received, outside counsel, claimed amount, and counsel fee. - **New Cases:** `r nrow(new_cases)` active cases. Variables include case name, suit number, date received, and assigned outside counsel. **Collection method:** Administrative records captured by in-house paralegal staff as cases are opened and resolved. Dates are stored as Excel serial numbers. **Sampling frame:** The register is a census (not a sample) of all matters in which PNL is a party, though completeness cannot be independently verified. A material fraction of active cases lack a date-received entry, and a quarter have no counsel assigned. **Time period:** March 2017 to October 2029 (some dates appear to be data-entry errors; these are treated as missing in the forecasting model, which uses only the 2017–2024 window). **Ethical considerations:** All data relates to corporate litigation and contains no personal health or financial data attributable to private individuals beyond what appears on public court records. No informed-consent requirement arises. Case names and suit numbers are matters of public record in Nigerian courts. The dataset has been handled in a password-protected corporate environment consistent with PNL's data governance policy. 
# Data Description ```{r portfolio-overview} tibble( Metric = c("Total active cases","Closed cases (2018–2024)", "Active cases with no counsel assigned", "Active cases with no date received", "Distinct outside-counsel firms (active)", "Earliest date in active set", "Latest date in active set"), Value = c(nrow(new_cases), nrow(closed), sum(new_cases$counsel == "Unassigned"), sum(is.na(new_cases$date_received)), n_distinct(new_cases$counsel[new_cases$counsel != "Unassigned"]), format(min(new_cases$date_received, na.rm = TRUE), "%b %Y"), format(max(new_cases$date_received, na.rm = TRUE), "%b %Y")) ) |> kbl(caption = "Table 1: Portfolio overview") |> kable_styling(bootstrap_options = c("striped","hover"), full_width = FALSE) ``` ```{r dispute-breakdown} new_cases |> count(dispute_cat, name = "n") |> arrange(desc(n)) |> mutate(share = percent(n / sum(n), 0.1)) |> rename(`Dispute category` = dispute_cat, Active = n, Share = share) |> kbl(caption = "Table 2: Active cases by dispute category") |> kable_styling(bootstrap_options = c("striped","hover"), full_width = FALSE) ``` ```{r court-breakdown} #| fig-cap: "Figure 1: Active cases by court type" new_cases |> count(court_type) |> mutate(court_type = fct_reorder(court_type, n)) |> ggplot(aes(n, court_type)) + geom_col(fill = "#2c7fb8") + geom_text(aes(label = n), hjust = -0.2, fontface = "bold") + scale_x_continuous(expand = expansion(c(0, 0.12))) + labs(title = "Active Cases by Court Type", x = "Number of cases", y = NULL) + theme_minimal(base_size = 12) ``` # Text Analytics ## Theory Text analytics uses computational linguistics to extract meaning from unstructured text. TF-IDF (Term Frequency–Inverse Document Frequency) weights a word by how often it appears in a document relative to how rarely it appears across all documents, thereby surfacing terms that are distinctive to a particular group rather than merely common. 
In the legal context this identifies vocabulary that characterises specific court types or dispute categories (Silge & Robinson, 2017). ## Business Justification PNL's remark field contains a rich narrative of procedural history for each closed case but has never been mined systematically. Identifying which terms correlate with favourable outcomes (e.g. "struck out", "dismissed") versus prolonged litigation ("adjourned", "community") supports early-case classification, improving settlement timing and resource prioritisation. ## Analysis ```{r text-stopwords} # Custom legal stop-words legal_sw <- tibble(word = c( "the","of","and","in","to","a","is","was","on","for","this","that","by", "be","with","matter","court","case","plaintiff","defendant","parties", "pnl","petro","nigeria","limited","judgement","judgment","honourable", "justice","learned","counsel","suit","action","v","ors","anor", "january","february","march","april","may","june","july","august", "september","october","november","december", "2018","2019","2020","2021","2022","2023","2024","2025","2017","2016","2015", "trial","hearing","date","next","its","it","from","at","are","an","as", "has","had","been","have","which","their","his","her","they","were", "also","above","order","ordered","further" )) remark_tokens <- closed |> filter(!is.na(remark), nchar(remark) > 10) |> select(case_name, court_type, remark) |> unnest_tokens(word, remark) |> anti_join(legal_sw, by = "word") |> anti_join(stop_words, by = "word") |> filter(str_detect(word, "^[a-z]{3,}$")) ``` ```{r text-top-terms} #| fig-cap: "Figure 2: Top 20 terms in closed-case remarks (after stop-word removal)" word_freq <- remark_tokens |> count(word, sort = TRUE) word_freq |> head(20) |> mutate(word = fct_reorder(word, n)) |> ggplot(aes(x = n, y = word)) + geom_col(fill = "#4dac26") + labs(title = "Top 20 Terms in Closed-Case Remarks", subtitle = "After removal of legal boilerplate stop-words", x = "Frequency", y = NULL) + theme_minimal(base_size = 13) 
``` ```{r text-tfidf-by-court} #| fig-cap: "Figure 3: Top distinctive (TF-IDF) terms by court type" tfidf_court <- remark_tokens |> count(court_type, word) |> bind_tf_idf(word, court_type, n) |> arrange(court_type, desc(tf_idf)) tfidf_court |> group_by(court_type) |> slice_max(tf_idf, n = 5, with_ties = FALSE) |> ungroup() |> mutate(word = reorder_within(word, tf_idf, court_type)) |> ggplot(aes(x = tf_idf, y = word, fill = court_type)) + geom_col(show.legend = FALSE) + facet_wrap(~ court_type, scales = "free_y", ncol = 2) + scale_y_reordered() + scale_fill_brewer(palette = "Set2") + labs(title = "Distinctive Vocabulary by Court Type", subtitle = "Highest TF-IDF terms within each court type", x = "TF-IDF score", y = NULL) + theme_minimal(base_size = 12) ``` ## Interpretation The frequency chart identifies the most common vocabulary in closed-case remarks: procedural verbs ("struck", "dismissed", "settled"), geographically loaded nouns ("community", "land"), and trial mechanics ("adjourned", "ruling"). The TF-IDF heat-map sharpens this picture by isolating language that is *distinctive* to each court type. Federal High Court remarks lean toward regulatory and tax vocabulary; State High Court remarks toward community and land-tenure disputes; Court of Appeal remarks toward procedural language. This vocabulary map is the input to a proactive triage protocol — a new case can be roughly classified from the language of its pleadings before a full review. # Monte Carlo Simulation ## Theory Monte Carlo simulation estimates the probability distribution of an uncertain quantity by repeatedly drawing random samples from assumed input distributions and recording the aggregate outcome (Vose, 2008). Here, three sources of uncertainty compound: (1) whether a given active case will carry a quantified financial claim; (2) whether PNL will lose or settle that case; and (3) the actual monetary quantum paid. 
Combining 10,000 simulation runs generates a full exposure distribution from which Value-at-Risk (VaR) at the 95th and 99th percentiles can be extracted. ## Business Justification IAS 37 (*Provisions, Contingent Liabilities and Contingent Assets*) requires companies to recognise a provision when a payment is more likely than not and can be reliably estimated. A Monte Carlo model translates PNL's litigation portfolio into a probabilistic loss distribution, providing both the central estimate (for provision) and tail estimates (for sensitivity disclosure). The three-stage model structure explicitly reflects Nigerian litigation patterns: many active cases never carry a formal monetary claim, and of those that do, PNL historically settles at a discount. ## Analysis ```{r mc-parameters} # ── Stage parameters from closed-case history ────────────────────────── non_zero_claims <- closed |> filter(claim_ngn > 0, !is.na(claim_ngn)) |> pull(claim_ngn) # Trim the top 2.5% of claims to reduce single-case outlier dominance claims_trim <- non_zero_claims[ non_zero_claims <= quantile(non_zero_claims, 0.975, na.rm = TRUE) ] log_mean_t <- mean(log(claims_trim), na.rm = TRUE) log_sd_t <- sd(log(claims_trim), na.rm = TRUE) prob_claim_yn <- length(non_zero_claims) / nrow(closed) # share of cases carrying a claim prob_loss <- 0.40 # historical loss / settle rate settlement_disc <- 0.20 # fraction of claim actually paid n_active <- nrow(new_cases) tibble(Parameter = c("Probability a case has a quantified claim", "Probability PNL loses / settles a case (conservative)", "Settlement discount applied to lost cases", "log-mean of claim distribution (trimmed)", "log-sd of claim distribution (trimmed)", "Active cases simulated"), Value = c(round(prob_claim_yn, 3), prob_loss, settlement_disc, round(log_mean_t, 2), round(log_sd_t, 2), n_active)) |> kbl(caption = "Table 3: Monte Carlo input parameters") |> kable_styling(bootstrap_options = c("striped","hover"), full_width = FALSE) ``` ```{r 
mc-simulate} # ── Three-stage simulation ──────────────────────────────────────────── set.seed(8321) sim_totals <- replicate(10000, { has_claim <- rbinom(n_active, 1, prob_claim_yn) # Stage 1 loses <- rbinom(n_active, 1, prob_loss) # Stage 2 amounts <- rlnorm(n_active, log_mean_t, log_sd_t) # Stage 3 sum(has_claim * loses * amounts * settlement_disc) }) var_95 <- quantile(sim_totals, 0.95) var_99 <- quantile(sim_totals, 0.99) med_v2 <- median(sim_totals) tibble( Statistic = c("Median exposure", "90th percentile", "95th percentile (VaR 95)", "99th percentile (VaR 99)"), `NGN Billions` = round( c(med_v2, quantile(sim_totals, 0.90), var_95, var_99) / 1e9, 2) ) |> kbl(caption = "Table 4: Monte Carlo exposure distribution (10,000 simulations)") |> kable_styling(bootstrap_options = c("striped","hover"), full_width = FALSE) ``` ```{r mc-histogram} #| fig-cap: "Figure 4: Simulated annual portfolio exposure with VaR markers" tibble(total_bn = sim_totals / 1e9) |> ggplot(aes(x = total_bn)) + geom_histogram(bins = 60, fill = "#d7191c", alpha = 0.7, colour = "white") + geom_vline(xintercept = med_v2 / 1e9, linetype = "dashed", colour = "#2c7bb6", linewidth = 0.9) + geom_vline(xintercept = var_95 / 1e9, linetype = "dotdash", colour = "#fdae61", linewidth = 0.9) + geom_vline(xintercept = var_99 / 1e9, linetype = "solid", colour = "#1a1a1a", linewidth = 0.9) + annotate("text", x = med_v2 / 1e9, y = Inf, vjust = 2, label = paste0("Median ~NGN ", round(med_v2 / 1e9, 1), "B"), colour = "#2c7bb6", fontface = "bold", hjust = -0.05) + annotate("text", x = var_95 / 1e9, y = Inf, vjust = 4, label = paste0("VaR 95 ~NGN ", round(var_95 / 1e9, 1), "B"), colour = "#b45f06", fontface = "bold", hjust = -0.05) + labs(title = "Annual Litigation Exposure — Simulated Distribution", subtitle = "10,000 Monte Carlo runs across the active portfolio", x = "Total annual exposure (NGN billions)", y = "Frequency") + coord_cartesian(xlim = c(0, quantile(sim_totals / 1e9, 0.995))) + theme_minimal(base_size 
= 12) ``` ## Interpretation The simulated distribution is heavily right-skewed: the median annual exposure sits at about ₦`r round(med_v2/1e9, 1)` billion, but the 95th percentile climbs to ₦`r round(var_95/1e9, 1)` billion and the 99th percentile beyond. This shape is the financial signature of a portfolio exposed to a small number of very large claims sitting alongside a tail of routine matters. The Board should provision against the central estimate but disclose the VaR figures separately as sensitivity. The implied IFRS 37 provision sensitivity is the gap between Median and VaR 95 — large enough that even a modest reduction in the assumed loss rate (through earlier settlement) compounds into a material balance-sheet benefit. # Advanced Forecasting ## Theory Autoregressive Integrated Moving Average (ARIMA) models decompose a time series into autoregressive, integrated (differencing), and moving-average components to produce stationary, unbiased forecasts with calibrated confidence intervals (Box, Jenkins, & Reinsel, 2015). `auto.arima()` from the **forecast** package selects the optimal parameter combination (p, d, q) via AIC minimisation. ## Business Justification Forecasting monthly case intake enables PNL's legal department to: (a) plan outside counsel retainer budgets before year-end; (b) request additional headcount in advance of peak filing periods; and (c) signal to the CFO whether litigation activity is structurally declining or simply reflecting temporary lulls. A credible statistical forecast is more defensible in budget negotiations than a simple year-on-year comparison. 
## Analysis

```{r forecast-series}
# ── Build monthly time-series (2017–2024, both datasets) ──────────────
all_dates <- bind_rows(
  new_cases |>
    filter(!is.na(date_received),
           date_received >= as.Date("2017-01-01"),
           date_received <= as.Date("2024-12-31")) |>
    select(date_received),
  closed |>
    filter(!is.na(date_received),
           date_received >= as.Date("2017-01-01"),
           date_received <= as.Date("2024-12-31")) |>
    select(date_received)
) |>
  mutate(ym = floor_date(date_received, "month")) |>
  count(ym, name = "n_cases")

# Complete monthly grid so that months with no intake appear as zeros
full_grid <- tibble(ym = seq(min(all_dates$ym), max(all_dates$ym), by = "month"))

monthly_ts_df <- full_grid |>
  left_join(all_dates, by = "ym") |>
  replace_na(list(n_cases = 0L))

ts_monthly <- ts(monthly_ts_df$n_cases,
                 start = c(year(min(monthly_ts_df$ym)), month(min(monthly_ts_df$ym))),
                 frequency = 12)

cat("Series:", length(ts_monthly), "months |",
    format(min(monthly_ts_df$ym)), "to", format(max(monthly_ts_df$ym)), "\n")
```

```{r forecast-fit}
# Exhaustive search over candidate ARIMA orders (no stepwise shortcut)
fit_arima <- auto.arima(ts_monthly, stepwise = FALSE, approximation = FALSE)
fc2 <- forecast(fit_arima, h = 12)
cat("Selected model:", fc2$method, "\n")
cat("AIC:", round(fit_arima$aic, 2), "\n")
print(summary(fit_arima))
```

```{r forecast-plot}
#| fig-cap: "Figure 5: Monthly case-intake forecast (12-month horizon)"
autoplot(fc2) +
  labs(title = "Monthly Case Intake Forecast (ARIMA)",
       subtitle = paste0("Model: ", fc2$method,
                         " | 12-month horizon | 80% and 95% prediction intervals"),
       x = "Year", y = "New cases per month") +
  theme_minimal(base_size = 12) +
  theme(legend.position = "bottom")
```

```{r forecast-table}
# Label forecast months from the last observed month rather than a hard-coded
# start date, so the table stays correct if the series ends before Dec 2024
fc_months <- seq(max(monthly_ts_df$ym), by = "month", length.out = 13)[-1]

as_tibble(fc2) |>
  mutate(Month = format(fc_months, "%b %Y")) |>
  select(Month,
         `Point forecast` = `Point Forecast`,
         `80% lower` = `Lo 80`, `80% upper` = `Hi 80`,
         `95% lower` = `Lo 95`, `95% upper` = `Hi 95`) |>
  mutate(across(where(is.numeric), ~ round(.x, 1))) |>
  kbl(caption = "Table 5: 12-month-ahead forecast of monthly case intake") |>
  kable_styling(bootstrap_options = c("striped","hover"), full_width = FALSE)
```

## Interpretation

`auto.arima()` selects ARIMA(0,1,1), a first-order moving-average model on a once-differenced series. This is the ARIMA form of simple exponential smoothing: case intake behaves like a random walk whose level is smoothed from month to month. Point forecasts hover near one case per month, and the prediction intervals are wide enough that anything from zero to a handful of cases in a month is consistent with the data. For budgeting purposes the practical floor is **10–15 new cases per year**, which brackets the roughly 12 implied by the point forecast; the wide intervals are themselves the message that legal-budget contingency should be set generously rather than tightly. Once 24+ additional months of consistently captured data are available, a SARIMA or Prophet model could test for court-term seasonality.

# People Analytics (Counsel Workload and Concentration)

## Theory

People analytics applies human-resources and organisational-behaviour methods to workforce data. In a legal-operations context, the "workforce" comprises outside counsel. Key metrics include caseload distribution (how many active cases each firm carries), the Herfindahl-Hirschman Index (HHI) for concentration risk, and historical win-rate proxies by firm and dispute category (Marr, 2018).

## Business Justification

Concentrating too many cases in a single firm creates operational risk: if that firm has a conflict of interest, loses a key partner, or under-performs, PNL faces sudden exposure across multiple simultaneous matters. Conversely, spreading cases across too many firms raises supervision costs and dilutes institutional knowledge. People analytics quantifies these trade-offs and identifies which firms are approaching overload.

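As a quick illustration of the HHI named under Theory, the sketch below computes the index for two invented caseload profiles. The firm counts and the helper name `hhi()` are assumptions made for this example and are not PNL data. It shows that a panel can score in the "unconcentrated" band even when one firm holds a large share of the work.

```r
# Hypothetical caseloads only (not taken from PNL's register)
toy_even   <- rep(10, 10)          # ten firms with ten cases each
toy_skewed <- c(30, rep(2, 35))    # one firm with 30 cases, 35 firms with 2 each

# HHI = sum of squared percentage shares, on the 0-10,000 scale
hhi <- function(cases) {
  share_pct <- 100 * cases / sum(cases)
  sum(share_pct^2)
}

hhi(toy_even)     # 10 firms x 10^2 = 1000
hhi(toy_skewed)   # 30^2 + 35 x 2^2 = 1040

# Share held by the largest firm in the skewed profile
max(toy_skewed) / sum(toy_skewed)   # 0.3
```

Both profiles fall below 1,500, yet in the second one a single firm carries 30% of the work. This is why a headline HHI is best read alongside the top-firm share.
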
## Analysis

```{r people-caseload}
# Active cases per counsel, largest first
counsel_active <- new_cases |>
  count(counsel, name = "active_cases") |>
  arrange(desc(active_cases))

# Same table without the "Unassigned" bucket
current_load <- counsel_active |>
  filter(counsel != "Unassigned") |>
  arrange(desc(active_cases))

current_load |>
  head(15) |>
  kbl(caption = "Table 6: Active caseload by outside-counsel firm (top 15)") |>
  kable_styling(bootstrap_options = c("striped","hover"), full_width = FALSE)
```

```{r people-hhi}
# ── Herfindahl-Hirschman Index ────────────────────────────────────────
assigned_only  <- new_cases |> filter(counsel != "Unassigned")
total_assigned <- nrow(assigned_only)

market_shares <- assigned_only |>
  count(counsel, name = "n") |>
  mutate(share = n / total_assigned)

# Sum of squared percentage shares (0–10,000 scale)
hhi_val <- sum((market_shares$share * 100)^2)

cat(sprintf(
  "HHI (assigned cases) = %.0f\n (<1500 = unconcentrated; 1500–2500 = moderate; >2500 = concentrated)\n",
  hhi_val))
```

```{r people-barchart}
#| fig-cap: "Figure 6: Top firms by active caseload (red line = overload threshold)"
current_load |>
  head(12) |>
  mutate(counsel = fct_reorder(counsel, active_cases)) |>
  ggplot(aes(x = active_cases, y = counsel)) +
  geom_col(fill = "#756bb1") +
  geom_vline(xintercept = 20, linetype = "dashed", colour = "red", linewidth = 0.8) +
  annotate("text", x = 21, y = 1.5, label = "Overload threshold",
           colour = "red", hjust = 0, fontface = "bold") +
  geom_text(aes(label = active_cases), hjust = -0.2, fontface = "bold") +
  scale_x_continuous(expand = expansion(c(0, 0.18))) +
  labs(title = "Active Caseload by Outside Counsel Firm",
       x = "Active cases", y = NULL) +
  theme_minimal(base_size = 12)
```

```{r people-heatmap}
#| fig-cap: "Figure 7: Counsel × dispute-category exposure"
heat_df <- new_cases |>
  filter(counsel != "Unassigned") |>
  count(counsel, dispute_cat) |>
  group_by(counsel) |>
  mutate(tot = sum(n)) |>
  ungroup() |>
  filter(tot >= 4) |>                       # firms with ≥4 active cases
  mutate(counsel = fct_reorder(counsel, tot))

ggplot(heat_df, aes(dispute_cat, counsel, fill = n)) +
  geom_tile(colour = "white") +
  geom_text(aes(label = n), colour = "white", fontface = "bold") +
  scale_fill_distiller(palette = "BuPu", direction = 1) +
  labs(title = "Counsel × Dispute Category — Active Caseload",
       x = NULL, y = NULL, fill = "Cases") +
  theme_minimal(base_size = 11) +
  theme(axis.text.x = element_text(angle = 20, hjust = 1))
```

## Interpretation

Workload is sharply skewed: Henry Yekovie & Co. carries the largest block of active cases, while the long tail consists of single-engagement firms and data-quality artefacts. The HHI sits in the *unconcentrated* range, but that masks the visible concentration at the very top: removing the lead firm from the panel would create immediate exposure across several dispute categories simultaneously. The category heat-map shows that community / environmental matters are not concentrated with any designated specialist firm, a finding that directly informs the LP allocation in the next section.

# Optimisation (Linear Programme for Counsel Assignment)

## Theory

Linear programming (LP) optimises a linear objective function subject to linear inequality and equality constraints (Hillier & Lieberman, 2015). Here, the decision variables are **binary assignments of unassigned cases to panel firms**, so the model is strictly a binary integer programme. The objective function maximises a portfolio-weighted quality score (in the current implementation, a heuristic built from firm tier and court type rather than a measured track record), while constraints enforce per-firm capacity limits and prevent further overloading of already-busy firms.

## Business Justification

Of the 237 active cases, 61 (26%) currently have no assigned outside counsel. Each day without counsel assignment is a day without a litigation strategy, potentially leading to default judgments, missed interlocutory deadlines, and increased exposure. LP provides an objective, auditable allocation that management can defend to the Board.

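Written out, the assignment model described under Theory takes roughly the following form. The symbols are notation introduced here for illustration and do not appear in the source: $x_{ij}$ equals 1 if unassigned case $j$ goes to panel firm $i$, $s_{ij}$ is the quality score of that pairing, $e_i$ is firm $i$'s existing active caseload, $K$ is the per-firm capacity ceiling, $F$ is the number of panel firms and $C$ the number of unassigned cases.

$$
\begin{aligned}
\max_{x}\quad & \sum_{i=1}^{F}\sum_{j=1}^{C} s_{ij}\,x_{ij} \\
\text{subject to}\quad & \sum_{i=1}^{F} x_{ij} = 1, && j = 1,\dots,C \\
& \sum_{j=1}^{C} x_{ij} \le \max(0,\; K - e_i), && i = 1,\dots,F \\
& x_{ij} \in \{0,1\}
\end{aligned}
$$

The first constraint gives every case exactly one firm; the second caps each firm's new work at its remaining headroom. The model is feasible only if total headroom across the panel is at least $C$.
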
## Analysis

```{r lp-setup}
# ── Quality-score function (court fit + firm tier) ────────────────────
panel_12 <- current_load |> head(12) |> pull(counsel)

quality_score <- function(firm, court) {
  # Note: court_bonus depends only on the case, so it adds the same amount
  # whichever firm is chosen; the allocation is driven by firm_tier and capacity.
  court_bonus <- case_when(
    court %in% c("Court of Appeal","Supreme Court") ~ 0.15,
    court == "Federal High Court"                   ~ 0.10,
    court == "State High Court"                     ~ 0.05,
    TRUE                                            ~ 0.00
  )
  firm_tier <- case_when(
    firm %in% c("Henry Yekovie & Co.","J.A. Omose & Associates",
                "Albert Akpomudje SAN & Partners","Solola & Akpana") ~ 0.30,
    firm %in% c("Garnet & Hawthorns Solicitors",
                "Obilor Akudihor & Associates",
                "The Principles Law Partnership",
                "Gary Hawkins Solicitors")                           ~ 0.25,
    TRUE                                                             ~ 0.20
  )
  firm_tier + court_bonus
}

unassigned_cases <- new_cases |>
  filter(counsel == "Unassigned") |>
  select(sn, case_name, court_type, dispute_cat)

n_ua    <- nrow(unassigned_cases)
n_firms <- length(panel_12)

# Rows = firms, columns = unassigned cases
score_mat <- outer(panel_12, unassigned_cases$court_type, quality_score)

cat("Unassigned cases:", n_ua, " | Panel firms:", n_firms, "\n")
```

```{r lp-solve}
# ── LP formulation: x[i,j] = 1 if case j is assigned to firm i ───────
# Variables are ordered firm by firm: index = (i - 1) * n_ua + j
n_vars  <- n_firms * n_ua
obj_vec <- as.vector(t(score_mat))

# Constraint 1: each case assigned to exactly one firm
A_case <- matrix(0, nrow = n_ua, ncol = n_vars)
for (j in seq_len(n_ua)) {
  idx_cols <- seq(j, n_vars, by = n_ua)
  A_case[j, idx_cols] <- 1
}

# Constraint 2: per-firm capacity ceiling (existing + new ≤ cap)
capacity_cap <- 45L
A_firm <- matrix(0, nrow = n_firms, ncol = n_vars)
for (i in seq_len(n_firms)) {
  A_firm[i, ((i - 1) * n_ua + 1):(i * n_ua)] <- 1
}
b_firm_max <- pmax(0L, capacity_cap - current_load$active_cases[1:n_firms])

A_all   <- rbind(A_case, A_firm)
b_all   <- c(rep(1, n_ua), b_firm_max)
dir_all <- c(rep("=", n_ua), rep("<=", n_firms))

lp_result <- lp(direction    = "max",
                objective.in = obj_vec,
                const.mat    = A_all,
                const.rhs    = b_all,
                const.dir    = dir_all,
                all.bin      = TRUE)

cat("LP status:", ifelse(lp_result$status == 0, "Optimal solution found", "No solution"), "\n")
cat("Objective value:", round(lp_result$objval, 3), "\n")

# Stop here if the LP is infeasible (e.g. spare capacity < unassigned cases);
# otherwise the extraction below would silently assign every case to firm 1
stopifnot(lp_result$status == 0)
```

```{r lp-results}
# ── Extract assignments ──────────────────────────────────────────────
sol_mat <- matrix(round(lp_result$solution), nrow = n_firms, byrow = TRUE)

assignment <- unassigned_cases |>
  mutate(assigned_firm = panel_12[apply(sol_mat, 2, which.max)])

assignment_summary <- assignment |>
  count(assigned_firm, name = "new_cases_assigned") |>
  left_join(current_load, by = c("assigned_firm" = "counsel")) |>
  mutate(active_cases = replace_na(active_cases, 0L),
         total_after  = active_cases + new_cases_assigned) |>
  arrange(desc(new_cases_assigned)) |>
  rename(Firm = assigned_firm,
         `Existing` = active_cases,
         `Newly assigned` = new_cases_assigned,
         `Total after assignment` = total_after)

assignment_summary |>
  kbl(caption = "Table 7: LP-optimal assignment of previously unassigned cases") |>
  kable_styling(bootstrap_options = c("striped","hover"), full_width = FALSE)
```

```{r lp-chart}
#| fig-cap: "Figure 8: Caseload before and after LP assignment"
assignment_summary |>
  pivot_longer(c(Existing, `Newly assigned`),
               names_to = "Status", values_to = "Cases") |>
  mutate(Firm = fct_reorder(Firm, Cases, sum)) |>
  ggplot(aes(Cases, Firm, fill = Status)) +
  geom_col() +
  scale_fill_manual(values = c("Existing" = "#756bb1", "Newly assigned" = "#fdae61")) +
  labs(title = "Caseload Before and After LP Assignment",
       x = "Active cases", y = NULL, fill = NULL) +
  theme_minimal(base_size = 12)
```

## Interpretation

The LP places every previously unassigned case with a panel firm while respecting the 45-case ceiling. The objective-value figure is the *portfolio-weighted quality score* the allocation achieves: higher is better under a given set of weights. Re-running the LP with adjusted weights (say, raising the score multiplier for community-case specialists) immediately produces an alternative legal-strategy plan whose assignments can be compared with the baseline.

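The re-weighting idea can be tried on a toy instance. The sketch below is illustrative only: the two firms, three cases, scores, capacity figures, the 1.5 multiplier and the helper `solve_assignment()` are all assumptions invented for the example. It uses the same `lp()` call from lpSolve as the analysis above.

```r
library(lpSolve)

# Toy data only: two hypothetical firms, three hypothetical cases
base_scores <- matrix(c(0.40, 0.35, 0.30,    # firm 1 scores for cases 1-3
                        0.30, 0.30, 0.30),   # firm 2 scores for cases 1-3
                      nrow = 2, byrow = TRUE)
is_community   <- c(FALSE, TRUE, TRUE)       # assumed case flags
spare_capacity <- c(2, 2)                    # assumed free slots per firm

solve_assignment <- function(scores, capacity) {
  n_f <- nrow(scores); n_c <- ncol(scores)
  case_id <- rep(seq_len(n_c), times = n_f)  # variables ordered firm by firm
  firm_id <- rep(seq_len(n_f), each  = n_c)
  # One row per case (assigned exactly once) and one row per firm (capacity)
  A_case <- t(sapply(seq_len(n_c), function(j) as.numeric(case_id == j)))
  A_firm <- t(sapply(seq_len(n_f), function(i) as.numeric(firm_id == i)))
  res <- lp(direction = "max", objective.in = as.vector(t(scores)),
            const.mat = rbind(A_case, A_firm),
            const.dir = c(rep("=", n_c), rep("<=", n_f)),
            const.rhs = c(rep(1, n_c), capacity), all.bin = TRUE)
  stopifnot(res$status == 0)                 # fail loudly if infeasible
  sol <- matrix(round(res$solution), nrow = n_f, byrow = TRUE)
  list(objval = res$objval, firm_for_case = apply(sol, 2, which.max))
}

baseline <- solve_assignment(base_scores, spare_capacity)

# Alternative weighting: treat firm 2 as the community specialist (x1.5)
alt_scores <- base_scores
alt_scores[2, is_community] <- alt_scores[2, is_community] * 1.5
alternative <- solve_assignment(alt_scores, spare_capacity)

baseline$firm_for_case      # 1 1 2
alternative$firm_for_case   # 1 2 2
```

In this toy run the multiplier moves the second case from firm 1 to firm 2. Comparing the two assignment vectors is more informative than comparing the two objective values, because those are computed under different weights.
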
The two practical insights for the Board: (i) every unassigned case has an immediate panel-firm home, removing the 26% strategy gap; and (ii) the lead firm is held at the cap, gently rebalancing the portfolio towards firms that currently sit below the overload line.

# Integrated Findings

The five analyses converge on a single strategic message: **PNL's litigation portfolio is under-managed relative to its financial scale, and the costs of inaction compound across multiple risk dimensions simultaneously.**

- **Text analytics** reveals that the dominant case type is declaratory / procedural, with community grievances surfacing repeatedly. This pattern suggests a proactive community-engagement programme would address disputes earlier, at lower cost, than court-based resolution.
- **Monte Carlo simulation** quantifies the stakes: the median annual payout is around ₦1.2 billion, but the 95th-percentile tail reaches \~₦25 billion. The tail risk is driven by a small number of high-value community and regulatory claims, the same case types that text analytics identifies as recurring.
- **Forecasting** shows monthly case intake has declined sharply since 2022, but the ARIMA model's wide prediction intervals and the likely administrative under-recording of recent dates mean this trend should be treated with caution. Budget planning should use 10–15 new cases per year as a floor.
- **People analytics** uncovers a workload imbalance: one firm carries the largest block of active cases while 26% of cases have no counsel at all. The HHI masks this structural gap. The category heat-map shows no firm has been explicitly assigned community / environmental specialist status, even though those cases carry the largest claims.
- **Optimisation** resolves the assignment gap immediately: all unassigned cases are distributed across panel firms using a principled, auditable rule that respects capacity and maximises the heuristic quality score (firm tier plus court type).

**Single integrated recommendation:** Implement a **three-track triage protocol**.

1. **Track 1 (high-value claims, ≥ ₦500 million):** assign only to Tier 1 firms (Henry Yekovie, J.A. Omose, Albert Akpomudje SAN) and initiate settlement assessment within 30 days.
2. **Track 2 (community / environmental matters):** brief Garnet & Hawthorns and Obilor Akudihor as designated specialists and mandate early community dialogue.
3. **Track 3 (routine declaratory matters):** use the LP assignment output to distribute to under-loaded firms and monitor monthly.

Review the allocation model quarterly using updated caseload figures.

# Limitations and Further Work

**Data quality:** A significant fraction of active cases lacks date-received entries and almost all cases lack financial claim values. Imputing dates from context (e.g., suit-number year prefixes) and collecting claim values from court pleadings would substantially improve the Monte Carlo model's precision.

**Model assumptions:** The Monte Carlo's 40% loss rate and 20% settlement discount are conservative assumptions, not empirical estimates derived from PNL's own closed-case history. With more resolved cases that include explicit outcome classifications ("PNL won", "settled at ₦X"), the loss rate could be estimated by logistic regression against case characteristics (court type, dispute category, counsel, claim value), and the settlement discount from the recorded settlement amounts.

**Forecasting:** The ARIMA model's prediction intervals are very wide (standard deviation around 2.7 cases per month), reflecting both genuine variability and the short time series. With five more years of consistently recorded data, a seasonal ARIMA (SARIMA) or Prophet model could capture any quarterly court-term seasonality.

**Text analytics:** The remark field is written in informal legal prose with inconsistent punctuation. A more sophisticated pipeline (named-entity recognition, sentence-level sentiment classification) would extract more nuanced outcome signals.

**People analytics:** The Herfindahl-Hirschman Index treats all cases as equivalent. A weighted HHI (by claim value or strategic importance) would more accurately reflect concentration risk.

**Optimisation:** The LP currently uses a heuristic quality-score function based on firm tier and court type. Because the court-type bonus is the same whichever firm receives a case, it does not yet influence the allocation, which is driven by firm tier and capacity. Calibrating the weights against actual historical outcome data (win rates by firm × court × dispute category) would transform the model from a triage tool into a predictive assignment system.

# References

- Berkelaar, M., et al. (2023). *lpSolve: Interface to Lp_solve v. 5.5 to solve linear/integer programs* (R package). <https://CRAN.R-project.org/package=lpSolve>
- Box, G. E. P., Jenkins, G. M., & Reinsel, G. C. (2015). *Time series analysis: Forecasting and control* (5th ed.). Wiley.
- Hillier, F. S., & Lieberman, G. J. (2015). *Introduction to operations research* (10th ed.). McGraw-Hill.
- Hyndman, R. J., & Athanasopoulos, G. (2021). *Forecasting: Principles and practice* (3rd ed.). OTexts. <https://otexts.com/fpp3/>
- Hyndman, R. J., et al. (2023). *forecast: Forecasting functions for time series and linear models* (R package version 8.21.1). <https://pkg.robjhyndman.com/forecast/>
- Marr, B. (2018). *Data-driven HR: How to use analytics and metrics to drive performance*. Kogan Page.
- Silge, J., & Robinson, D. (2017). *Text mining with R: A tidy approach*. O'Reilly. <https://www.tidytextmining.com/>
- Vose, D. (2008). *Risk analysis: A quantitative guide* (3rd ed.). Wiley.
- Wickham, H., et al. (2019). Welcome to the tidyverse. *Journal of Open Source Software, 4*(43), 1686. <https://doi.org/10.21105/joss.01686>

# Appendix: AI Usage Statement

Posit Assistant (an AI coding assistant integrated into RStudio) was used to help structure the Quarto document template, debug R code for reading and cleaning the multi-header Excel file, and suggest the three-stage Monte Carlo formulation.

All analytical decisions (the choice of techniques, the interpretation of outputs, the model parameters of a 40% loss rate, 20% settlement discount and 45-case capacity ceiling, the TF-IDF stop-word list, and the LP quality-score function) were made independently by the analyst, drawing on professional experience in Nigerian upstream oil-and-gas litigation and the assigned course materials. The AI did not have access to confidential case files or to legally privileged advice. All code was reviewed, tested, and executed locally in the analyst's RStudio environment. The integrated recommendation in Section 10 reflects the analyst's professional judgement, not automated output.