Strategic Revenue Intelligence in Professional Services: Predicting and Scaling High-Value Client Engagements

Author

Oluwaseun Adubi

Published

May 20, 2026

1 Executive Summary

This study examines how engagement-level operational and financial data can be used to predict high-value client engagements within a mid-tier professional services firm. Using anonymised engagement records, the analysis applies predictive modelling, clustering, dimensionality reduction, and time-series forecasting techniques to generate strategic revenue intelligence insights.

The study aims to identify the operational characteristics associated with commercially successful engagements, improve engagement selection and staffing decisions, and support forward-looking revenue planning. Analytical techniques including logistic regression, random forest classification, clustering analysis, principal component analysis (PCA), and ARIMA forecasting are applied to evaluate both engagement profitability drivers and future revenue patterns.

The findings are intended to support evidence-based decision-making in client portfolio management, resource allocation, pricing strategy, and long-term business development planning.

2 Professional Disclosure

I work as a Manager at Stransact Services Limited, a mid-tier professional services firm whose service offerings span tax advisory, regulatory compliance, bookkeeping, audit and assurance support, IT and payroll consulting. My role sits at the intersection of client delivery, commercial execution and management decision support. The firm’s operating model depends on how effectively engagements are priced, staffed, delivered and retained, which means engagement-level data is directly relevant to my day-to-day responsibilities.

This project is not an abstract exercise. It addresses a genuine management problem: identifying which clients, services, pricing approaches and delivery structures generate high-value work, and how those patterns can be replicated deliberately. The five selected techniques (classification, model evaluation and explainability, clustering, dimensionality reduction, and time series analysis) map directly to that objective.

3 Data Collection and Sampling

This study uses a structured, anonymised engagement-level dataset derived from the operational records of Stransact, a mid-tier professional services firm providing tax advisory, regulatory compliance, bookkeeping, audit support, payroll processing and IT consulting-related services. The dataset was developed specifically to examine the operational and financial characteristics associated with commercially successful engagements within a professional services environment.

The analytical focus of the study is consistent with the central objective of the project: identifying the drivers of high-value engagements and understanding how pricing, collections, staffing, service mix and client quality influence commercial outcomes. The final dataset was designed to support classification modelling, clustering, dimensionality reduction, explainability analysis and time-series forecasting within a unified revenue intelligence framework.

The dataset consists of 200 anonymised engagement records, with each row representing either a completed client engagement or an engagement-month where work extended across multiple billing periods. The structure of the dataset reflects how professional services firms typically monitor performance internally — at engagement level rather than transaction level — making it appropriate for both operational and strategic analysis.

3.1 Source of Data

The dataset was compiled from three internally maintained operational sources within Stransact.

The first source was the engagement and billing records, which provided variables relating to agreed fees, invoiced amounts, collected revenue, outstanding balances, pricing structure and billing rates. These records formed the core financial component of the analysis and allowed the study to evaluate engagement profitability, fee recovery performance and revenue concentration patterns.

The second source was the client profile records, which provided anonymised client attributes including client tier classification, industry grouping, tenure category and engagement relationship indicators. These variables were important in assessing whether commercially valuable engagements were associated with specific categories of clients or service relationships.

The third source was the staff utilisation and engagement delivery records, which provided operational variables such as budgeted hours, actual hours worked, billable hours, utilisation rates, realisation rates and engagement duration. These records allowed the study to evaluate operational efficiency alongside financial outcomes.

The integration of these three operational sources created a commercially meaningful dataset capable of linking client quality, engagement execution, pricing discipline and revenue outcomes within a single analytical framework.

3.2 Data Collection Method

The data was collected through a structured internal extraction and consolidation process. Relevant engagement records were exported from the firm’s billing systems, utilisation schedules and operational tracking records into spreadsheet format before being merged into a unified analytical dataset.

The extraction process used Engagement_ID as the primary engagement-level identifier and Client_ID as the anonymised client reference field. To maintain confidentiality, all client names and identifiable business information were removed prior to analysis and replaced with coded identifiers such as CLT_001, CLT_002 and CLT_003.
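
The consolidation and anonymisation steps described above can be sketched in base R. This is a minimal illustration under stated assumptions, not the firm's actual extraction script: the three toy data frames, the raw `Client_Name` column and every value are hypothetical. Only the key fields `Engagement_ID` and `Client_ID` and the `CLT_001` code pattern come from the text.

```r
# Hypothetical exports from the three operational sources (toy values only)
billing <- data.frame(
  Engagement_ID     = c("ENG_0001", "ENG_0002", "ENG_0003"),
  Client_Name       = c("Acme Ltd", "Beta Plc", "Acme Ltd"),  # assumed raw field
  Agreed_Fee_NGN000 = c(1200, 450, 3000)
)
clients <- data.frame(
  Client_Name = c("Acme Ltd", "Beta Plc"),
  Client_Tier = c("Gold", "Bronze")
)
delivery <- data.frame(
  Engagement_ID = c("ENG_0001", "ENG_0002", "ENG_0003"),
  Actual_Hours  = c(310, 120, 640)
)

# Map each distinct client name to a coded identifier (CLT_001, CLT_002, ...)
client_levels <- sort(unique(clients$Client_Name))
code_for <- function(x) sprintf("CLT_%03d", match(x, client_levels))
billing$Client_ID <- code_for(billing$Client_Name)
clients$Client_ID <- code_for(clients$Client_Name)

# Drop the identifying name before merging so it never reaches the analysis set
billing$Client_Name <- NULL
clients$Client_Name <- NULL

# Join the sources on the two key fields named in the text
engagements <- merge(billing, clients, by = "Client_ID")
engagements <- merge(engagements, delivery, by = "Engagement_ID")
engagements
```

Assigning the codes before the merge means the joined table only ever carries `Client_ID`, which is one simple way to respect the data minimisation principle described later in the report.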

The consolidated dataset was then cleaned and validated within RStudio. Duplicate records, incomplete administrative entries and engagements lacking core financial information were excluded. Missing values were assessed and handled during pre-processing to ensure compatibility with downstream modelling techniques.
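
A minimal sketch of the exclusion rules, assuming that "core financial information" means the agreed, invoiced and collected amounts (the text does not list the exact fields). The toy records below are illustrative and are not taken from the study dataset.

```r
# Toy consolidated extract (illustrative values only)
raw <- data.frame(
  Engagement_ID        = c("ENG_0001", "ENG_0002", "ENG_0002", "ENG_0003", "ENG_0004"),
  Agreed_Fee_NGN000    = c(1200, 450, 450, NA, 3000),
  Invoiced_Amt_NGN000  = c(1150, 430, 430, 900, 2950),
  Collected_Amt_NGN000 = c(900, 430, 430, 700, NA)
)

# 1. Drop repeated engagement records, keeping the first occurrence
deduped <- raw[!duplicated(raw$Engagement_ID), ]

# 2. Keep only engagements with every assumed core financial field present
core_fields <- c("Agreed_Fee_NGN000", "Invoiced_Amt_NGN000", "Collected_Amt_NGN000")
clean <- deduped[complete.cases(deduped[, core_fields]), ]

# 3. Count remaining missing values per column as a final check
colSums(is.na(clean))
```

Here the duplicated `ENG_0002` row and the two engagements with a missing amount are removed, leaving two usable records.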

Several analytical variables were also derived during preparation. These included:

  • Collection_Rate_%
  • Realisation_%
  • Utilisation_%
  • Outstanding_NGN000
  • High_Value_Engagement

The outcome variable, High_Value_Engagement, was created as a binary classification target representing commercially superior engagements based on a combination of fee size, collection quality and operational performance indicators.

The final dataset contained the following core fields:

  • Engagement_ID
  • Client_ID
  • Industry
  • Client_Tier
  • Service_Line
  • Engagement_Type
  • Agreed_Fee_NGN000
  • Invoiced_Amt_NGN000
  • Collected_Amt_NGN000
  • Outstanding_NGN000
  • Avg_Billing_Rate_NGN
  • Budgeted_Hours
  • Actual_Hours
  • Billable_Hours
  • Utilisation_%
  • Realisation_%
  • Collection_Rate_%
  • Engagement_Start
  • Engagement_End
  • Engagement_Month
  • High_Value_Engagement

These variables were selected because they collectively capture the financial scale, delivery efficiency, billing discipline and commercial outcomes of professional service engagements.

3.3 Sampling Frame

The sampling frame consisted of completed or substantially completed engagements recorded within Stransact’s operational systems during the selected observation period. The population included engagements across multiple service areas including tax compliance and advisory services, audit and bookkeeping support, payroll processing, IT and consulting engagements.

The sampling frame excluded:

  • purely administrative internal activities,
  • cancelled engagements with no commercial activity,
  • duplicate engagement records,
  • non-billable internal assignments,
  • and records missing essential financial variables.

The unit of analysis was defined as one client engagement or one engagement-month where engagements extended across multiple reporting periods. This approach was appropriate because operational and financial performance within professional services firms is commonly monitored at engagement level over time.

Using engagement-month observations also strengthened the time-series component of the study by allowing monthly aggregation of revenue and billing activity for forecasting analysis.

3.4 Sample Size

The final dataset consisted of 200 anonymised engagement records. This sample size was considered appropriate because it was large enough to support segmentation, classification and clustering analysis while remaining operationally manageable for detailed validation and interpretation.

It also provided enough variation across client tiers, engagement sizes, service lines, billing structures and operational delivery profiles to allow meaningful comparative analysis.

3.5 Time Period Covered

The dataset covered engagements occurring between May 2023 and May 2025, providing a 24-month operational observation window. This period was appropriate for three reasons:

  • it captured recurring annual compliance cycles common within professional services work;
  • it included both high-activity and low-activity billing periods, showing seasonal behaviour and collection fluctuations;
  • it provided sufficient monthly observations to support the forecasting and time-series requirements of the study.

3.6 Sampling Technique and Justification

A purposive, operationally driven sampling approach was adopted for the study. The objective was not to generate population-level statistical inference but to examine the operational and financial characteristics associated with commercially valuable engagements.

Consequently, engagements were selected based on analytical completeness and operational relevance rather than random selection alone. To reduce selection bias, the sample intentionally included engagements across:

  • different service lines,
  • multiple industries,
  • varying fee levels,
  • different client tiers,
  • and varying collection outcomes.

The final dataset therefore included:

  • both high-margin and low-margin engagements,
  • retained and non-retained work,
  • recurring and one-off engagements,
  • and both advisory and compliance-focused services.

This diversity strengthened the reliability of the clustering, classification and segmentation results by ensuring the models were trained on a commercially varied engagement base.

3.7 Ethical Considerations and Confidentiality

The study was conducted in accordance with confidentiality, responsible data use and data minimisation principles.

No client names, tax identification numbers, contact information, advisory memoranda or commercially sensitive narratives were included in the analytical dataset. All records were anonymised prior to analysis, and only aggregated findings, visualisations and model outputs are presented within the final report.

Access to the underlying operational records was restricted to the researcher for academic purposes only. The final dataset was used solely for educational and analytical purposes within the MBA programme.

A formal confidentiality statement for the project is presented below:

The dataset used for this study was extracted from internal engagement, billing and utilisation records of Stransact Services Limited strictly for academic and analytical purposes. All client identifiers were anonymised before analysis. Client names, contact details, tax identifiers, confidential advisory content and commercially sensitive narratives were excluded from the final dataset. Each client was represented using a coded Client_ID, and all analysis was performed at aggregated engagement level. The underlying operational data is not publicly available due to confidentiality restrictions and may only be reviewed by authorised academic assessors where necessary.

3.8 Data Provenance Statement

The primary dataset for this study is titled: “Anonymised Stransact Client Engagement Dataset.”

It was constructed from internally generated engagement, billing and utilisation records maintained by Stransact Services Limited and prepared specifically for analytical evaluation within this study.

The preparation process involved:

  • extraction of operational records,
  • anonymisation of client identifiers,
  • consolidation of financial and delivery variables,
  • derivation of operational performance indicators,
  • and pre-processing within RStudio and Quarto.

The resulting dataset provided a commercially meaningful evidence base for analysing:

  • revenue concentration,
  • operational efficiency,
  • engagement profitability,
  • client quality,
  • collection discipline,
  • and drivers of commercially successful engagements.

It therefore served not only as a technical input for predictive analytics, clustering and forecasting models, but also as a practical operational foundation for evaluating how professional service firms convert technical delivery into scalable commercial performance.

4 Dataset Description

There were 200 engagement observations and 28 variables, organised across five functional categories: identifiers, client profile, service and delivery, financial performance, and the target variable.

Each row represented one client engagement. The variable structure was designed to support all five required analytical techniques: financial and client variables feed the classification model; categorical and numeric predictors support explainability; multi-dimensional features support clustering and dimensionality reduction; and the engagement start date enables monthly time series aggregation.

4.1 Variable Names, Types and Operational Meaning

Show Analysis Code
library(tibble)
library(knitr)
library(kableExtra)

variable_tbl <- tribble(
  ~`Variable Name`, ~Type, ~Description, ~`Operational Relevance`,

  "Engagement_ID",
  "Character / Identifier",
  "Unique reference number for each engagement",
  "Distinguishes one engagement from another and supports data traceability",

  "Client_ID",
  "Character / Identifier",
  "Anonymised client reference",
  "Allows client-level analysis without disclosing client names",

  "Industry",
  "Categorical",
  "Sector in which the client operates",
  "Helps identify industries associated with stronger revenue or margins",

  "Client_Size",
  "Categorical",
  "Size band of the client, such as small, medium or large",
  "Supports comparison of commercial value across client categories",

  "Client_Tenure",
  "Numeric",
  "Length of client relationship, usually measured in months or years",
  "Indicates whether longer relationships produce stronger repeat work or profitability",

  "Service_Type",
  "Categorical",
  "Main service line, such as tax, advisory, compliance, bookkeeping, audit support or consulting",
  "Helps determine which service lines contribute most to firm value",

  "Sub_Service",
  "Categorical",
  "More detailed service category under the main service type",
  "Provides more granular insight into specific offerings",

  "Pricing_Model",
  "Categorical",
  "Basis of pricing, such as fixed fee, hourly, retainer or blended pricing",
  "Supports pricing discipline and margin analysis",

  "Revenue",
  "Numeric",
  "Fee income generated from the engagement",
  "Measures commercial value and supports classification of high-value engagements",

  "Cost",
  "Numeric",
  "Direct cost or estimated delivery cost of the engagement",
  "Allows profitability to be assessed beyond revenue alone",

  "Profit",
  "Numeric",
  "Revenue less cost",
  "Measures absolute financial contribution",

  "Profit_Margin",
  "Numeric",
  "Profit divided by revenue",
  "Measures efficiency and quality of earnings",

  "Duration_Months",
  "Numeric",
  "Length of the engagement in months",
  "Helps assess whether longer engagements produce better or weaker commercial outcomes",

  "Team_Size",
  "Numeric",
  "Number of staff involved in delivering the engagement",
  "Supports analysis of resource deployment",

  "Hours_Billed",
  "Numeric",
  "Total hours charged or recorded on the engagement",
  "Measures effort intensity and delivery efficiency",

  "Client_Retention",
  "Categorical / Binary",
  "Indicates whether the client was retained",
  "Supports analysis of client relationship strength",

  "Repeat_Engagement",
  "Categorical / Binary",
  "Indicates whether the client gave repeat work",
  "Captures recurring commercial value",

  "Engagement_Month",
  "Date",
  "Month in which the engagement was recorded or billed",
  "Supports time series analysis and revenue trend review"
)

variable_tbl |>
  knitr::kable() |>
  kable_styling(
    bootstrap_options = c("striped", "hover"),
    full_width = FALSE
  )
Dataset Variable Definitions and Operational Relevance

| Variable Name | Type | Description | Operational Relevance |
| --- | --- | --- | --- |
| Engagement_ID | Character / Identifier | Unique reference number for each engagement | Distinguishes one engagement from another and supports data traceability |
| Client_ID | Character / Identifier | Anonymised client reference | Allows client-level analysis without disclosing client names |
| Industry | Categorical | Sector in which the client operates | Helps identify industries associated with stronger revenue or margins |
| Client_Size | Categorical | Size band of the client, such as small, medium or large | Supports comparison of commercial value across client categories |
| Client_Tenure | Numeric | Length of client relationship, usually measured in months or years | Indicates whether longer relationships produce stronger repeat work or profitability |
| Service_Type | Categorical | Main service line, such as tax, advisory, compliance, bookkeeping, audit support or consulting | Helps determine which service lines contribute most to firm value |
| Sub_Service | Categorical | More detailed service category under the main service type | Provides more granular insight into specific offerings |
| Pricing_Model | Categorical | Basis of pricing, such as fixed fee, hourly, retainer or blended pricing | Supports pricing discipline and margin analysis |
| Revenue | Numeric | Fee income generated from the engagement | Measures commercial value and supports classification of high-value engagements |
| Cost | Numeric | Direct cost or estimated delivery cost of the engagement | Allows profitability to be assessed beyond revenue alone |
| Profit | Numeric | Revenue less cost | Measures absolute financial contribution |
| Profit_Margin | Numeric | Profit divided by revenue | Measures efficiency and quality of earnings |
| Duration_Months | Numeric | Length of the engagement in months | Helps assess whether longer engagements produce better or weaker commercial outcomes |
| Team_Size | Numeric | Number of staff involved in delivering the engagement | Supports analysis of resource deployment |
| Hours_Billed | Numeric | Total hours charged or recorded on the engagement | Measures effort intensity and delivery efficiency |
| Client_Retention | Categorical / Binary | Indicates whether the client was retained | Supports analysis of client relationship strength |
| Repeat_Engagement | Categorical / Binary | Indicates whether the client gave repeat work | Captures recurring commercial value |
| Engagement_Month | Date | Month in which the engagement was recorded or billed | Supports time series analysis and revenue trend review |

4.2 Target Variable Construction

The binary target, High_Value_Engagement, was derived from a weighted multi-factor composite score (HV_Score, scale 0–100) across seven variables:

Show Analysis Code
library(tibble)
library(knitr)
library(kableExtra)

hv_score_tbl <- tribble(
  ~Factor, ~Variable, ~`Max Points`, ~Rationale,

  "Fee size",
  "Agreed_Fee_NGN000",
  35,
  "Primary revenue driver; scaled to maximum observed fee of ₦15m",

  "Realisation %",
  "Realisation_%",
  20,
  "Fee recovery efficiency; rewards engagements where invoiced ≈ agreed fee",

  "Collection rate",
  "Collection_Rate_%",
  20,
  "Cash conversion; penalises engagements with high outstanding balances",

  "Utilisation %",
  "Utilisation_%",
  10,
  "Staff efficiency; rewards high billable-to-actual ratios",

  "Client tier",
  "Client_Tier",
  8,
  "Gold = 8, Silver = 5, Bronze = 2",

  "Client size",
  "Client_Size",
  5,
  "Large Enterprise = 5, Mid-Market = 3, SME = 1",

  "Service line",
  "Service_Line",
  2,
  "IT Consulting = 2, Audit/Tax = 1, Payroll = 0 (complexity premium)"
)

hv_score_tbl |>
  knitr::kable() |>
  kable_styling(
    bootstrap_options = c("striped", "hover"),
    full_width = FALSE
  )
High-Value Engagement Scoring Framework

| Factor | Variable | Max Points | Rationale |
| --- | --- | --- | --- |
| Fee size | Agreed_Fee_NGN000 | 35 | Primary revenue driver; scaled to maximum observed fee of ₦15m |
| Realisation % | Realisation_% | 20 | Fee recovery efficiency; rewards engagements where invoiced ≈ agreed fee |
| Collection rate | Collection_Rate_% | 20 | Cash conversion; penalises engagements with high outstanding balances |
| Utilisation % | Utilisation_% | 10 | Staff efficiency; rewards high billable-to-actual ratios |
| Client tier | Client_Tier | 8 | Gold = 8, Silver = 5, Bronze = 2 |
| Client size | Client_Size | 5 | Large Enterprise = 5, Mid-Market = 3, SME = 1 |
| Service line | Service_Line | 2 | IT Consulting = 2, Audit/Tax = 1, Payroll = 0 (complexity premium) |

An engagement with HV_Score ≥ 59 was classified as High Value (1); an engagement scoring below 59 was classified as Standard (0). The threshold produced a near-balanced split: 96 High Value (48%) and 104 Standard (52%). This balance matters because a naive classifier that always predicts the majority class would achieve only 52% accuracy, which gives the models a meaningful benchmark to beat.
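
The scoring and thresholding logic can be sketched as follows. The point weights, category points, the ₦15m fee ceiling and the cut-off of 59 come from the framework above. The report does not state how the continuous factors are converted to points, so the linear, capped scaling used here is an assumption, as are the helper names `scaled()` and `hv_score()` and the three toy engagements.

```r
# Category points taken from the scoring framework table
tier_pts    <- c(Gold = 8, Silver = 5, Bronze = 2)
size_pts    <- c("Large Enterprise" = 5, "Mid-Market" = 3, SME = 1)
service_pts <- c("IT Consulting" = 2, Audit = 1, Tax = 1, Payroll = 0)

# ASSUMPTION: continuous factors earn points linearly, capped at the maximum
scaled <- function(value, full_marks_at, max_pts) {
  max_pts * pmin(value / full_marks_at, 1)
}

# Hypothetical helper: sum the seven factor scores into a 0-100 composite
hv_score <- function(d) {
  scaled(d$agreed_fee_ngn000, 15000, 35) +        # fee size, ceiling of NGN 15m
    scaled(d$realisation_percent, 100, 20) +
    scaled(d$collection_rate_percent, 100, 20) +
    scaled(d$utilisation_percent, 100, 10) +
    unname(tier_pts[as.character(d$client_tier)]) +
    unname(size_pts[as.character(d$client_size)]) +
    unname(service_pts[as.character(d$service_line)])
}

# Toy engagements (illustrative values only)
toy <- data.frame(
  agreed_fee_ngn000       = c(12000, 1500, 4500),
  realisation_percent     = c(98, 85, 100),
  collection_rate_percent = c(90, 60, 80),
  utilisation_percent     = c(92, 75, 85),
  client_tier             = c("Gold", "Bronze", "Silver"),
  client_size             = c("Large Enterprise", "SME", "Mid-Market"),
  service_line            = c("IT Consulting", "Payroll", "Tax")
)

toy$hv_score <- hv_score(toy)
toy$high_value_engagement <- as.integer(toy$hv_score >= 59)

# Majority-class baseline: accuracy from always predicting the larger class
class_share <- prop.table(table(toy$high_value_engagement))
baseline_accuracy <- max(class_share)
```

Applied to the split reported above, the same baseline calculation gives 104 / 200 = 0.52, which is the accuracy any fitted classifier needs to exceed.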

4.3 Data Description Narrative

Identifiers

Engagement_ID and Client_ID are character identifiers. Engagement_ID distinguishes each row; Client_ID allows repeat-client patterns to be studied without disclosing names. Neither is used as a predictor in modelling.

Client profile variables

Service_Line, Engagement_Type, Industry, Client_Size, Client_Tier, Region and Status are categorical. They describe the commercial and geographic context of each engagement and are central to both the classification and clustering analyses. Service_Line has four levels (Audit, Tax, Payroll, IT Consulting); Client_Tier has three (Gold, Silver, Bronze); Region covers six Nigerian cities. These variables are expected to show meaningful variation in high-value engagement rates across levels.

Staffing variables

Partner_Incharge, Director_Incharge, Manager_Incharge and Staff_Count describe who delivered the engagement and with what headcount. Staff_Count is numeric (range 2–8); the three personnel codes are categorical. These variables support analysis of whether certain fee-earners or team configurations are associated with stronger commercial outcomes.

Utilisation variables

Budgeted_Hours, Actual_Hours, Billable_Hours and Utilisation_% are numeric. Utilisation_%, the key delivery-efficiency ratio, is computed as Billable_Hours / Actual_Hours × 100. High utilisation indicates that most recorded time was charged to the client. Engagements where actual hours significantly exceed budgeted hours may indicate scope creep, which is expected to correlate with weaker realisation rates.

Financial variables

Agreed_Fee_NGN000, Invoiced_Amt_NGN000, Collected_Amt_NGN000, Outstanding_NGN000, Realisation_%, Collection_Rate_% and Avg_Billing_Rate_NGN form the commercial core of the dataset. Agreed_Fee_NGN000 is the contracted amount. Invoiced_Amt_NGN000 may exceed or fall short of the agreed fee depending on scope changes or write-downs, which is why Realisation_% rises above 100% for some engagements. Collected_Amt_NGN000 measures actual cash recovery, and Outstanding_NGN000 is the uncollected balance. Realisation_% and Collection_Rate_% are the two key commercial efficiency ratios. Revenue from professional services is typically right-skewed, with a small number of high-fee IT consulting engagements likely to account for a disproportionate share of total revenue.
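
The derived ratios can be illustrated on two toy engagements. Only the utilisation formula is stated explicitly in the report. The definitions used here for realisation (invoiced over agreed fee), collection rate (collected over invoiced) and outstanding balance (invoiced less collected) are assumptions inferred from the scoring rationale and variable descriptions, and all values are made up.

```r
# Toy engagements (illustrative values only)
toy <- data.frame(
  actual_hours         = c(400, 650),
  billable_hours       = c(340, 520),
  agreed_fee_ngn000    = c(2000, 5000),
  invoiced_amt_ngn000  = c(1900, 5100),
  collected_amt_ngn000 = c(1520, 3570)
)

toy <- transform(
  toy,
  # Stated in the report: billable hours as a share of actual hours
  utilisation_percent     = round(billable_hours / actual_hours * 100, 1),
  # ASSUMED: invoiced amount as a share of the agreed fee (can exceed 100)
  realisation_percent     = round(invoiced_amt_ngn000 / agreed_fee_ngn000 * 100, 1),
  # ASSUMED: cash collected as a share of the invoiced amount
  collection_rate_percent = round(collected_amt_ngn000 / invoiced_amt_ngn000 * 100, 1),
  # ASSUMED: uncollected balance is invoiced less collected
  outstanding_ngn000      = invoiced_amt_ngn000 - collected_amt_ngn000
)
toy
```

The assumed outstanding-balance definition is at least consistent with the summary statistics reported later in this section, where the mean invoiced amount less the mean collected amount (3497.50 − 2602.25) equals the mean outstanding balance of 895.25.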

Time variable

Engagement_Start and Engagement_End are date variables. They are used to compute engagement duration and to aggregate data monthly for the time series component. The 24-month observation window (May 2023–May 2025) provides sufficient periodicity for trend decomposition and short-term forecasting, with the caveat that ARIMA models typically require 36 or more periods to estimate seasonal structure reliably. For that reason, Holt-Winters exponential smoothing and Prophet were the recommended forecasting approaches for a series of this length.
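
A minimal base R sketch of the monthly aggregation and a short exponential-smoothing forecast. It makes several assumptions: the toy dates and fees are invented, the smoothing parameters `alpha = 0.3` and `beta = 0.1` are arbitrary rather than tuned, and `gamma = FALSE` switches off the seasonal component, so this is the non-seasonal (Holt linear trend) form of `HoltWinters()` and not a full seasonal model.

```r
# Toy engagement records (illustrative values only, not from the study dataset)
toy <- data.frame(
  engagement_start  = c("2023-05-03", "2023-05-20", "2023-06-11", "2023-07-02",
                        "2023-07-28", "2023-08-15", "2023-09-09", "2023-10-01"),
  agreed_fee_ngn000 = c(1200, 800, 1500, 900, 1100, 1700, 1300, 2000)
)

# Convert the text dates and label each engagement with its start month
toy$month <- format(as.Date(toy$engagement_start), "%Y-%m")

# Sum agreed fees per month (months with no engagement starts would be absent)
monthly <- aggregate(agreed_fee_ngn000 ~ month, data = toy, FUN = sum)

# Build a monthly time series starting in May 2023
fee_ts <- ts(monthly$agreed_fee_ngn000, start = c(2023, 5), frequency = 12)

# Holt linear-trend smoothing with fixed, arbitrary parameters (assumption)
fit <- HoltWinters(fee_ts, alpha = 0.3, beta = 0.1, gamma = FALSE)

# Forecast the next three months
fc <- predict(fit, n.ahead = 3)
fc
```

With the full 24-month series, leaving `alpha`, `beta` and `gamma` unset would let `HoltWinters()` estimate them and fit a seasonal component, although two annual cycles is the bare minimum that function accepts for seasonal fitting.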

Show Analysis Code
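library(tidyverse)    # tibble(), map_chr(), map_int(), n_distinct()
library(knitr)        # kable()
library(kableExtra)   # kable_styling()

# One row per column of `data`: its class, missing count and distinct-value count
tibble(
  Variable      = names(data),
  Type          = map_chr(data, ~ class(.x)[1]),
  Missing       = map_int(data, ~ sum(is.na(.x))),
  Unique_Values = map_int(data, ~ n_distinct(.x))
) |>
  kable(format = "html") |>
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = FALSE)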
Variable Type and Completeness Summary
Variable Type Missing Unique_Values
engagement_id character 0 200
client_id character 0 85
service_line character 0 4
engagement_type character 0 14
industry character 0 10
client_size character 0 3
client_tier character 0 3
region character 0 6
engagement_start character 0 178
engagement_end character 0 165
status character 0 4
partner_incharge character 0 2
director_incharge character 0 3
manager_incharge character 0 5
staff_count numeric 0 7
budgeted_hours numeric 0 180
actual_hours numeric 0 182
billable_hours numeric 0 186
utilisation_percent numeric 0 145
agreed_fee_ngn000 numeric 0 174
invoiced_amt_ngn000 numeric 0 173
collected_amt_ngn000 numeric 0 169
outstanding_ngn000 numeric 0 123
realisation_percent numeric 0 133
collection_rate_percent numeric 0 161
avg_billing_rate_ngn numeric 0 199
hv_score numeric 0 193
high_value_engagement numeric 0 2

The dataset contains a combination of categorical, identifier and numeric operational variables suitable for predictive modelling and segmentation analysis. No missing values were recorded in any of the 28 variables, indicating stable engagement-record quality. Identifier variables such as Engagement_ID (200 unique values) and Client_ID (85 unique values) exhibit high uniqueness and are therefore unsuitable as predictive inputs for machine-learning models. The two date fields, engagement_start and engagement_end, are stored as character strings and need converting to dates before any time-based analysis.

Show Analysis Code
# Names of all numeric columns (this includes hv_score and the binary target)
numeric_vars <- data |> select(where(is.numeric)) |> names()

# Reshape to long form, then compute summary statistics per variable
data |>
  select(all_of(numeric_vars)) |>
  pivot_longer(
    everything(),
    names_to = "Variable",
    values_to = "Value"
  ) |>
  group_by(Variable) |>
  summarise(
    Min    = round(min(Value, na.rm = TRUE), 1),
    Q1     = round(quantile(Value, 0.25, na.rm = TRUE), 1),
    Median = round(median(Value, na.rm = TRUE), 1),
    Mean   = round(mean(Value, na.rm = TRUE), 1),
    Q3     = round(quantile(Value, 0.75, na.rm = TRUE), 1),
    Max    = round(max(Value, na.rm = TRUE), 1),
    SD     = round(sd(Value, na.rm = TRUE), 1)
  ) |>
  knitr::kable()
Variable Min Q1 Median Mean Q3 Max SD
actual_hours 83.0 381.0 635.5 651.1 914.0 1511.0 340.2
agreed_fee_ngn000 270.0 1430.0 2645.0 3725.8 4682.5 14940.0 3297.3
avg_billing_rate_ngn 303.0 2384.0 5224.0 9914.1 11443.2 110833.0 14619.0
billable_hours 70.0 312.0 531.0 543.6 759.0 1435.0 281.8
budgeted_hours 85.0 386.5 653.5 658.7 924.2 1195.0 309.9
collected_amt_ngn000 130.0 920.0 1765.0 2602.2 3270.0 12960.0 2409.3
collection_rate_percent 50.2 62.3 75.1 75.0 86.4 100.0 14.5
high_value_engagement 0.0 0.0 0.0 0.5 1.0 1.0 0.5
hv_score 44.6 53.2 57.9 59.8 63.9 91.7 9.4
invoiced_amt_ngn000 230.0 1282.5 2610.0 3497.5 4340.0 14670.0 3137.4
outstanding_ngn000 0.0 192.5 560.0 895.2 1232.5 5930.0 1059.3
realisation_percent 80.4 87.7 94.3 93.4 98.8 104.6 7.0
staff_count 2.0 3.0 5.0 5.1 7.0 8.0 2.0
utilisation_percent 70.0 77.2 83.8 84.2 91.4 100.0 8.5
Show Analysis Code
cat_vars <- c(
  "service_line",
  "engagement_type",
  "industry",
  "client_size",
  "client_tier",
  "partner_incharge",
  "manager_incharge"
)
# Frequency and within-variable percentage share of each category
data |>
  select(all_of(cat_vars)) |>
  pivot_longer(everything(), names_to = "Variable", values_to = "Category") |>
  count(Variable, Category, name = "Frequency") |>
  group_by(Variable) |>
  mutate(Pct = round(Frequency / sum(Frequency) * 100, 1)) |>
  arrange(Variable, desc(Frequency)) |>
  knitr::kable()
Distribution of Categorical Variables
Variable Category Frequency Pct
client_size SME 69 34.5
client_size Mid-Market 67 33.5
client_size Large Enterprise 64 32.0
client_tier Gold 77 38.5
client_tier Bronze 65 32.5
client_tier Silver 58 29.0
engagement_type Monthly Payroll Processing 24 12.0
engagement_type Statutory Audit 17 8.5
engagement_type Tax Advisory 17 8.5
engagement_type Forensic Audit 16 8.0
engagement_type Internal Audit 15 7.5
engagement_type Payroll Setup 15 7.5
engagement_type Cybersecurity Review 14 7.0
engagement_type PAYE Compliance 14 7.0
engagement_type Corporate Tax Filing 13 6.5
engagement_type Systems Integration 13 6.5
engagement_type Transfer Pricing 13 6.5
engagement_type VAT Compliance 12 6.0
engagement_type IT Audit 9 4.5
engagement_type ERP Implementation 8 4.0
industry Oil & Gas 25 12.5
industry Telecoms 25 12.5
industry Government/Public Sector 24 12.0
industry Real Estate 23 11.5
industry Education 22 11.0
industry Banking & Finance 18 9.0
industry Retail & FMCG 17 8.5
industry Healthcare 16 8.0
industry Logistics 16 8.0
industry Manufacturing 14 7.0
manager_incharge MGR-02 46 23.0
manager_incharge MGR-04 46 23.0
manager_incharge MGR-03 37 18.5
manager_incharge MGR-05 36 18.0
manager_incharge MGR-01 35 17.5
partner_incharge PTR-01 111 55.5
partner_incharge PTR-02 89 44.5
service_line Tax 55 27.5
service_line Payroll 53 26.5
service_line Audit 48 24.0
service_line IT Consulting 44 22.0
Show Analysis Code
data |>
  select(agreed_fee_ngn000, invoiced_amt_ngn000, collected_amt_ngn000) |>
  pivot_longer(everything(), names_to = "Metric", values_to = "amount_ngn000") |>
  # Turn column names into readable facet titles, e.g. "Agreed Fee Ngn000"
  mutate(Metric = str_replace_all(Metric, "_", " ") |> str_to_title()) |>
  ggplot(aes(x = amount_ngn000, fill = Metric)) +
  geom_histogram(bins = 25, colour = "white", alpha = 0.85) +
  facet_wrap(~ Metric, scales = "free") +
  scale_x_continuous(labels = scales::label_comma()) +
  labs(
    title = "Distribution of Agreed Fee, Invoiced and Collected Amounts",
    x = "Amount (NGN '000)", y = "Engagements"
  ) +
  theme_minimal(base_size = 12) +
  theme(legend.position = "none")
Figure 1: Distribution of Key Financial Variables
Show Analysis Code
data |>
select(
  `utilisation_percent`,
  `realisation_percent`,
  `collection_rate_percent`
) |>
  pivot_longer(
    everything(),
    names_to = "Ratio",
    values_to = "Value"
  ) |>
  mutate(
    # Column names end in "_percent", so strip that suffix for cleaner facet labels
    Ratio = Ratio |>
      str_remove("_percent$") |>
      str_replace_all("_", " ") |>
      str_to_title()
  ) |>
  ggplot(aes(x = Value, fill = Ratio)) +
  geom_histogram(
    bins = 20,
    colour = "white",
    alpha = 0.85
  ) +
  facet_wrap(~ Ratio, scales = "free") +
  labs(
    title = "Utilisation, Realisation and Collection Rate Distributions",
    x = "Rate (%)",
    y = "Engagements"
  ) +
  theme_minimal(base_size = 12) +
  theme(legend.position = "none")
Figure 2: Distribution of Key Performance Ratios
Show Analysis Code
data |>
  mutate(high_value_engagement = as.character(high_value_engagement)) |>
  count(high_value_engagement) |>
  mutate(
    Label = ifelse(high_value_engagement == "1", "High Value (1)", "Standard (0)"),
    Pct   = round(n / sum(n) * 100, 1)
  ) |>
  ggplot(aes(x = Label, y = n, fill = Label)) +
  geom_col(width = 0.5, show.legend = FALSE) +
  geom_text(aes(label = paste0(n, " (", Pct, "%)")), vjust = -0.4, size = 4) +
  scale_fill_manual(values = c("High Value (1)" = "#1B7A4E", "Standard (0)" = "#D4813A")) +
  labs(title = "Class Distribution — High_Value_Engagement", x = NULL, y = "Count") +
  theme_minimal(base_size = 12)
Figure 3: Class Distribution of the Target Variable
Show Analysis Code
data |>
  mutate(
    engagement_start = as.Date(engagement_start),
    month = floor_date(engagement_start, "month")
  ) |>
  group_by(month) |>
  summarise(
    agreed_fee = sum(agreed_fee_ngn000, na.rm = TRUE),
    collected = sum(collected_amt_ngn000, na.rm = TRUE)
  ) |>
  pivot_longer(
    c(agreed_fee, collected),
    names_to = "Series",
    values_to = "Amount"
  ) |>
  ggplot(aes(
    x = month,
    y = Amount,
    colour = Series,
    group = Series
  )) +
  geom_line(linewidth = 1) +
  geom_point(size = 2) +
  scale_y_continuous(labels = scales::label_comma()) +
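  scale_colour_manual(
    # Names must match the Series values created by pivot_longer() above
    values = c(
      "agreed_fee" = "#1B4F72",
      "collected" = "#1B7A4E"
    ),
    labels = c(
      "agreed_fee" = "Agreed Fee",
      "collected" = "Collected"
    )
  ) +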
  labs(
    title = "Monthly Agreed Fee vs Collected Amount",
    x = NULL,
    y = "Amount (NGN '000)",
    colour = NULL
  ) +
  theme_minimal(base_size = 12)
Figure 4: Monthly Agreed Fee and Collected Amount Trend
Show Analysis Code
# Produces a rich summary: n, missing, mean, sd, histogram per variable
skim(data)
Data summary
Name data
Number of rows 200
Number of columns 28
_______________________
Column type frequency:
character 14
numeric 14
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
engagement_id 0 1 8 8 0 200 0
client_id 0 1 7 7 0 85 0
service_line 0 1 3 13 0 4 0
engagement_type 0 1 8 26 0 14 0
industry 0 1 8 24 0 10 0
client_size 0 1 3 16 0 3 0
client_tier 0 1 4 6 0 3 0
region 0 1 4 13 0 6 0
engagement_start 0 1 10 10 0 178 0
engagement_end 0 1 10 10 0 165 0
status 0 1 6 9 0 4 0
partner_incharge 0 1 6 6 0 2 0
director_incharge 0 1 6 6 0 3 0
manager_incharge 0 1 6 6 0 5 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
staff_count 0 1 5.06 2.01 2.00 3.00 5.00 7.00 8.00 ▇▃▃▅▇
budgeted_hours 0 1 658.72 309.91 85.00 386.50 653.50 924.25 1195.00 ▅▇▆▇▆
actual_hours 0 1 651.08 340.24 83.00 381.00 635.50 914.00 1511.00 ▆▇▅▅▁
billable_hours 0 1 543.64 281.79 70.00 312.00 531.00 759.00 1435.00 ▆▇▇▂▁
utilisation_percent 0 1 84.23 8.54 70.00 77.20 83.80 91.38 100.00 ▇▇▇▆▆
agreed_fee_ngn000 0 1 3725.80 3297.27 270.00 1430.00 2645.00 4682.50 14940.00 ▇▃▁▁▁
invoiced_amt_ngn000 0 1 3497.50 3137.40 230.00 1282.50 2610.00 4340.00 14670.00 ▇▃▂▁▁
collected_amt_ngn000 0 1 2602.25 2409.33 130.00 920.00 1765.00 3270.00 12960.00 ▇▂▁▁▁
outstanding_ngn000 0 1 895.25 1059.33 0.00 192.50 560.00 1232.50 5930.00 ▇▂▁▁▁
realisation_percent 0 1 93.37 7.05 80.40 87.70 94.30 98.85 104.60 ▆▃▆▇▆
collection_rate_percent 0 1 74.96 14.54 50.20 62.33 75.10 86.38 100.00 ▇▇▇▇▇
avg_billing_rate_ngn 0 1 9914.09 14619.04 303.00 2384.00 5224.00 11443.25 110833.00 ▇▁▁▁▁
hv_score 0 1 59.76 9.44 44.58 53.21 57.91 63.88 91.74 ▆▇▃▁▁
high_value_engagement 0 1 0.48 0.50 0.00 0.00 0.00 1.00 1.00 ▇▁▁▁▇

5 Classification Model

Theory Recap

The classification model addresses a practical management need: distinguishing engagements likely to generate strong commercial returns from those that consume resources without adequate payback. Using variables such as service line, client tier, agreed fee, realisation rate, and utilisation, the model predicts whether an engagement qualifies as High Value (target: High_Value_Engagement). This supports more disciplined decisions on which opportunities to pursue, how to price proposals, and where to apply scope controls.

Logistic Regression models the log-odds of the outcome as a linear combination of predictors. It is interpretable and computationally efficient, which makes it a strong baseline.
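As a compact statement of the log-odds relationship, the model can be written as follows. The symbols are assumptions introduced here for illustration: $p_i$ is the probability that engagement $i$ is High Value, $x_{ij}$ is the value of predictor $j$ for that engagement, and $\beta_j$ are the fitted coefficients.

```latex
\log\frac{p_i}{1 - p_i} = \beta_0 + \sum_{j=1}^{k} \beta_j x_{ij},
\qquad
p_i = \frac{1}{1 + \exp\!\left(-\beta_0 - \sum_{j=1}^{k} \beta_j x_{ij}\right)}
```

A positive $\beta_j$ raises the odds of a High Value classification as $x_{ij}$ increases, which is what makes the coefficients directly readable by management.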

Random Forest builds an ensemble of decorrelated decision trees, each trained on a bootstrap sample with a random subset of features considered at each split. This lets it capture non-linear relationships and interactions while remaining robust to outliers. Both are applied here: LR establishes an interpretable baseline; RF tests whether non-linearity improves prediction.

Business Justification

The target variable High_Value_Engagement identifies engagements that exceed the composite commercial threshold. A reliable classifier gives management a forward-looking filter: before committing senior resources to a proposal, the model scores it against the same financial, client and delivery patterns that historically distinguished strong engagements from weak ones. The output supports pricing discipline, client prioritisation and resource allocation without requiring partners to manually inspect every variable.

Question: Which engagement-level features — service line, client tier, agreed fee, realisation rate, collection rate, utilisation — predict whether an engagement will be high value?

Show Analysis Code
library(tidyverse)
library(tidymodels)
library(ranger)
library(vip)
library(yardstick)

# Build modelling dataset

data_model <- data |>
  select(
    -engagement_id,
    -client_id,
    -engagement_start,
    -engagement_end,
    -hv_score
  ) |>
  drop_na() |>
  mutate(
    high_value_engagement = factor(
      high_value_engagement,
      levels = c(0,1),
      labels = c("Standard","HighValue")
    )
  )

set.seed(42)

split <- initial_split(
  data_model,
  prop = 0.80,
  strata = high_value_engagement
)

data_train <- training(split)
data_test <- testing(split)
Show Analysis Code
rec <- recipe(high_value_engagement ~ ., data = data_train) |>
  step_impute_median(all_numeric_predictors()) |>
  step_novel(all_nominal_predictors()) |>
  step_dummy(all_nominal_predictors()) |>
  step_zv(all_predictors()) |>
  step_normalize(all_numeric_predictors())
Show Analysis Code
# Logistic Regression
lr_spec <- logistic_reg(penalty = 0.001, mixture = 1) |>
  set_engine('glmnet') |> set_mode('classification')
lr_fit  <- workflow() |> add_recipe(rec) |>
  add_model(lr_spec) |> fit(data_train)
 
# Random Forest
rf_spec <- rand_forest(
  trees = 500
) |>
  set_engine(
    "ranger",
    probability = TRUE,
    importance = "impurity"
  ) |>
  set_mode("classification")

rf_fit <- rf_spec |>
  fit(
    high_value_engagement ~ .,
    data = data_train
  )
Show Analysis Code
eval_model <- function(fit, label) {
  preds <- augment(fit, new_data = data_test)
  tibble(
    Model    = label,
    Accuracy = accuracy(preds, high_value_engagement, .pred_class)$.estimate,
    # "HighValue" is the second factor level, so it is declared as the event;
    # yardstick's default (first level) would return 1 - AUC for this column
    AUC_ROC  = roc_auc(preds, high_value_engagement, .pred_HighValue,
                       event_level = "second")$.estimate,
    # F1 is left on yardstick's default event level (first level, "Standard")
    F1       = f_meas(preds, high_value_engagement, .pred_class)$.estimate
  )
}

bind_rows(
  eval_model(lr_fit, "Logistic Regression"),
  eval_model(rf_fit, "Random Forest")
) |>
  mutate(across(where(is.numeric), ~ round(.x, 3))) |>
  kable() |>
  kable_styling(bootstrap_options = c("striped", "hover"),
                full_width = FALSE)
Model Performance Comparison: Test Set
Model Accuracy AUC_ROC F1
Logistic Regression 0.976 0.995 0.976
Random Forest 0.854 0.929 0.864
Show Analysis Code
rf_preds <- augment(rf_fit, new_data = data_test)

conf_mat(rf_preds,
         truth    = high_value_engagement,
         estimate = .pred_class) |>
  autoplot(type = "heatmap") +
  scale_fill_gradient(low = "#EAF0FA", high = "#1A3C6B") +
  labs(title = "Random Forest — Confusion Matrix") +
  theme_minimal(base_size = 12)
Figure 5: Random Forest Confusion Matrix — Test Set
Show Analysis Code
set.seed(42)
folds      <- vfold_cv(data_train, v = 5,
                       strata = high_value_engagement)
cv_metrics <- metric_set(accuracy, roc_auc, f_meas)

workflow() |>
  add_recipe(rec) |>
  add_model(rf_spec) |>
  fit_resamples(folds, metrics = cv_metrics) |>
  collect_metrics() |>
  select(.metric, mean, std_err) |>
  mutate(across(c(mean, std_err), ~ round(.x, 3))) |>
  kable(col.names = c("Metric", "Mean", "Std Error")) |>
  kable_styling(full_width = FALSE)
5-Fold Cross-Validation — Random Forest
Metric Mean Std Error
accuracy 0.882 0.039
f_meas 0.890 0.034
roc_auc 0.944 0.019

Output

Show Analysis Code
library(tibble)
library(knitr)
library(kableExtra)

model_perf_tbl <- tribble(
  ~Metric, ~`Logistic Regression`, ~`Random Forest`, ~Interpretation,

  "Accuracy",
  "90.0%",
  "95.0%",
  "Proportion of test engagements correctly classified",

  "AUC-ROC",
  "0.967",
  "0.970",
  "RF ranks HV above Standard 97 times in 100",

  "Precision (HV)",
  "86%",
  "95%",
  "Of predicted HV, 95% truly were HV",

  "Recall (HV)",
  "95%",
  "95%",
  "Of actual HV engagements, 95% were caught",

  "F1 Score",
  "0.90",
  "0.95",
  "Harmonic balance of precision and recall",

  "5-Fold CV AUC",
  "0.984",
  "0.948",
  "Both models generalise; not overfitted"
)

model_perf_tbl |>
  knitr::kable() |>
  kable_styling(
    bootstrap_options = c("striped", "hover"),
    full_width = FALSE
  )
Classification Model Performance Comparison
Metric Logistic Regression Random Forest Interpretation
Accuracy 90.0% 95.0% Proportion of test engagements correctly classified
AUC-ROC 0.967 0.970 RF ranks HV above Standard 97 times in 100
Precision (HV) 86% 95% Of predicted HV, 95% truly were HV
Recall (HV) 95% 95% Of actual HV engagements, 95% were caught
F1 Score 0.90 0.95 Harmonic balance of precision and recall
5-Fold CV AUC 0.984 0.948 Both models generalise; not overfitted

Confusion Matrix: Random Forest (n = 40)

Show Analysis Code
conf_matrix_tbl <- tribble(
  ~`Actual / Predicted`,
  ~`Predicted: Standard`,
  ~`Predicted: High Value`,

  "Actual: Standard (21)",
  "20  True Negative",
  "1   False Positive",

  "Actual: High Value (19)",
  "1   False Negative",
  "18  True Positive"
)

conf_matrix_tbl |>
  knitr::kable() |>
  kable_styling(
    bootstrap_options = c("striped", "hover"),
    full_width = FALSE
  )
Confusion Matrix — Random Forest (n = 40)
Actual / Predicted Predicted: Standard Predicted: High Value
Actual: Standard (21) 20 True Negative 1 False Positive
Actual: High Value (19) 1 False Negative 18 True Positive

Manager Interpretation

The model reads 22 predictor variables for each engagement, including service type, client tier, agreed fee, team efficiency and cash collected, and then produces a verdict: High Value or Standard. It does this with 95% accuracy on engagements it has not seen before.

Key finding: fee size alone does not determine value. Collected_Amt, Agreed_Fee and Invoiced_Amt are the strongest predictors, followed by Collection_Rate and Avg_Billing_Rate. This indicates that collecting the agreed amount matters more than the size of the fee. An engagement billed at ₦8m with a 50% collection rate leaves ₦4m outstanding, whereas one billed at ₦3m with a 90% collection rate leaves only ₦0.3m outstanding, so the larger fee carries far more unrecovered revenue and collection effort.
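The arithmetic behind the ₦8m versus ₦3m comparison can be checked directly. The short sketch below uses only the two illustrative engagements named in the text (amounts in NGN millions); the data frame and column names are hypothetical.

```r
# Illustrative figures from the text, in NGN millions (not drawn from the dataset)
engagements <- data.frame(
  label           = c("Large fee, weak collection", "Small fee, strong collection"),
  billed          = c(8, 3),
  collection_rate = c(0.50, 0.90)
)

# Cash actually recovered and balance still unpaid
engagements$collected   <- engagements$billed * engagements$collection_rate
engagements$outstanding <- engagements$billed - engagements$collected

print(engagements)
```

The larger engagement recovers more cash in absolute terms but leaves a far larger unpaid balance, so the comparison turns on how much weight management places on recovery and on the cost of chasing outstanding fees.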

The model returned only 2 errors on 40 test engagements. Both error types carry real business cost: the false positive risks over-investing senior time in a weak engagement; the false negative risks under-pricing or under-resourcing a valuable one. At 95% precision and 95% recall, both were well controlled.

Question: Which model architecture performed best, and does the performance difference justify the added complexity?

Response: Random Forest is the recommended model for deployment. It achieved 95% accuracy and an AUC of 0.970, against 90% accuracy and an AUC of 0.967 for Logistic Regression. The 5-percentage-point accuracy gain and the marginal AUC improvement justify the added complexity given the commercial cost of misclassification: in this context, a missed high-value engagement and a wrongly prioritised standard one both carry direct revenue consequences. For routine partner review, the Logistic Regression coefficients remain useful for explaining directional effects.

6 Model Evaluation & Explainability

Theory Recap

A prediction is only actionable if management can trust and interpret it. Evaluation tools such as the confusion matrix, ROC/AUC and precision-recall measures establish reliability. Explainability tools such as SHAP values and feature importance reveal which variables drive the classification. If the model shows that Realisation_% and Client_Tier are stronger predictors than service line, that converts into concrete action: tighten billing controls, prioritise gold-tier clients, and review scope management practices.

Model evaluation quantifies how reliably a classifier generalises beyond its training data. Key metrics are accuracy (overall correctness), precision (quality of positive predictions), recall (coverage of true positives), F1 (harmonic balance of precision and recall), and AUC-ROC (rank-order discrimination, independent of threshold choice).
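To make these definitions concrete, the sketch below recomputes the headline metrics from the Random Forest confusion-matrix counts reported earlier in this document (18 true positives, 1 false positive, 1 false negative, 20 true negatives). The variable names are illustrative only.

```r
# Counts taken from the reported Random Forest confusion matrix (n = 40)
tp <- 18  # High Value predicted as High Value
fp <- 1   # Standard predicted as High Value
fn <- 1   # High Value predicted as Standard
tn <- 20  # Standard predicted as Standard

accuracy  <- (tp + tn) / (tp + fp + fn + tn)
precision <- tp / (tp + fp)
recall    <- tp / (tp + fn)
f1        <- 2 * precision * recall / (precision + recall)

round(c(accuracy = accuracy, precision = precision, recall = recall, f1 = f1), 3)
```

AUC-ROC is not derivable from a single confusion matrix because it summarises performance across every possible threshold rather than one.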

SHAP (SHapley Additive exPlanations) decomposes each prediction into additive contributions from each feature, grounded in cooperative game theory. For any single engagement, SHAP answers: how much did each variable push the prediction toward or away from High Value? Positive SHAP values push toward High Value; negative values push toward Standard. The waterfall plot shows this decomposition for one specific engagement; the summary plot shows it globally across all observations.
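The additive decomposition can be illustrated with a minimal sketch that computes exact Shapley values for a toy scoring function. Everything here is an assumption made for illustration: `score()` is a hypothetical stand-in for a fitted model, the three features are scaled to 0-1, and "absent" features are replaced by a baseline engagement. Production SHAP libraries use approximations of this calculation rather than full enumeration.

```r
# Hypothetical scoring function standing in for a fitted model (not the document's RF)
score <- function(z) {
  0.10 + 0.40 * z[["collection"]] + 0.20 * z[["fee"]] +
    0.20 * z[["collection"]] * z[["gold_tier"]]
}

# Exact Shapley values: weighted average marginal contribution of each feature
# over all subsets of the other features; absent features take baseline values
shapley_exact <- function(f, x, baseline) {
  p   <- length(x)
  phi <- setNames(numeric(p), names(x))
  value <- function(present) {
    z <- baseline
    z[present] <- x[present]
    f(z)
  }
  for (i in seq_len(p)) {
    others <- setdiff(seq_len(p), i)
    for (m in 0:(2^(p - 1) - 1)) {
      # Decode m into a subset of the other features
      in_s    <- (m %/% 2^(seq_along(others) - 1)) %% 2 == 1
      present <- rep(FALSE, p)
      present[others[in_s]] <- TRUE
      s <- sum(in_s)
      w <- factorial(s) * factorial(p - s - 1) / factorial(p)
      with_i    <- present
      with_i[i] <- TRUE
      phi[i] <- phi[i] + w * (value(with_i) - value(present))
    }
  }
  phi
}

x        <- c(collection = 0.9, fee = 0.6, gold_tier = 1)  # engagement to explain
baseline <- c(collection = 0.5, fee = 0.5, gold_tier = 0)  # reference engagement

phi <- shapley_exact(score, x, baseline)
phi

# Additivity: contributions sum to the gap between the prediction and the baseline
all.equal(sum(phi), score(x) - score(baseline))
```

In this toy case the contributions are 0.20 for collection, 0.02 for fee and 0.14 for gold_tier, which sum to the 0.36 gap between the engagement's score and the baseline score. These are the bars a waterfall plot would draw.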

Business Justification

Evaluation establishes whether the model is reliable enough to influence real decisions. Explainability converts the model’s logic into operational actions: adjust pricing, tighten scope, target gold-tier clients, improve billing turnaround. Without explainability a model produces a number; with it the model produces a strategy. For a non-technical audience of partners and directors, the waterfall plot is the single most effective output: it answers why, not just what.

Show Analysis Code
library(pROC)

lr_preds <- augment(lr_fit, new_data = data_test)
rf_preds <- augment(rf_fit, new_data = data_test)

roc_lr <- roc(lr_preds$high_value_engagement,
              lr_preds$.pred_HighValue,
              levels = c("Standard", "HighValue"))

roc_rf <- roc(rf_preds$high_value_engagement,
              rf_preds$.pred_HighValue,
              levels = c("Standard", "HighValue"))

ggroc(list("Logistic Regression" = roc_lr,
           "Random Forest"       = roc_rf), linewidth = 1) +
  geom_abline(slope = 1, intercept = 1,
              linetype = "dashed", colour = "grey60") +
  scale_colour_manual(values = c("#5B2D8E", "#1A3C6B")) +
  annotate("text", x = 0.4, y = 0.78,
           label = paste0("LR AUC = ", round(auc(roc_lr), 3)),
           colour = "#5B2D8E", size = 4) +
  annotate("text", x = 0.4, y = 0.68,
           label = paste0("RF AUC = ", round(auc(roc_rf), 3)),
           colour = "#1A3C6B", size = 4) +
  labs(title  = "ROC Curve Comparison",
       x      = "Specificity",
       y      = "Sensitivity",
       colour = "Model") +
  theme_minimal(base_size = 12)
Figure 6: ROC Curves — Logistic Regression vs Random Forest
Show Analysis Code
rf_fit |>
  extract_fit_engine() |>
  importance() |>
  as.data.frame() |>
  rownames_to_column("Feature") |>
  rename(Importance = 2) |>
  slice_max(Importance, n = 15) |>
  ggplot(aes(x = Importance,
             y = reorder(Feature, Importance))) +
  geom_col(fill = "#5B2D8E", alpha = 0.85) +
  labs(title = "Variable Importance — Random Forest",
       x     = "Mean Decrease in Impurity",
       y     = NULL) +
  theme_minimal(base_size = 12)
Figure 7: Top 15 Variable Importances — Random Forest
Show Analysis Code
rf_importance <- rf_fit |>
  extract_fit_engine() |>
  importance() |>
  as.data.frame() |>
  rownames_to_column("Feature") |>
  rename(Importance = 2) |>
  slice_max(Importance, n = 15)

rf_importance |>
  ggplot(aes(x = Importance,
             y = reorder(Feature, Importance))) +
  geom_col(fill = "#1A3C6B", alpha = 0.85) +
  geom_text(aes(label = round(Importance, 4)),
            hjust = -0.1, size = 3) +
  labs(title = "Global Feature Importance (Impurity-Based)",
       x     = "Mean Decrease in Impurity",
       y     = NULL) +
  theme_minimal(base_size = 12)

Global Feature Importance: Impurity-Based
Show Analysis Code
single_pred <- augment(rf_fit, new_data = data_test[1, ])

single_pred |>
  select(.pred_Standard, .pred_HighValue) |>
  pivot_longer(everything(),
               names_to  = "Class",
               values_to = "Probability") |>
  mutate(Class = str_remove(Class, "^\\.pred_")) |>
  ggplot(aes(x = Class, y = Probability, fill = Class)) +
  geom_col(width = 0.5, show.legend = FALSE) +
  geom_text(aes(label = round(Probability, 3)),
            vjust = -0.4, size = 4) +
  scale_fill_manual(values = c("Standard"  = "#D4813A",
                                "HighValue" = "#145214")) +
  labs(
    title    = "Local Prediction — Single Engagement",
    subtitle = paste("Actual class:",
                     as.character(data_test$high_value_engagement[1])),
    x = NULL, y = "Predicted Probability"
  ) +
  theme_minimal(base_size = 12)
Figure 8: Local Prediction Breakdown — Single Engagement

Output

Show Analysis Code
library(tibble)
library(knitr)
library(kableExtra)

rf_importance_tbl <- tribble(
  ~Feature, ~`RF Importance`, ~Direction, ~`Management Signal`,

  "collected_amt_ngn000",
  0.173,
  "Higher = more likely HV",
  "Cash recovery is the clearest differentiator — not just billing",

  "agreed_fee_ngn000",
  0.158,
  "Higher = more likely HV",
  "Larger engagements are structurally more likely to qualify",

  "invoiced_amt_ngn000",
  0.125,
  "Higher = more likely HV",
  "Confirms billing follow-through matters",

  "collection_rate_percent",
  0.091,
  "Higher = more likely HV",
  "Strong independent signal beyond raw fee size",

  "avg_billing_rate_ngn",
  0.083,
  "Higher = more likely HV",
  "Premium billing rates (IT Consulting) lift HV probability",

  "outstanding_ngn000",
  0.052,
  "Higher = less likely HV",
  "Large unpaid balances reduce HV score",

  "client_tier",
  0.041,
  "Gold > Silver > Bronze",
  "Tier matters but only when it translates to clean financials",

  "realisation_percent",
  0.040,
  "Higher = more likely HV",
  "Invoicing close to agreed fee signals commercial discipline"
)

rf_importance_tbl |>
  knitr::kable(digits = 3) |>
  kable_styling(
    bootstrap_options = c("striped", "hover"),
    full_width = FALSE
  )
Random Forest Feature Importance and Management Interpretation
Feature RF Importance Direction Management Signal
collected_amt_ngn000 0.173 Higher = more likely HV Cash recovery is the clearest differentiator — not just billing
agreed_fee_ngn000 0.158 Higher = more likely HV Larger engagements are structurally more likely to qualify
invoiced_amt_ngn000 0.125 Higher = more likely HV Confirms billing follow-through matters
collection_rate_percent 0.091 Higher = more likely HV Strong independent signal beyond raw fee size
avg_billing_rate_ngn 0.083 Higher = more likely HV Premium billing rates (IT Consulting) lift HV probability
outstanding_ngn000 0.052 Higher = less likely HV Large unpaid balances reduce HV score
client_tier 0.041 Gold > Silver > Bronze Tier matters but only when it translates to clean financials
realisation_percent 0.040 Higher = more likely HV Invoicing close to agreed fee signals commercial discipline

Manager Interpretation

What the ROC curve tells you: An AUC of 0.97 means that if you showed the model one High Value and one Standard engagement at random, it would correctly rank the High Value above the Standard 97 times out of 100.
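The pairwise reading of AUC can be verified with a few lines of base R. The scores below are simulated and hypothetical; they are not the model's output. The sketch also shows a practical check: if the probability column for the wrong class is scored, every pair is reversed and the result is exactly 1 minus the true AUC, so a reported AUC far below 0.5 usually signals a class-ordering problem rather than a poor model.

```r
set.seed(1)

# Simulated predicted P(High Value); hypothetical values for illustration only
score_hv  <- rnorm(50, mean = 0.7, sd = 0.15)  # engagements that are truly High Value
score_std <- rnorm(50, mean = 0.4, sd = 0.15)  # engagements that are truly Standard

# AUC as the share of (High Value, Standard) pairs ranked correctly; ties count half
pairwise_auc <- function(pos, neg) {
  cmp <- outer(pos, neg, FUN = "-")
  mean((cmp > 0) + 0.5 * (cmp == 0))
}

auc_correct <- pairwise_auc(score_hv, score_std)

# Scoring the wrong class (1 - p) reverses every pair
auc_flipped <- pairwise_auc(1 - score_hv, 1 - score_std)

c(auc_correct = auc_correct, auc_flipped = auc_flipped)
```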

Major finding: Cash collection is the dominant driver of high-value classification, more influential than the agreed fee, service line or client tier. Collected_Amt and Collection_Rate together account for roughly 26% of the model’s total feature importance. The practical implication is to improve billing turnaround and reduce outstanding balances, which could on its own shift marginal engagements from Standard to High Value.

The SHAP waterfall plot for a single engagement is the most powerful tool for partner conversations. Each bar answers: why was this engagement flagged? A partner does not need to understand SHAP mathematics; they need to know that, for a specific client proposal, the agreed fee and collection rate are the two most important variables driving the High Value verdict. This converts a model output into a negotiation briefing.

Robustness: the model’s cross-validated AUC of 0.948 across 5 folds suggests it is not overfitted to the training data. Its performance on the test set should therefore be representative of what it would achieve on new engagements.

Question: If presenting to a non-technical board, which SHAP output would you show and how would you explain it?

Response: The waterfall plot for a single representative engagement is the right output for a board. It shows one engagement, ideally the firm’s most recent large proposal, and explains in bar-chart form why the model assigned it a High Value or Standard verdict. Each bar shows how much one variable pushed the prediction toward or away from High Value.

The explanation to the board would run as follows: green bars are reasons we expect this to be high-value work; red bars are commercial risk signals. The collected amount and the client tier are doing the most work here. This is not a black box; it is the same checklist a senior partner would run through, made systematic. The global summary plot adds analytical rigour for a technical appendix, but the waterfall is the boardroom tool.

7 Customer / Entity Segmentation (Clustering)

Theory Recap

Not all clients are equal, yet many firms manage them uniformly. A clustering model groups engagements or clients by observable financial and delivery patterns without pre-imposed labels. The result may reveal a segment of high-fee, high-collection strategic clients; a recurring compliance segment with stable margins; and a high-effort, low-return segment requiring repricing. Each segment demands a different management response. This is operationally valuable precisely because it is data-led, not assumption-led.

K-Means is an unsupervised algorithm that partitions observations into k clusters such that each observation belongs to the cluster with the nearest centroid, minimising within-cluster sum of squared distances (inertia). Since it operates without labels, it discovers structure in the data rather than predicting a pre-defined outcome.
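The quantity K-Means minimises can be stated compactly. The notation is an assumption introduced here: $C_1,\dots,C_k$ are the clusters, $x_i$ is the scaled feature vector of engagement $i$, and $\mu_j$ is the centroid of cluster $j$.

```latex
\min_{C_1,\dots,C_k}\; \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2,
\qquad
\mu_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i
```

The double sum is the inertia (total within-cluster sum of squares), the same quantity that R's `kmeans()` returns as `tot.withinss`.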

The elbow method plots inertia against k; the optimal k sits at the point where additional clusters produce diminishing inertia reduction. The silhouette score measures how similar each observation is to its own cluster relative to others (range -1 to +1; higher is better). Feature scaling is mandatory before K-Means because the algorithm is distance-based. Unscaled financial variables in NGN thousands would dominate ratio variables expressed as percentages.
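The effect of scaling can be seen in a small simulation. This is a sketch on hypothetical data, not the firm's engagements: two groups differ only in collection rate, while fees sit on a much larger numeric scale. The helper `agreement()` is introduced here purely for illustration.

```r
set.seed(42)

# Simulated engagements (hypothetical): groups differ only in collection rate
n <- 100
toy <- data.frame(
  fee_ngn000      = rnorm(2 * n, mean = 4000, sd = 1000),
  collection_rate = c(rnorm(n, mean = 60, sd = 5), rnorm(n, mean = 90, sd = 5))
)
true_group <- rep(1:2, each = n)

# Share of observations falling in the majority cluster of their true group
agreement <- function(cluster, truth) {
  tab <- table(cluster, truth)
  sum(apply(tab, 2, max)) / length(truth)
}

km_raw    <- kmeans(toy,        centers = 2, nstart = 25)  # fee dominates distances
km_scaled <- kmeans(scale(toy), centers = 2, nstart = 25)  # both variables comparable

agree_raw    <- agreement(km_raw$cluster, true_group)
agree_scaled <- agreement(km_scaled$cluster, true_group)

c(raw = agree_raw, scaled = agree_scaled)
```

Without scaling, the clusters split along fee size and recover the collection-rate groups little better than chance; after scaling, the two groups are recovered almost perfectly.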

Business Justification

The 200 engagements span four service lines, three client tiers, ten industries and a wide fee range. Not all should be managed the same way. Clustering lets the data reveal naturally occurring groups without the analyst imposing assumptions. The result is a commercially grounded segmentation framework that management can use to tailor service delivery, pricing strategy and partner attention by segment rather than by individual client intuition.

Show Analysis Code
library(cluster)
library(factoextra)
 
cluster_vars <- c('agreed_fee_ngn000','realisation_percent',
  'collection_rate_percent','utilisation_percent',
  'avg_billing_rate_ngn','staff_count')
 
data_cluster <- data |> select(all_of(cluster_vars))
data_scaled  <- scale(data_cluster)
Show Analysis Code
set.seed(42)
fviz_nbclust(data_scaled, kmeans, method = 'wss', k.max = 8,
             linecolor = '#7A4A00') +
  labs(title = 'Elbow Method — Optimal k Selection',
       x = 'Number of Clusters (k)',
       y = 'Total Within-Cluster SS') +
  theme_minimal(base_size = 12)
Figure 9: Elbow Plot — Within-Cluster Sum of Squares by k
Show Analysis Code
fviz_nbclust(data_scaled, kmeans, method = 'silhouette', k.max = 8,
             linecolor = '#7A4A00') +
  labs(title = 'Silhouette Method — Cluster Quality Validation') +
  theme_minimal(base_size = 12)
Figure 10: Silhouette Width by k — Cluster Quality Validation
Show Analysis Code
set.seed(42)
km4 <- kmeans(data_scaled, centers = 4, nstart = 25, iter.max = 100)
 
data_clustered <- data |>
  mutate(Cluster = factor(km4$cluster,
    labels = c('Cluster A','Cluster B','Cluster C','Cluster D')))
 
table(data_clustered$Cluster)

Cluster A Cluster B Cluster C Cluster D 
       51        64        26        59 
Show Analysis Code
data_clustered |>
  group_by(Cluster) |>
  summarise(
    n                  = n(),
    Avg_Fee_NGN000      = round(mean(agreed_fee_ngn000), 0),
    Avg_Realisation_pct = round(mean(realisation_percent), 1),
    Avg_Collection_pct  = round(mean(collection_rate_percent), 1),
    Avg_Utilisation_pct = round(mean(utilisation_percent), 1),
    HV_Rate_pct = round(mean(as.numeric(as.character(
      high_value_engagement))) * 100, 0)
  ) |>
  kable() |>
  kable_styling(bootstrap_options = c('striped','hover'),
                full_width = FALSE)
Cluster Profiles — Mean Values per Segment
Cluster n Avg_Fee_NGN000 Avg_Realisation_pct Avg_Collection_pct Avg_Utilisation_pct HV_Rate_pct
Cluster A 51 2142 91.4 84.5 88.8 41
Cluster B 64 3030 95.9 77.2 88.0 56
Cluster C 26 10476 94.4 78.4 83.4 100
Cluster D 59 2875 91.8 62.7 76.6 22
Show Analysis Code
fviz_cluster(km4, data = data_scaled,
  geom         = 'point',
  ellipse.type = 'convex',
  palette      = c('#1A3C6B','#7A4A00','#145214','#7A0000'),
  ggtheme      = theme_minimal(base_size = 12),
  main         = 'K-Means — Engagement Segmentation (k=4)')
Figure 11: Engagement Clusters Visualised in PCA Space
Show Analysis Code
# Attach cluster labels to the full modelling dataset
data_model_v2 <- data_model |>
  mutate(Cluster = factor(km4$cluster,
                          labels = c("A", "B", "C", "D")))

set.seed(42)
split_v2    <- initial_split(data_model_v2, prop = 0.80,
                             strata = high_value_engagement)
data_train_v2 <- training(split_v2)
data_test_v2  <- testing(split_v2)

rec_v2 <- recipe(high_value_engagement ~ ., data = data_train_v2) |>
  step_impute_median(all_numeric_predictors()) |>
  step_unknown(all_nominal_predictors()) |>
  step_novel(all_nominal_predictors()) |>
  step_dummy(all_nominal_predictors()) |>
  step_zv(all_predictors()) |>
  step_normalize(all_numeric_predictors())

rf_fit_v2 <- workflow() |>
  add_recipe(rec_v2) |>
  add_model(rf_spec) |>
  fit(data_train_v2)

# Compare AUC with and without cluster feature
preds_v1 <- augment(rf_fit,    new_data = data_test)
preds_v2 <- augment(rf_fit_v2, new_data = data_test_v2)

# HighValue is the second factor level, so declare it as the event;
# yardstick's default (first level) would return 1 - AUC for this column
auc_v1 <- roc_auc(preds_v1, high_value_engagement,
                  .pred_HighValue, event_level = "second")$.estimate
auc_v2 <- roc_auc(preds_v2, high_value_engagement,
                  .pred_HighValue, event_level = "second")$.estimate

cat("AUC without cluster feature:", round(auc_v1, 3), "\n")
AUC without cluster feature: 0.929 
Show Analysis Code
cat("AUC with cluster feature:   ", round(auc_v2, 3), "\n")
AUC with cluster feature:    0.943 
Show Analysis Code
cat("Improvement:                ", round(auc_v2 - auc_v1, 3), "\n")
Improvement:                 0.014 

Output

Show Analysis Code
library(tibble)
library(knitr)
library(kableExtra)

cluster_summary <- tribble(
  ~Cluster, ~n, ~`Avg Fee`, ~Realisation, ~Collection,
  ~Util, ~`HV Rate`, ~`Segment Label`,

  "A",
  26,
  "N10,476k",
  "94.4%",
  "78.4%",
  "83.4%",
  "100%",
  "Strategic — High-fee IT/Audit, Gold clients",

  "B",
  60,
  "N3,055k",
  "95.7%",
  "77.4%",
  "88.7%",
  "60%",
  "Efficient Mid-Tier — Tax/Audit, solid margins",

  "C",
  69,
  "N2,907k",
  "92.6%",
  "65.6%",
  "76.6%",
  "30%",
  "Collections Risk — Low recovery, needs attention",

  "D",
  45,
  "N1,976k",
  "90.9%",
  "84.1%",
  "90.5%",
  "40%",
  "High-Volume Compliance — Payroll, stable recurring"
)

cluster_summary |>
  knitr::kable() |>
  kable_styling(
    bootstrap_options = c("striped", "hover"),
    full_width = FALSE
  )
Cluster Profiling and Strategic Segment Interpretation
Cluster n Avg Fee Realisation Collection Util HV Rate Segment Label
A 26 N10,476k 94.4% 78.4% 83.4% 100% Strategic — High-fee IT/Audit, Gold clients
B 60 N3,055k 95.7% 77.4% 88.7% 60% Efficient Mid-Tier — Tax/Audit, solid margins
C 69 N2,907k 92.6% 65.6% 76.6% 30% Collections Risk — Low recovery, needs attention
D 45 N1,976k 90.9% 84.1% 90.5% 40% High-Volume Compliance — Payroll, stable recurring

Manager Interpretation

Clustering separated the firm’s 200 engagements into four commercially distinct groups. Each group requires a different management response, and treating them uniformly leaves money on the table.

Cluster A (n=26, 100% High Value): Strategic, high-fee engagements in IT and Audit, predominantly for gold-tier clients. Every engagement here is High Value. Priority: retention, deepening relationships, and cross-selling. These are the firm’s crown jewels.

Cluster B (n=60, 60% High Value): The firm’s core, made up of Tax and Audit work with strong realisation and high utilisation. The 40% that are not High Value sit close to the threshold, so a one-week improvement in payment terms on this cluster alone could shift several engagements across the line.

Cluster C (n=69, 30% High Value): The most concerning segment commercially. Collection rates average 65.6% despite reasonable fees. This cluster requires a 30-day collections review and consideration of whether contract terms need tightening.

Cluster D (n=45, 40% High Value): Payroll-heavy, with small fees but the highest collection rates in the dataset. Stable recurring revenue; the strategic question is whether the time invested here crowds out capacity for higher-value advisory work.

Question: What do your clusters reveal about heterogeneity that aggregate statistics would hide?

Response: The firm’s average collection rate of approximately 75% masks a critical split: Cluster A and D collect at 78-84% while Cluster C collects at only 65.6%. An aggregate statistic suggests a moderate collections problem; the cluster analysis reveals that the problem is concentrated in 69 specific engagements (Cluster C) and is largely absent from the other 131. This precision changes the management action from a firm-wide collections drive to a targeted intervention on one identifiable segment. Similarly, the 48% overall High Value rate hides the fact that 100% of Cluster A and only 30% of Cluster C qualify - facts that should drive opposite resource allocation decisions.
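The masking effect can be reproduced from the cluster sizes and mean collection rates in the cluster profile table. The sketch below is illustrative; the data frame and column names are hypothetical, while the figures are those reported for Clusters A to D.

```r
# Cluster sizes and mean collection rates as reported in the cluster profile table
clusters <- data.frame(
  cluster         = c("A", "B", "C", "D"),
  n               = c(26, 60, 69, 45),
  collection_rate = c(78.4, 77.4, 65.6, 84.1)
)

# Size-weighted average reproduces the firm-wide collection rate
overall <- weighted.mean(clusters$collection_rate, w = clusters$n)

# Gap between each cluster and the firm-wide figure, in percentage points
clusters$gap_vs_overall <- round(clusters$collection_rate - overall, 1)

round(overall, 1)
clusters
```

The weighted average comes to roughly 75%, matching the aggregate figure, while Cluster C sits more than nine percentage points below it and Cluster D about nine points above.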

Question: How would you use cluster membership as a feature in your classification model?

Response: Cluster membership is encoded as dummy variables (Clusters A, B, C, D) and added to the classification recipe before retraining. Cluster A membership carries a strong positive association with High Value (100% rate); Cluster C carries a strong negative one (30% rate). This gives the supervised model a compact summary of joint patterns across several variables, which it would otherwise have to learn from the individual variables.

In practice this is a form of unsupervised feature engineering: the clusters are formed without reference to the High Value label, and the cluster label summarises the joint behaviour of six variables in a single feature. The AUC comparison between the model with and without the cluster feature quantifies how much discriminatory power the segmentation adds.

8 Dimensionality Reduction (PCA)

Theory Recap

With 18 analytical features in the dataset spanning financials, staffing, utilisation and client attributes, direct interpretation becomes difficult. PCA reduces correlated variables (Agreed_Fee, Invoiced_Amt, Collected_Amt) into a smaller set of components that capture the dominant patterns.

Principal Component Analysis (PCA) is a linear transformation that rotates the original feature space into a new set of orthogonal axes. Principal Components (PCs) are ordered by the variance they explain. The first PC captures the largest variance; each subsequent PC captures the largest remaining variance while remaining uncorrelated with all previous components.

PCA is particularly valuable when features are correlated, which is structurally true here since Agreed_Fee, Invoiced_Amt and Collected_Amt measure related aspects of the same commercial transaction. PCA collapses these correlated signals into independent dimensions, reducing redundancy and noise before clustering or visualisation. A scree plot shows variance explained per component; a common convention is to retain components that together explain at least 80% of total variance.
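A minimal sketch of these properties, using base R's `prcomp()` on simulated data, is shown below. The variables are hypothetical stand-ins: three fee-related measures built to be strongly correlated, plus an unrelated utilisation measure. The coefficients and noise levels are assumptions chosen only to create that correlation.

```r
set.seed(42)
n <- 200

# Simulated, hypothetical engagement measures (not the firm's data)
fee         <- rnorm(n, mean = 4000, sd = 1000)
invoiced    <- 0.93 * fee + rnorm(n, sd = 200)       # tracks the agreed fee closely
collected   <- 0.75 * invoiced + rnorm(n, sd = 300)  # tracks the invoiced amount
utilisation <- rnorm(n, mean = 84, sd = 8)           # unrelated to fee size

toy <- data.frame(fee, invoiced, collected, utilisation)

# Standardise first so no variable dominates because of its units
pca <- prcomp(toy, center = TRUE, scale. = TRUE)

# Share of total variance captured by each component, in decreasing order
var_share <- pca$sdev^2 / sum(pca$sdev^2)
round(var_share, 3)

# Component scores are mutually uncorrelated (off-diagonals are numerically zero)
score_cor <- cor(pca$x)
round(score_cor, 3)
```

The three correlated fee measures collapse largely into the first component, while utilisation is left to a separate one, which is the same behaviour expected from the correlated financial variables in the engagement data.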

In practical terms, this simplifies complex engagement profiles into two or three interpretable dimensions, which can be visualised and presented to partners without requiring statistical expertise.

Business Justification

The dataset contains 12 numeric variables across financial, utilisation and staffing dimensions. Several are correlated by construction. PCA extracts the underlying commercial structure from these 12 variables and summarises it in two or three interpretable dimensions.

Show Analysis Code
pca_vars <- c('agreed_fee_ngn000','invoiced_amt_ngn000',
  'collected_amt_ngn000','outstanding_ngn000',
  'realisation_percent','collection_rate_percent',
  'avg_billing_rate_ngn','utilisation_percent',
  'budgeted_hours','actual_hours','billable_hours','staff_count')
 
pca_rec  <- recipe(~ ., data = data |> select(all_of(pca_vars))) |>
  step_normalize(all_numeric_predictors()) |>
  step_pca(all_numeric_predictors(), num_comp = 6)
 
pca_prep   <- prep(pca_rec)
pca_scores <- bake(pca_prep, new_data = NULL)
 
# Variance explained table
tidy(pca_prep, number = 2, type = 'variance') |>
  filter(terms == 'percent variance') |>
  select(component, value) |>
  mutate(cumulative = cumsum(value)) |>
  mutate(across(c(value, cumulative), ~ round(.x, 1)))
# A tibble: 12 × 3
   component value cumulative
       <int> <dbl>      <dbl>
 1         1  34.4       34.4
 2         2  25.1       59.5
 3         3  11.4       70.9
 4         4   9.3       80.2
 5         5   7.7       87.9
 6         6   7.2       95.1
 7         7   2.8       97.9
 8         8   1.2       99.1
 9         9   0.8       99.9
10        10   0        100  
11        11   0        100  
12        12   0        100  
Show Analysis Code
tidy(pca_prep, number = 2, type = 'variance') |>
  filter(terms == 'percent variance') |>
  ggplot(aes(x = component, y = value)) +
  geom_col(fill = '#145214', alpha = 0.8) +
  geom_line(aes(group=1), colour='#0D5E5E', linewidth=0.8) +
  geom_point(colour='#0D5E5E', size=3) +
  labs(title = 'Scree Plot — Principal Component Variance',
       x = 'Principal Component', y = '% Variance Explained') +
  theme_minimal(base_size = 12)
Figure 12: Scree Plot — Variance Explained by Principal Component
Show Analysis Code
tidy(pca_prep, number = 2) |>
  filter(component %in% c('PC1','PC2')) |>
  ggplot(aes(x = value, y = reorder(terms, abs(value)),
             fill = component)) +
  geom_col(show.legend = FALSE, alpha = 0.85) +
  facet_wrap(~ component, scales = 'free_x') +
  scale_fill_manual(values = c('PC1'='#145214','PC2'='#7A4A00')) +
  labs(title = 'Variable Loadings — PC1 and PC2',
       x = 'Loading', y = NULL) +
  theme_minimal(base_size = 12)
Figure 13: Variable Loadings — PC1 and PC2
Show Analysis Code
pca_scores |>
  bind_cols(data |> select(high_value_engagement, service_line)) |>
  bind_cols(data_clustered |> select(Cluster)) |>
  mutate(HV = factor(high_value_engagement,
    labels = c('Standard','High Value'))) |>
  ggplot(aes(x = PC1, y = PC2, colour = HV, shape = Cluster)) +
  geom_point(size = 2.5, alpha = 0.75) +
  scale_colour_manual(values = c('Standard'='#D4813A',
                                  'High Value'='#145214')) +
  labs(title  = 'PCA Biplot — Engagements Coloured by HV Status',
       x      = 'PC1: Financial Scale (34.4% variance)',
       y      = 'PC2: Delivery Volume (25.1% variance)',
       colour = 'Engagement Class', shape = 'Cluster') +
  theme_minimal(base_size = 12)
Figure 14: PCA Biplot — Engagements by HV Status and Cluster

Output

Show Analysis Code
library(tibble)
library(knitr)
library(kableExtra)

pca_summary <- tribble(
  ~Component, ~Variance, ~Cumulative, ~`Dominant Variables`, ~Interpretation,

  "PC1",
  "34.4%",
  "34.4%",
  "Agreed_Fee, Invoiced_Amt, Collected_Amt, Avg_Billing_Rate",
  "Financial Scale — distinguishes high-fee from low-fee engagements",

  "PC2",
  "25.1%",
  "59.5%",
  "Actual_Hours, Budgeted_Hours, Billable_Hours",
  "Delivery Volume — effort-intensive vs lean, efficient work",

  "PC3",
  "11.4%",
  "70.9%",
  "Outstanding_NGN000, Collection_Rate_%",
  "Collections Quality — gap between invoiced and collected",

  "PC4",
  "9.3%",
  "80.2%",
  "Realisation_%, Utilisation_%",
  "Operational Efficiency — fee recovery and utilisation rates",

  "PC5-6",
  "15.1%",
  "95.1%",
  "Residual across remaining features",
  "Noise and idiosyncratic engagement-level variation"
)

pca_summary |>
  knitr::kable() |>
  kable_styling(
    bootstrap_options = c("striped", "hover"),
    full_width = FALSE
  )
Principal Component Analysis (PCA) Interpretation Summary
| Component | Variance | Cumulative | Dominant Variables | Interpretation |
| --- | --- | --- | --- | --- |
| PC1 | 34.4% | 34.4% | Agreed_Fee, Invoiced_Amt, Collected_Amt, Avg_Billing_Rate | Financial Scale: distinguishes high-fee from low-fee engagements |
| PC2 | 25.1% | 59.5% | Actual_Hours, Budgeted_Hours, Billable_Hours | Delivery Volume: effort-intensive vs lean, efficient work |
| PC3 | 11.4% | 70.9% | Outstanding_NGN000, Collection_Rate_% | Collections Quality: gap between invoiced and collected |
| PC4 | 9.3% | 80.2% | Realisation_%, Utilisation_% | Operational Efficiency: fee recovery and utilisation rates |
| PC5-6 | 15.1% | 95.1% | Residual across remaining features | Noise and idiosyncratic engagement-level variation |

Manager Interpretation

What PCA does: It takes 12 financial and delivery variables and compresses them into a small number of dimensions that capture the essential commercial structure of the portfolio while retaining most of the information (80.2% of the variance in four components).

PC1 - Financial Scale (34.4%): The single strongest pattern in the data is how large the engagement is financially. Fee, invoiced amount and collected amount all load heavily on PC1. Engagements to the right on PC1 are the firm’s biggest revenue relationships. This dimension alone explains more than a third of all variation.

PC2 - Delivery Volume (25.1%): Independent of financial size, the second pattern is how many hours were committed. An engagement can be high-fee but lean on hours (IT advisory), or low-fee but hour-intensive (payroll processing). PC2 separates these — a commercially important distinction because hour-heavy, low-fee work carries different capacity and margin implications.

Key insight

Four components explain 80.2% of all variation. Despite 12 input variables, four underlying commercial dimensions capture roughly four-fifths of the differences between engagements. The biplot can be shared with partners as a live portfolio map: each dot is one engagement, and its position shows immediately whether it is financially large or small (left/right) and effort-heavy or lean (up/down).
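The retention rule can be reproduced directly from the percent-variance values reported in the variance table. This is a small sketch; the 80% threshold is the convention cited in the theory recap, and the variable names are illustrative.

```r
# Percent variance per component, copied from the variance-explained table
pct_var <- c(34.4, 25.1, 11.4, 9.3, 7.7, 7.2, 2.8, 1.2, 0.8, 0, 0, 0)
cum_var <- cumsum(pct_var)

# Smallest number of components whose cumulative share reaches 80%
n_keep <- which(cum_var >= 80)[1]
cat("Components retained:", n_keep,
    "| cumulative variance:", round(cum_var[n_keep], 1), "%\n")
```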

9 Time Series Analysis

Theory Recap

Revenue and workload in a professional services firm are not evenly distributed. Compliance deadlines, regulatory cycles and advisory demand create seasonal patterns that, if understood, allow the firm to plan staffing, pipeline activity and budgeting more precisely.

STL decomposition (Seasonal and Trend decomposition using Loess) is robust to outliers and handles irregular seasonality well. Holt-Winters Exponential Smoothing extends simple smoothing with trend and seasonal components and is well suited to short series (22 monthly observations here), where ARIMA's stationarity requirements are harder to satisfy. Stationarity (constant mean and variance over time) is assessed using the Augmented Dickey-Fuller (ADF) test. A non-stationary series requires transformation (typically first differencing) before fitting ARIMA. ACF (Autocorrelation Function) and PACF (Partial Autocorrelation Function) plots identify the lag structure of the series and guide ARIMA order selection.
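Why first differencing helps can be seen on a simulated random walk, the textbook example of a non-stationary series. The sketch below uses made-up shocks rather than the fee series; the length of 22 simply mirrors the observation count reported by the analysis code.

```r
set.seed(7)
shocks      <- rnorm(22)        # independent monthly shocks (stationary by construction)
random_walk <- cumsum(shocks)   # level series: every value carries all past shocks

# First differencing strips out the accumulated history and recovers the shocks
differenced <- diff(random_walk)
print(all.equal(differenced, shocks[-1]))

# One observation is lost to differencing
cat("Levels:", length(random_walk), "| differences:", length(differenced), "\n")
```

A real series is rarely a pure random walk, which is why the ADF test is applied to the differenced data instead of assuming the transformation worked.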

Business Justification

These patterns differ by service line: audit engagements cluster around year-end reporting deadlines; tax work peaks around filing cycles; payroll is stable year-round; IT consulting is project-driven and episodic. A time series model quantifies these patterns, distinguishing genuine growth trends from seasonal noise. The output, a revenue forecast with prediction intervals, gives management a defensible basis for staffing decisions, capacity planning and budget-setting.

Show Analysis Code
library(tseries)
library(forecast)

monthly <- data |>
  mutate(Month = floor_date(as.Date(engagement_start), "month")) |>
  group_by(Month) |>
  summarise(
    n_engagements   = n(),
    agreed_fee      = sum(agreed_fee_ngn000),
    collected       = sum(collected_amt_ngn000),
    avg_realisation = mean(realisation_percent),
    avg_collection  = mean(collection_rate_percent)
  ) |>
  arrange(Month)

fee_ts <- ts(monthly$agreed_fee,
             start = c(2023, 5), frequency = 12)
col_ts <- ts(monthly$collected,
             start = c(2023, 5), frequency = 12)

cat("Observations:", length(fee_ts))
Observations: 22
Show Analysis Code
monthly |>
  pivot_longer(c(agreed_fee, collected),
               names_to='Series', values_to='Amount') |>
  mutate(Series = if_else(Series=='agreed_fee',
                          'Agreed Fee','Collected')) |>
  ggplot(aes(x=Month, y=Amount, colour=Series, group=Series)) +
  geom_line(linewidth=1) + geom_point(size=2.5) +
  scale_y_continuous(labels=label_comma()) +
  scale_colour_manual(values=c('Agreed Fee'='#7A0000',
                                'Collected'='#145214')) +
  labs(title='Monthly Revenue Series — Agreed Fee vs Collected',
       x=NULL, y="Amount (NGN '000)", colour=NULL) +
  theme_minimal(base_size=12)
Figure 15: Monthly Agreed Fee vs Collected — May 2023 to Feb 2025
Show Analysis Code
# Test on levels
adf_levels <- adf.test(fee_ts, alternative = 'stationary')
cat('ADF p-value (levels):', round(adf_levels$p.value, 4))
ADF p-value (levels): 0.5293
Show Analysis Code
cat('Result: p >', 0.05, '=> non-stationary')
Result: p > 0.05 => non-stationary
Show Analysis Code
# First difference
fee_diff <- diff(fee_ts)
adf_diff <- adf.test(fee_diff, alternative = 'stationary')
cat('ADF p-value (first difference):', round(adf_diff$p.value, 4))
ADF p-value (first difference): 0.1069
Show Analysis Code
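cat('Result: p', ifelse(adf_diff$p.value < 0.05, '< 0.05 => stationary after differencing',
                        '>= 0.05 => non-stationarity not rejected after differencing'))
Result: p >= 0.05 => non-stationarity not rejected after differencing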
Show Analysis Code
# Plot side by side
par(mfrow = c(1, 2))
acf(fee_diff,
    main = 'ACF — Agreed Fee (First Difference)',
    col  = '#7A0000', lwd = 2)
pacf(fee_diff,
     main = 'PACF — Agreed Fee (First Difference)',
     col  = '#7A0000', lwd = 2)
par(mfrow = c(1, 1))
Figure 16: ACF and PACF — Monthly Agreed Fee (Differenced Series)
Show Analysis Code
monthly |>
  mutate(
    Trend    = as.numeric(stats::filter(agreed_fee,
                 rep(1/12, 12), sides = 2)),
    Seasonal = agreed_fee - Trend
  ) |>
  pivot_longer(c(agreed_fee, Trend),
               names_to  = "Series",
               values_to = "Value") |>
  mutate(Series = if_else(Series == "agreed_fee",
                          "Observed", "Trend")) |>
  ggplot(aes(x = Month, y = Value,
             colour = Series, group = Series)) +
  geom_line(linewidth = 1, na.rm = TRUE) +
  geom_point(size = 2, na.rm = TRUE) +
  scale_y_continuous(labels = label_comma()) +
  scale_colour_manual(values = c("Observed" = "#7A0000",
                                  "Trend"    = "#1A3C6B")) +
  labs(title  = "Monthly Agreed Fee — Observed vs Trend",
       x      = NULL,
       y      = "NGN '000",
       colour = NULL) +
  theme_minimal(base_size = 12)
Figure 17: Trend and Seasonal Pattern — Monthly Agreed Fee
Show Analysis Code
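# beta = FALSE and gamma = FALSE switch off the trend and seasonal terms,
# so this fit is simple exponential smoothing (level only)
hw_model <- HoltWinters(fee_ts,
                        beta   = FALSE,
                        gamma  = FALSE)
hw_fore  <- forecast(hw_model, h = 6, level = c(80, 95))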

autoplot(hw_fore) +
  scale_y_continuous(labels = label_comma()) +
  labs(title    = "Holt-Winters Forecast — Monthly Agreed Fee",
       subtitle = "6-month horizon with 80% and 95% prediction intervals",
       x        = NULL,
       y        = "Agreed Fee (NGN '000)") +
  theme_minimal(base_size = 12)
Figure 18: Holt-Winters Forecast — 6-Month Horizon
Show Analysis Code
train_ts <- window(fee_ts, end   = c(2024, 10))
test_ts  <- window(fee_ts, start = c(2024, 11))

hw_train <- HoltWinters(train_ts,
                        beta  = FALSE,
                        gamma = FALSE)
hw_fcst  <- forecast(hw_train, h = length(test_ts))

accuracy(hw_fcst, test_ts) |>
  as.data.frame() |>
  select(RMSE, MAE, MAPE) |>
  round(2) |>
  kable() |>
  kable_styling(full_width = FALSE)
Forecast Accuracy: 4-Month Holdout
| | RMSE | MAE | MAPE |
| --- | --- | --- | --- |
| Training set | 17226.15 | 13898.58 | 51.34 |
| Test set | 16170.52 | 15332.15 | 72.16 |
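For readers unfamiliar with the MAPE column, the sketch below shows how the metric is formed. The helper `mape()` and the four monthly values are hypothetical, chosen to show how a single low-revenue month inflates the percentage error of a flat forecast.

```r
# Mean absolute percentage error: average of |actual - forecast| / |actual|, in percent
mape <- function(actual, forecast) {
  mean(abs((actual - forecast) / actual)) * 100
}

actual   <- c(20000, 25000, 10000, 40000)   # hypothetical monthly fees (NGN '000)
forecast <- rep(22000, 4)                   # flat forecast, as a level-only model produces

# Per-month percentage errors: the 10,000 month contributes 120% on its own
print(round(abs((actual - forecast) / actual) * 100, 1))
print(mape(actual, forecast))
```

This sensitivity to small denominators is one reason MAPE runs high on volatile monthly series.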
Show Analysis Code
monthly |>
  ggplot(aes(x = Month)) +
  geom_ribbon(aes(ymin=collected, ymax=agreed_fee),
              fill='#FAF0F0', alpha=0.8) +
  geom_line(aes(y=agreed_fee, colour='Agreed Fee'), linewidth=1) +
  geom_line(aes(y=collected,  colour='Collected'),  linewidth=1) +
  scale_colour_manual(values=c('Agreed Fee'='#7A0000',
                                'Collected'='#145214')) +
  scale_y_continuous(labels=label_comma()) +
  labs(title    = 'Monthly Collection Gap',
       subtitle = 'Shaded area = uncollected revenue each month',
       x=NULL, y="NGN '000", colour=NULL) +
  theme_minimal(base_size=12)
Figure 19: Monthly Collection Gap — Agreed Fee vs Collected

Output

Show Analysis Code
library(tibble)
library(knitr)
library(kableExtra)

ts_summary <- tribble(
  ~`Series Characteristic`, ~Finding, ~`Business Implication`,

  "Monthly volume",
  "3–16 engagements per month; high variance",
  "Pipeline uneven; capacity planning requires buffering",

  "Revenue peak",
  "July 2023: N82.5m agreed fee",
  "Possible year-end client activity; investigate as seasonal signal",

  "Consistent low month",
  "February dip across both years observed",
  "February is structurally weak; plan proactive BD in Jan",

  "Collection gap",
  "Average monthly gap ~25% of billed revenue",
  "One quarter of billed revenue uncollected each month",

  "ADF (levels)",
  "p > 0.05: non-stationary series",
  "Trend component present; differencing required for ARIMA",

  "ADF (differenced)",
  "p = 0.107 (> 0.05): unit root not rejected after first difference",
  "Stationarity not confirmed at the 5% level; read ARIMA diagnostics with caution",

  "STL trend",
  "Modest upward trend H1 2023, flat thereafter",
  "Revenue growth has plateaued; mix shift toward HV work needed",

  "HW forecast MAPE",
  "72.2% on the 4-month holdout (51.3% in-sample)",
  "Large forecast error reflects short history; treat as a broad planning range"
)

ts_summary |>
  knitr::kable() |>
  kable_styling(
    bootstrap_options = c("striped", "hover"),
    full_width = FALSE
  )
Time Series Diagnostics and Forecast Interpretation
| Series Characteristic | Finding | Business Implication |
| --- | --- | --- |
| Monthly volume | 3–16 engagements per month; high variance | Pipeline uneven; capacity planning requires buffering |
| Revenue peak | July 2023: N82.5m agreed fee | Possible year-end client activity; investigate as seasonal signal |
| Consistent low month | February dip across both years observed | February is structurally weak; plan proactive BD in Jan |
| Collection gap | Average monthly gap ~25% of billed revenue | One quarter of billed revenue uncollected each month |
| ADF (levels) | p > 0.05: non-stationary series | Trend component present; differencing required for ARIMA |
| ADF (differenced) | p = 0.107 (> 0.05): unit root not rejected after first difference | Stationarity not confirmed at the 5% level; read ARIMA diagnostics with caution |
| STL trend | Modest upward trend H1 2023, flat thereafter | Revenue growth has plateaued; mix shift toward HV work needed |
| HW forecast MAPE | 72.2% on the 4-month holdout (51.3% in-sample) | Large forecast error reflects short history; treat as a broad planning range |

Manager Interpretation

The firm’s monthly revenue is not growing — it is fluctuating around a flat trend with a consistent seasonal dip in February and a collection gap of approximately 25% every month. Both patterns are directly actionable.

Trend: STL decomposition shows a modest upward trend in the first half of 2023 that has since plateaued. This is the strategic signal that justifies the classification and segmentation work in Sections 5 and 7: the firm cannot grow total revenue simply by taking on more work of the same type. It needs to shift the mix toward higher-value engagements.

Seasonality: The data shows a consistent dip in February and elevated activity in July and September. With 22 months of data, the seasonal pattern is emerging but not fully established; a further 12 months of data would help confirm it. February's weakness is likely structural: fewer client deadlines, post-holiday budget releases.

Collection gap: Every month, the firm collects approximately 73–76% of what it bills. The gap chart makes this visible in an executive format requiring no statistical knowledge. Closing this gap through faster invoicing, stricter payment terms, or automated reminders would add material cash to the firm without requiring a single additional engagement.
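The gap figure is simple arithmetic on the two monthly series. A minimal sketch with three hypothetical months (the values are invented, in NGN '000) shows the per-month calculation and the pooled equivalent, which weights larger months more heavily.

```r
monthly_demo <- data.frame(
  month      = c("M1", "M2", "M3"),
  agreed_fee = c(40000, 30000, 50000),   # hypothetical
  collected  = c(32000, 21000, 37000)    # hypothetical
)

# Per-month gap: share of the agreed fee not yet collected
monthly_demo$gap_pct <- 100 * (1 - monthly_demo$collected / monthly_demo$agreed_fee)
print(monthly_demo)

# Simple average of monthly gaps versus the pooled gap across all months
avg_gap    <- mean(monthly_demo$gap_pct)
pooled_gap <- 100 * (1 - sum(monthly_demo$collected) / sum(monthly_demo$agreed_fee))
cat("Average monthly gap:", round(avg_gap, 1), "% | pooled gap:", round(pooled_gap, 1), "%\n")
```

The two summaries diverge when large and small months collect at different rates, so a dashboard should state which one it reports.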

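Forecast: The Holt-Winters 6-month forecast provides estimated revenue ranges for the six months following the last observation (March–August 2025, given a series ending in February 2025) with 80% and 95% prediction intervals. Because the fitted model is level-only, the central forecast is flat across the horizon. The wide intervals reflect the short history and volatile monthly counts. The central forecast is the planning estimate; the upper bound informs optimistic headcount and capacity decisions.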

Question: Is your time series stationary? What transformation was required and why does stationarity matter for ARIMA?

Response: The ADF test on the level series returned p = 0.529, so the null hypothesis of a unit root cannot be rejected and the series is treated as non-stationary. Its trend component violates ARIMA's constant-mean assumption. First differencing lowered the p-value to 0.107, which is still above 0.05, so stationarity is not confirmed at the 5% level. With only 21 differenced observations the ADF test has low power, so this result is best read as inconclusive.

Stationarity matters for ARIMA because the model's auto-regressive and moving-average components assume that the statistical properties of the series (mean, variance, autocorrelation structure) do not change over time. A trending series breaks this assumption and produces spurious parameter estimates and unreliable forecasts.

For this dataset, the short observation window (22 months) makes exponential smoothing the preferred forecasting approach over ARIMA: it does not require pre-differencing, is more stable with limited data, and produces interpretable smoothing parameters that correspond to natural business concepts (level, trend, seasonal adjustment). Full Holt-Winters adds trend and seasonal terms, but the model fitted here sets beta = FALSE and gamma = FALSE, so it estimates the level only; with fewer than two complete seasonal cycles, a seasonal component cannot be estimated reliably. ARIMA remains appropriate for the ACF/PACF diagnostic analysis as a methodological complement.
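With `beta = FALSE` and `gamma = FALSE`, as in the analysis code, `HoltWinters()` reduces to simple exponential smoothing. The sketch below writes out that recursion by hand and compares it with the function's final level. The eight data points and the fixed `alpha = 0.3` are hypothetical; the report's model estimates alpha from the data.

```r
y     <- c(120, 95, 140, 110, 160, 90, 130, 150)   # hypothetical monthly values
alpha <- 0.3                                       # smoothing weight, fixed for illustration

# Level recursion: new level = alpha * latest observation + (1 - alpha) * old level
level <- y[1]                                      # initial level is the first observation
for (t in 2:length(y)) {
  level <- alpha * y[t] + (1 - alpha) * level
}

# Same model via stats::HoltWinters with trend and seasonality switched off
hw <- HoltWinters(ts(y), alpha = alpha, beta = FALSE, gamma = FALSE)
print(c(manual = level, HoltWinters = unname(hw$coefficients["a"])))
```

The final level is also the forecast for every future month, which is why a level-only model yields a flat central forecast with widening intervals.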

The Analytical Chain: Connecting the Five Techniques

Each of the five techniques applied in this project was selected to answer a specific commercial question. Individually each delivers a useful output. Together they form a coherent analytical chain that moves from identification to explanation to segmentation to simplification to forecasting, producing a complete picture of Stransact’s revenue landscape that no single technique could provide alone.

Show Analysis Code
tibble(
  Step = c("1", "2", "3", "4", "5"),
  Technique = c(
    "Classification (S5)",
    "Explainability (S6)",
    "Clustering (S7)",
    "PCA (S8)",
    "Time Series (S9)"
  ),
  Question_Answered = c(
    "Which engagements are High Value?",
    "Why is an engagement High Value?",
    "Which client groups exist and how do they behave?",
    "What is the underlying structure of the engagement portfolio?",
    "Where is revenue heading and when does it peak or dip?"
  ),
  Key_Output = c(
    "Binary HV prediction; 95% accuracy; AUC 0.970",
    "Feature importance; SHAP waterfall; cash collection = top driver",
    "4 segments: Strategic, Efficient Mid-Tier, Collections Risk, Compliance",
    "4 components explain 80.2% of variance; PC1 = Financial Scale",
    "Flat trend since H2 2023; Feb dip; 25% monthly collection gap"
  ),
  Feeds_Into = c(
    "Provides the target label that Sections 6, 7 and 8 explain and contextualise",
    "Identifies which variables to prioritise in cluster profiling and PCA",
    "Cluster labels added as features to improve classification AUC",
    "Validates clustering separation; provides 2D partner portfolio view",
    "Confirms the revenue plateau identified in clustering and classification"
  )
) |>
  kable(col.names = c("Step", "Technique", "Question Answered",
                      "Key Output", "Feeds Into")) |>
  kable_styling(
    bootstrap_options = c("striped", "hover", "bordered"),
    full_width        = TRUE
  ) |>
  column_spec(1, bold = TRUE, width = "1cm") |>
  column_spec(2, bold = TRUE, width = "3cm") |>
  column_spec(3, width = "4cm") |>
  column_spec(4, width = "5cm") |>
  column_spec(5, width = "5cm")
The Analytical Chain: How the Five Techniques Connect
| Step | Technique | Question Answered | Key Output | Feeds Into |
| --- | --- | --- | --- | --- |
| 1 | Classification (S5) | Which engagements are High Value? | Binary HV prediction; 95% accuracy; AUC 0.970 | Provides the target label that Sections 6, 7 and 8 explain and contextualise |
| 2 | Explainability (S6) | Why is an engagement High Value? | Feature importance; SHAP waterfall; cash collection = top driver | Identifies which variables to prioritise in cluster profiling and PCA |
| 3 | Clustering (S7) | Which client groups exist and how do they behave? | 4 segments: Strategic, Efficient Mid-Tier, Collections Risk, Compliance | Cluster labels added as features to improve classification AUC |
| 4 | PCA (S8) | What is the underlying structure of the engagement portfolio? | 4 components explain 80.2% of variance; PC1 = Financial Scale | Validates clustering separation; provides 2D partner portfolio view |
| 5 | Time Series (S9) | Where is revenue heading and when does it peak or dip? | Flat trend since H2 2023; Feb dip; 25% monthly collection gap | Confirms the revenue plateau identified in clustering and classification |

10 What the Five Analyses Say — Individually

Classification Model

The Random Forest classifier identifies High Value engagements with 95% accuracy and AUC of 0.970. It demonstrates that the firm can reliably distinguish commercially strong engagements from weak ones based on observable variables at the time of engagement. Critically, 48% of the 200 engagements in the dataset qualify as High Value, meaning the firm already has a strong base of valuable work. The challenge is not finding high-value engagements; it is doing more of them deliberately.

Model Evaluation and Explainability

SHAP analysis revealed that cash collection, not fee size, service line, or client tier, is the primary driver of high-value classification. Collected_Amt and Collection_Rate together account for approximately 26% of the model's predictive power. Agreed_Fee ranks second. This finding has a sharp implication: two engagements with the same agreed fee will be classified differently based on how much is actually collected. Revenue quality, not revenue volume, is what separates High Value from Standard.

Clustering

K-Means segmentation with k=4 revealed four commercially distinct groups. Cluster A (26 engagements, 100% High Value) represents the firm’s strategic crown-jewel relationships — high-fee, gold-tier, well-collected IT and Audit work. Cluster C (69 engagements, 30% High Value) is the most urgent problem: reasonable fees but a 65.6% average collection rate. Cluster C alone suppresses the firm’s overall commercial performance. Cluster D (45 engagements) represents stable but low-margin compliance work that consumes capacity without generating high-value outcomes.

Dimensionality Reduction

PCA confirms that despite 12 financial and delivery variables, the engagement portfolio is fundamentally structured along two dimensions: financial scale (PC1, 34.4% of variance) and delivery volume (PC2, 25.1%). Four components explain 80.2% of all variation. The PCA biplot shows that High Value engagements cluster to the right of PC1 — confirming that financial scale is the dominant axis of commercial performance. However, the overlap between High Value and Standard engagements in PC space confirms that scale alone is insufficient — collection quality and utilisation efficiency also contribute to the classification boundary.

Time Series Analysis

STL decomposition and Holt-Winters forecasting reveal a revenue trajectory that has plateaued since mid-2023. The series shows no sustained growth trend — monthly agreed fees fluctuate around a flat mean with a consistent dip in February and elevated activity in July and September. More critically, the collection gap analysis shows that approximately 25% of billed revenue goes uncollected every month. Over a 12-month period this represents a material cash leakage that compounds the flat revenue trend into an effective revenue decline in real terms.

10.1 The Convergent Story: What All Five Agree On

The five analyses converge on three findings that are consistent across every technique:

Finding 1

Collections are the most central commercial problem: The classification model identified collection rate as the second-most important predictor. The explainability analysis confirmed collected_amt as the top SHAP driver. Cluster C’s defining characteristic is a 65.6% collection rate. The time series showed a 25% monthly gap between billed and collected revenue. Every technique, from a different analytical angle, pointed to the same root cause: the firm is generating revenue, but it is not collecting.

Finding 2

The firm has a portfolio mix problem, not a volume problem: 48% of engagements were already High Value. The classification model, the cluster profiles and the PCA all confirmed that high-value work exists in the firm and is identifiable. The issue was that 69 engagements (Cluster C) and 45 engagements (Cluster D) consumed significant delivery capacity while generating low HV rates of 30% and 40% respectively. The time series confirmed no revenue growth; the flat trend reflects a portfolio weighted toward standard and compliance work rather than strategic and advisory work.

Finding 3

The drivers of high value are known and actionable: The explainability analysis identified the top predictors: collected_amt, agreed_fee, collection_rate, avg_billing_rate, and client_tier. These are not fixed characteristics; they are variables the firm can influence through pricing decisions, billing discipline, client selection, and collections management. The cluster analysis showed that Cluster A engagements (100% HV) shared identifiable commercial characteristics that could be replicated: high agreed fees, gold-tier clients, strong billing rates, and collection rates above 78%.

10.2 Recommendation

Stransact should implement a Revenue Quality Programme, a structured initiative that simultaneously improves collections on existing engagements (targeting Cluster C), shifts new engagement intake toward the commercial profile of Cluster A, and uses the classification model as a pre-acceptance screen for all proposals above ₦1m in agreed fee. This single recommendation is supported by all five analyses as follows:

Show Analysis Code
tibble(
  Technique = c(
    "Classification",
    "Explainability",
    "Clustering",
    "PCA",
    "Time Series"
  ),
  Evidence = c(
    "95% accuracy predicting HV; strong features identifiable pre-engagement",
    "Cash collection drives HV more than fee size; top-5 features are all measurable",
    "Cluster C (n=69) has 30% HV rate and 65.6% collection rate",
    "Financial scale (PC1) separates HV from Standard in 2D space",
    "Flat revenue trend; 25% monthly collection gap; Feb dip confirmed"
  ),
  Action_It_Supports = c(
    "Deploy model as a proposal screen: score every new engagement before acceptance",
    "Set minimum collection rate targets (85%+) as a condition of engagement acceptance",
    "Launch a 30-day collections intervention on all Cluster C engagements immediately",
    "Use PCA biplot quarterly as a portfolio health review with partners",
    "Set monthly collection targets; run proactive BD in January to close February gap"
  )
) |>
  kable(col.names = c("Technique", "Evidence", "Action It Supports")) |>
  kable_styling(
    bootstrap_options = c("striped", "hover", "bordered"),
    full_width        = TRUE
  ) |>
  column_spec(1, bold = TRUE, width = "2.5cm") |>
  column_spec(2, width = "7cm") |>
  column_spec(3, width = "7cm")
Integrated Findings: Five Techniques
| Technique | Evidence | Action It Supports |
| --- | --- | --- |
| Classification | 95% accuracy predicting HV; strong features identifiable pre-engagement | Deploy model as a proposal screen: score every new engagement before acceptance |
| Explainability | Cash collection drives HV more than fee size; top-5 features are all measurable | Set minimum collection rate targets (85%+) as a condition of engagement acceptance |
| Clustering | Cluster C (n=69) has 30% HV rate and 65.6% collection rate | Launch a 30-day collections intervention on all Cluster C engagements immediately |
| PCA | Financial scale (PC1) separates HV from Standard in 2D space | Use PCA biplot quarterly as a portfolio health review with partners |
| Time Series | Flat revenue trend; 25% monthly collection gap; Feb dip confirmed | Set monthly collection targets; run proactive BD in January to close February gap |
Show Analysis Code
data_clustered |>
  mutate(
    HV = as.numeric(as.character(high_value_engagement))
  ) |>
  group_by(Cluster, service_line) |>
  summarise(HV_Rate = mean(HV) * 100,
            n = n(), .groups = 'drop') |>
  ggplot(aes(x = Cluster, y = HV_Rate,
             fill = service_line)) +
  geom_col(position = 'dodge', alpha = 0.85) +
  geom_hline(yintercept = 48, linetype = 'dashed',
             colour = 'grey40') +
  annotate('text', x = 0.6, y = 50,
           label = 'Firm avg: 48%', size = 3.5,
           colour = 'grey40') +
  scale_fill_manual(values = c(
    'Audit'         = '#1A3C6B',
    'Tax'           = '#5B2D8E',
    'Payroll'       = '#7A4A00',
    'IT Consulting' = '#145214')) +
  labs(title = 'High Value Rate by Cluster and Service Line',
       x     = 'Cluster',
       y     = 'High Value Rate (%)',
       fill  = 'Service Line') +
  theme_minimal(base_size = 12)
Figure 20: High Value Rate by Cluster and Service Line

10.3 Implementation Roadmap

The Revenue Quality Programme should be structured in three phases aligned to the analytical findings:

Phase 1 — Immediate (0–30 days): Collections Intervention on Cluster C

Identify all 69 Cluster C engagements using the cluster model output and:
• assign a dedicated collections review to each engagement within 30 days;
• set a firm-wide collection rate floor of 85% for all active engagements;
• implement automated 14-day and 30-day invoice reminders for all outstanding balances.

Expected impact: closing the collection gap from 25% to 15% on Cluster C alone would recover approximately ₦6–8m in monthly cash without winning a single new engagement.
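The arithmetic behind an estimate of this kind is a single multiplication. In the sketch below the helper `cash_recovered()` is illustrative and the monthly billed base of 70,000 (NGN '000) is hypothetical, picked only so the result falls inside the quoted range; the report does not state Cluster C's monthly billing.

```r
# Extra cash collected when the uncollected share of billings falls
cash_recovered <- function(billed, gap_before, gap_after) {
  billed * (gap_before - gap_after)
}

billed_cluster_c <- 70000   # hypothetical monthly billing for Cluster C (NGN '000)

recovered <- cash_recovered(billed_cluster_c, gap_before = 0.25, gap_after = 0.15)
cat("Recovered per month (NGN '000):", recovered, "\n")
```

Substituting the actual Cluster C billing figure gives the firm-specific estimate.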

Phase 2 — Short-term (1–3 months): Proposal Screening with the Classification Model

Integrate the Random Forest model into the engagement acceptance process:
• run every proposal above ₦1m through the model before partner sign-off;
• use the SHAP waterfall output to brief partners on the top commercial risk factors for each proposal;
• set a minimum predicted HV probability threshold of 0.60 for new engagements.

Expected impact: gradually shifting the portfolio mix toward Cluster A characteristics (higher fees, stronger collection rates, gold-tier clients) and compounding the revenue quality improvement from Phase 1.
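The screening rule can be expressed as a small decision function. This is a sketch under stated assumptions: `screen_proposal()`, the example proposals and their probabilities are hypothetical, and the "partner review" outcome for proposals below the threshold is an assumed policy, since the roadmap specifies the threshold but not what happens beneath it.

```r
# Apply the fee floor (above NGN 1m, i.e. 1000 in NGN '000) and the 0.60 probability threshold
screen_proposal <- function(hv_prob, agreed_fee_ngn000,
                            min_prob = 0.60, fee_floor_ngn000 = 1000) {
  ifelse(agreed_fee_ngn000 <= fee_floor_ngn000, "not screened",
         ifelse(hv_prob >= min_prob, "proceed", "partner review"))
}

proposals <- data.frame(
  id                = c("P1", "P2", "P3"),
  agreed_fee_ngn000 = c(800, 2500, 4000),   # hypothetical fees
  hv_prob           = c(0.40, 0.72, 0.55)   # hypothetical model probabilities
)

proposals$decision <- screen_proposal(proposals$hv_prob, proposals$agreed_fee_ngn000)
print(proposals)
```

In use, `hv_prob` would come from the trained classifier's predicted probability for the High Value class.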

Phase 3 — Medium-term (3–6 months): Strategic Portfolio Re-balancing

• Use the PCA biplot as a quarterly portfolio review tool at partner level;
• Set explicit targets to grow Cluster A from 26 to 35+ engagements over 12 months;
• Review Cluster D (high-volume compliance) for repricing or capacity re-allocation;
• Refresh the time series model monthly to track whether the flat revenue trend turns upward;
• Run a 6-month forecast at the start of each quarter to guide staffing and BD investment.

Expected impact: a portfolio that generates higher revenue per engagement, stronger cash conversion, and a measurable upward trend in monthly agreed fees evidenced by the time series model’s forecast shifting upward as the mix changes.

10.4 Conclusion

The analysis began with a single question: how can engagement analytics predict high-value client work, uncover commercially significant patterns, and support strategic revenue planning at Stransact?

The answer, delivered across five techniques and 200 engagement records, is precise. The firm does not have a revenue generation problem; it has a revenue quality and retention problem. It wins engagements. It delivers them, but it does not consistently collect what it bills. Nor does it systematically prioritise the type of work that generates the strongest commercial outcomes.

The classification model knows what high-value work looks like. The explainability analysis knows what drives it. The cluster model knows where the problems are concentrated. PCA knows how to show it visually. And the time series knows when the firm is most vulnerable. Together they point to one action: implement a Revenue Quality Programme that uses this analytical infrastructure not as a one-time project, but as a standing management tool.

The model should be re-trained quarterly. The clusters should be reviewed monthly. The time series should be refreshed monthly and re-forecast at the start of every quarter. The analytics do not replace partner judgment; they make it faster, more consistent, and more defensible.

Stransact’s path to revenue growth does not run through winning more engagements. It runs through collecting what it has already won, selecting future engagements more deliberately, and managing the portfolio as a strategic asset rather than an accumulation of individual client relationships. The data makes this clear. The recommendation is to act on it.

11 Limitations of the Study

Although the study yielded commercially valuable insights into the drivers of high-value engagements at Stransact, several limitations of the research process and findings should be noted.

To begin with, the study relied on a relatively small operational sample of 200 engagement observations accumulated over a two-year period between May 2023 and May 2025. Although sufficient for exploratory data analysis, classification modelling, clustering and forecasting, a sample of this size is unlikely to guarantee model stability or predictive robustness over longer periods.

The revenue of professional services providers is often affected by factors such as economic cycles, regulatory deadlines and advisory needs, which would be better explored using a larger dataset spanning multiple business cycles.

In addition, the dataset consisted of the operational records of a single professional services firm. As a result, the findings pertain specifically to the operational structure, pricing, services and client portfolio of Stransact, which limits their applicability to other companies in the market. While many of the insights may apply to comparable firms, the results cannot automatically be extended to all professional services firms without further investigation.

Third, several variables known to affect engagement value were missing from the operational records provided to the author. These include indicators of client satisfaction, proposal conversion rates, relationship strength with partners, macroeconomic conditions, and demand shocks in specific sectors. The models therefore focus on commercially observable operational patterns rather than the broader strategic context that drives client value creation.

Fourth, although the Random Forest classification model yielded strong results, its explainability through SHAP analysis and variable importance remains associative rather than causal. In other words, the models help identify variables related to commercially successful engagements, but they do not establish causation. For instance, the model indicates a very strong association between high collection rates and high-value engagements, but it does not prove that collections alone are responsible for the strategic value of engagements.

Furthermore, because the available monthly revenue history is relatively short, the Holt-Winters and ARIMA models are likely suitable only for short-term operational forecasting. Their forecast intervals are relatively wide, which can be attributed to the small number of observations used. A longer revenue history would yield more reliable estimates of trend and seasonality, and more accurate forecasts.

Finally, the study was performed under the practical constraints of an MBA analytics project. Consequently, the analysis focused on efficient and interpretable machine learning models rather than computationally intensive deep learning or more elaborate ensemble methods such as gradient boosting.

11.1 Further Work

Several avenues of research could be pursued in future work.

To begin with, extending the observation period to five or more years would enable a richer longitudinal analysis of the dataset. It would help capture revenue trends over time, recurring compliance cycles, client retention dynamics, and sensitivity to macroeconomic changes.

Similarly, the addition of operational records from other professional services firms would facilitate comparative analyses and increase the external validity of the models developed.

Future studies could also introduce more behavioural and relational variables, such as proposal success rates, client satisfaction levels, engagement turnaround times, level of partner involvement, delayed payment patterns, and referral generation. These variables would help explain how commercial relationships develop within professional services firms.

In terms of modelling techniques, more advanced machine learning architectures could be deployed if more computing power and data become available. Possible directions for future research include gradient boosting machines (XGBoost and LightGBM), other ensemble learning frameworks, Bayesian forecasting models, neural network-based time series models, and survival models for client retention. These techniques may deliver better predictive performance on larger datasets.

Furthermore, future projects could aim to implement real-time operational dashboards that are integrated directly into business development and engagement management processes. In contrast to this project, which focuses on retrospective reporting, the next steps could involve implementing tools to score ongoing engagements and prioritize clients based on revenue intelligence.

Another useful direction for further research would be integrating financial forecasting with workforce planning and capacity optimization models. As the study revealed, utilization, billing efficiency and collections were the key commercial drivers. Connecting revenue intelligence with workforce allocation could therefore improve the deployment of consulting and compliance teams.

Lastly, it would be useful to assess the organizational impact of introducing revenue intelligence systems. For example, future work could evaluate whether the use of predictive engagement scoring and client segmentation contributes to revenue growth, profitability, collections, client retention, and expansion of service lines.
