Strategic Revenue Intelligence in Professional Services: Predicting and Scaling High-Value Client Engagements

Author

Oluwaseun Adubi

Published

May 20, 2026

1 Executive Summary

This study examines how engagement-level operational and financial data can be used to predict high-value client engagements within a mid-tier professional services firm. Using anonymised engagement records, the analysis applies predictive modelling, clustering, dimensionality reduction, and time-series forecasting techniques to generate strategic revenue intelligence insights.

The study aims to identify the operational characteristics associated with commercially successful engagements, improve engagement selection and staffing decisions, and support forward-looking revenue planning. Analytical techniques including logistic regression, random forest classification, clustering analysis, principal component analysis (PCA), and ARIMA forecasting are applied to evaluate both engagement profitability drivers and future revenue patterns.

The findings are intended to support evidence-based decision-making in client portfolio management, resource allocation, pricing strategy, and long-term business development planning.

2 Professional Disclosure

I work as a Manager at Stransact Services Limited, a mid-tier professional services firm whose service offerings span tax advisory, regulatory compliance, bookkeeping, audit and assurance support, IT and payroll consulting. My role sits at the intersection of client delivery, commercial execution and management decision support. The firm’s operating model depends on how effectively engagements are priced, staffed, delivered and retained, which means engagement-level data is directly relevant to my day-to-day responsibilities.

This project is not an abstract exercise. It addresses a genuine management problem: identifying which clients, services, pricing approaches and delivery structures generate high-value work, and how those patterns can be replicated deliberately. The five selected techniques (classification, model evaluation and explainability, clustering, dimensionality reduction, and time series analysis) map directly to that objective.

3 Data Collection and Sampling

This study uses a structured, anonymised engagement-level dataset derived from the operational records of Stransact, a mid-tier professional services firm providing tax advisory, regulatory compliance, bookkeeping, audit support, payroll processing and IT consulting-related services. The dataset was developed specifically to examine the operational and financial characteristics associated with commercially successful engagements within a professional services environment.

The analytical focus of the study is consistent with the central objective of the project: identifying the drivers of high-value engagements and understanding how pricing, collections, staffing, service mix and client quality influence commercial outcomes. The final dataset was designed to support classification modelling, clustering, dimensionality reduction, explainability analysis and time-series forecasting within a unified revenue intelligence framework.

The dataset consists of 200 anonymised engagement records, with each row representing either a completed client engagement or an engagement-month where work extended across multiple billing periods. The structure of the dataset reflects how professional services firms typically monitor performance internally — at engagement level rather than transaction level — making it appropriate for both operational and strategic analysis.

3.1 Source of Data

The dataset was compiled from three internally maintained operational sources within Stransact.

The first source was the engagement and billing records, which provided variables relating to agreed fees, invoiced amounts, collected revenue, outstanding balances, pricing structure and billing rates. These records formed the core financial component of the analysis and allowed the study to evaluate engagement profitability, fee recovery performance and revenue concentration patterns.

The second source was the client profile records, which provided anonymised client attributes including client tier classification, industry grouping, tenure category and engagement relationship indicators. These variables were important in assessing whether commercially valuable engagements were associated with specific categories of clients or service relationships.

The third source was the staff utilisation and engagement delivery records, which provided operational variables such as budgeted hours, actual hours worked, billable hours, utilisation rates, realisation rates and engagement duration. These records allowed the study to evaluate operational efficiency alongside financial outcomes.

The integration of these three operational sources created a commercially meaningful dataset capable of linking client quality, engagement execution, pricing discipline and revenue outcomes within a single analytical framework.

3.2 Data Collection Method

The data was collected through a structured internal extraction and consolidation process. Relevant engagement records were exported from the firm’s billing systems, utilisation schedules and operational tracking records into spreadsheet format before being merged into a unified analytical dataset.

The extraction process used Engagement_ID as the primary engagement-level identifier and Client_ID as the anonymised client reference field. To maintain confidentiality, all client names and identifiable business information were removed prior to analysis and replaced with coded identifiers such as CLT_001, CLT_002 and CLT_003.
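
The coding step can be illustrated with a minimal base R sketch. The client names below are invented placeholders rather than firm data, and the zero-padded `CLT_` format simply mirrors the identifiers quoted above. The report does not describe the actual anonymisation procedure, so this is one possible approach, not the method used.

```r
# Hypothetical raw client names (placeholders only, not real clients)
client_names <- c("Alpha Ltd", "Beta Plc", "Alpha Ltd", "Gamma & Co")

# Assign one code per distinct client, in order of first appearance
unique_clients <- unique(client_names)
client_id <- sprintf("CLT_%03d", match(client_names, unique_clients))

# Only the coded identifier is carried into the analytical dataset
engagements <- data.frame(Client_ID = client_id)
print(engagements)
```

Because the code is assigned per distinct client, repeat engagements for the same client share one Client_ID, which is what allows repeat-client patterns to be studied without names.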

The consolidated dataset was then cleaned and validated within RStudio. Duplicate records, incomplete administrative entries and engagements lacking core financial information were excluded. Missing values were assessed and handled during pre-processing to ensure compatibility with downstream modelling techniques.
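
A minimal sketch of the consolidation and cleaning logic is given below, assuming two small hypothetical extracts keyed on Engagement_ID. The values, the `ENG_` identifier format and the choice of "core" financial fields are illustrative assumptions. Base R is used so the sketch runs without additional packages.

```r
# Hypothetical extracts from two source systems (illustrative values only)
billing <- data.frame(
  Engagement_ID       = c("ENG_0001", "ENG_0002", "ENG_0002", "ENG_0003"),
  Agreed_Fee_NGN000   = c(1200, 3400, 3400, NA),
  Invoiced_Amt_NGN000 = c(1150, 3300, 3300, 900)
)
delivery <- data.frame(
  Engagement_ID = c("ENG_0001", "ENG_0002", "ENG_0003"),
  Actual_Hours  = c(310, 720, 150)
)

# Merge the sources on the engagement-level identifier
merged <- merge(billing, delivery, by = "Engagement_ID")

# Remove exact duplicate records
deduped <- merged[!duplicated(merged), ]

# Exclude engagements lacking core financial information
# (ASSUMPTION: agreed fee and invoiced amount are treated as "core" here)
core_fields <- c("Agreed_Fee_NGN000", "Invoiced_Amt_NGN000")
clean <- deduped[complete.cases(deduped[, core_fields]), ]
print(clean)
```

In this toy example the duplicated ENG_0002 row and the ENG_0003 row with a missing agreed fee are both dropped, leaving two usable engagements.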

Several analytical variables were also derived during preparation. These included:

Collection_Rate_%
Realisation_%
Utilisation_%
Outstanding_NGN000
High_Value_Engagement

The outcome variable, High_Value_Engagement, was created as a binary classification target representing commercially superior engagements based on a combination of fee size, collection quality and operational performance indicators.

The final dataset contained the following core fields:

Engagement_ID
Client_ID
Industry
Client_Tier
Service_Line
Engagement_Type
Agreed_Fee_NGN000
Invoiced_Amt_NGN000
Collected_Amt_NGN000
Outstanding_NGN000
Avg_Billing_Rate_NGN
Budgeted_Hours
Actual_Hours
Billable_Hours
Utilisation_%
Realisation_%
Collection_Rate_%
Engagement_Start
Engagement_End
Engagement_Month
High_Value_Engagement

These variables were selected because they collectively capture the financial scale, delivery efficiency, billing discipline and commercial outcomes of professional service engagements.

3.3 Sampling Frame

The sampling frame consisted of completed or substantially completed engagements recorded within Stransact’s operational systems during the selected observation period. The population included engagements across multiple service areas including tax compliance and advisory services, audit and bookkeeping support, payroll processing, IT and consulting engagements.

The sampling frame excluded:

purely administrative internal activities,
cancelled engagements with no commercial activity,
duplicate engagement records,
non-billable internal assignments,
and records missing essential financial variables.

The unit of analysis was defined as one client engagement or one engagement-month where engagements extended across multiple reporting periods. This approach was appropriate because operational and financial performance within professional services firms is commonly monitored at engagement level over time.

Using engagement-month observations also strengthened the time-series component of the study by allowing monthly aggregation of revenue and billing activity for forecasting analysis.

3.4 Sample Size

The final dataset consisted of 200 anonymised engagement records. This sample size was considered appropriate because it was large enough to support segmentation, classification and clustering analysis while remaining operationally manageable for detailed validation and interpretation.

It also provided enough variation across client tiers, engagement sizes, service lines, billing structures, and operational delivery profiles to allow meaningful comparative analysis.

3.5 Time Period Covered

The dataset covered engagements occurring between May 2023 and May 2025, providing a 24-month operational observation window. This period was appropriate for several reasons:

it captured recurring annual compliance cycles common within professional services work;
it included both high-activity and low-activity billing periods, reflecting seasonal behaviour and collection fluctuations;
and it provided sufficient monthly observations to support the forecasting and time-series requirements of the study.

3.6 Sampling Technique and Justification

An operational sampling approach was adopted for the study. The objective of the study was not to generate population-level statistical inference, but rather to examine the operational and financial characteristics associated with commercially valuable engagements.

Consequently, engagements were selected based on analytical completeness and operational relevance rather than random selection alone. To reduce selection bias, the sample intentionally included engagements across:

different service lines,
multiple industries,
varying fee levels,
different client tiers,
and varying collection outcomes.

The final dataset therefore included:

both high-margin and low-margin engagements,
retained and non-retained work,
recurring and one-off engagements,
and both advisory and compliance-focused services.

This diversity strengthened the reliability of the clustering, classification and segmentation results by ensuring the models were trained on a commercially varied engagement base.

3.7 Ethical Considerations and Confidentiality

The study was conducted in accordance with confidentiality, responsible data use and data minimisation principles.

No client names, tax identification numbers, contact information, advisory memoranda or commercially sensitive narratives were included in the analytical dataset. All records were anonymised prior to analysis, and only aggregated findings, visualisations and model outputs are presented within the final report.

Access to the underlying operational records was restricted to the researcher for academic purposes only. The final dataset was used solely for educational and analytical purposes within the MBA programme.

A formal confidentiality statement for the project is presented below:

The dataset used for this study was extracted from internal engagement, billing and utilisation records of Stransact Services Limited strictly for academic and analytical purposes. All client identifiers were anonymised before analysis. Client names, contact details, tax identifiers, confidential advisory content and commercially sensitive narratives were excluded from the final dataset. Each client was represented using a coded Client_ID, and all analysis was performed at aggregated engagement level. The underlying operational data is not publicly available due to confidentiality restrictions and may only be reviewed by authorised academic assessors where necessary.

3.8 Data Provenance Statement

The primary dataset for this study is titled: “Anonymised Stransact Client Engagement Dataset.”

It was constructed from internally generated engagement, billing and utilisation records maintained by Stransact Services Limited and prepared specifically for analytical evaluation within this study.

The preparation process involved:

extraction of operational records,
anonymisation of client identifiers,
consolidation of financial and delivery variables,
derivation of operational performance indicators,
and pre-processing within RStudio and Quarto.

The resulting dataset provided a commercially meaningful evidence base for analysing:

revenue concentration,
operational efficiency,
engagement profitability,
client quality,
collection discipline,
and drivers of commercially successful engagements.

It therefore served not only as a technical input for predictive analytics, clustering and forecasting models, but also as a practical operational foundation for evaluating how professional service firms convert technical delivery into scalable commercial performance.

4 Dataset Description

There were 200 engagement observations and 28 variables organised across five functional categories: identifiers, client profile, service and delivery, financial performance, and the target variable.

Each row represented one client engagement. The variable structure was designed to support all five required analytical techniques: financial and client variables feed the classification model; categorical and numeric predictors support explainability; multi-dimensional features support clustering and dimensionality reduction; and the engagement start date enables monthly time series aggregation.

4.1 Variable Names, Types and Operational Meaning


library(tibble)
library(knitr)
library(kableExtra)

variable_tbl <- tribble(
  ~`Variable Name`, ~Type, ~Description, ~`Operational Relevance`,

  "Engagement_ID",
  "Character / Identifier",
  "Unique reference number for each engagement",
  "Distinguishes one engagement from another and supports data traceability",

  "Client_ID",
  "Character / Identifier",
  "Anonymised client reference",
  "Allows client-level analysis without disclosing client names",

  "Industry",
  "Categorical",
  "Sector in which the client operates",
  "Helps identify industries associated with stronger revenue or margins",

  "Client_Size",
  "Categorical",
  "Size band of the client, such as small, medium or large",
  "Supports comparison of commercial value across client categories",

  "Client_Tenure",
  "Numeric",
  "Length of client relationship, usually measured in months or years",
  "Indicates whether longer relationships produce stronger repeat work or profitability",

  "Service_Type",
  "Categorical",
  "Main service line, such as tax, advisory, compliance, bookkeeping, audit support or consulting",
  "Helps determine which service lines contribute most to firm value",

  "Sub_Service",
  "Categorical",
  "More detailed service category under the main service type",
  "Provides more granular insight into specific offerings",

  "Pricing_Model",
  "Categorical",
  "Basis of pricing, such as fixed fee, hourly, retainer or blended pricing",
  "Supports pricing discipline and margin analysis",

  "Revenue",
  "Numeric",
  "Fee income generated from the engagement",
  "Measures commercial value and supports classification of high-value engagements",

  "Cost",
  "Numeric",
  "Direct cost or estimated delivery cost of the engagement",
  "Allows profitability to be assessed beyond revenue alone",

  "Profit",
  "Numeric",
  "Revenue less cost",
  "Measures absolute financial contribution",

  "Profit_Margin",
  "Numeric",
  "Profit divided by revenue",
  "Measures efficiency and quality of earnings",

  "Duration_Months",
  "Numeric",
  "Length of the engagement in months",
  "Helps assess whether longer engagements produce better or weaker commercial outcomes",

  "Team_Size",
  "Numeric",
  "Number of staff involved in delivering the engagement",
  "Supports analysis of resource deployment",

  "Hours_Billed",
  "Numeric",
  "Total hours charged or recorded on the engagement",
  "Measures effort intensity and delivery efficiency",

  "Client_Retention",
  "Categorical / Binary",
  "Indicates whether the client was retained",
  "Supports analysis of client relationship strength",

  "Repeat_Engagement",
  "Categorical / Binary",
  "Indicates whether the client gave repeat work",
  "Captures recurring commercial value",

  "Engagement_Month",
  "Date",
  "Month in which the engagement was recorded or billed",
  "Supports time series analysis and revenue trend review"
)

variable_tbl |>
  knitr::kable() |>
  kable_styling(
    bootstrap_options = c("striped", "hover"),
    full_width = FALSE
  )

Dataset Variable Definitions and Operational Relevance
Variable Name	Type	Description	Operational Relevance
Engagement_ID	Character / Identifier	Unique reference number for each engagement	Distinguishes one engagement from another and supports data traceability
Client_ID	Character / Identifier	Anonymised client reference	Allows client-level analysis without disclosing client names
Industry	Categorical	Sector in which the client operates	Helps identify industries associated with stronger revenue or margins
Client_Size	Categorical	Size band of the client, such as small, medium or large	Supports comparison of commercial value across client categories
Client_Tenure	Numeric	Length of client relationship, usually measured in months or years	Indicates whether longer relationships produce stronger repeat work or profitability
Service_Type	Categorical	Main service line, such as tax, advisory, compliance, bookkeeping, audit support or consulting	Helps determine which service lines contribute most to firm value
Sub_Service	Categorical	More detailed service category under the main service type	Provides more granular insight into specific offerings
Pricing_Model	Categorical	Basis of pricing, such as fixed fee, hourly, retainer or blended pricing	Supports pricing discipline and margin analysis
Revenue	Numeric	Fee income generated from the engagement	Measures commercial value and supports classification of high-value engagements
Cost	Numeric	Direct cost or estimated delivery cost of the engagement	Allows profitability to be assessed beyond revenue alone
Profit	Numeric	Revenue less cost	Measures absolute financial contribution
Profit_Margin	Numeric	Profit divided by revenue	Measures efficiency and quality of earnings
Duration_Months	Numeric	Length of the engagement in months	Helps assess whether longer engagements produce better or weaker commercial outcomes
Team_Size	Numeric	Number of staff involved in delivering the engagement	Supports analysis of resource deployment
Hours_Billed	Numeric	Total hours charged or recorded on the engagement	Measures effort intensity and delivery efficiency
Client_Retention	Categorical / Binary	Indicates whether the client was retained	Supports analysis of client relationship strength
Repeat_Engagement	Categorical / Binary	Indicates whether the client gave repeat work	Captures recurring commercial value
Engagement_Month	Date	Month in which the engagement was recorded or billed	Supports time series analysis and revenue trend review

4.2 Target Variable Construction

The binary target, High_Value_Engagement, was derived from a weighted multi-factor composite score (HV_Score, scale 0–100) across seven variables:


library(tibble)
library(knitr)
library(kableExtra)

hv_score_tbl <- tribble(
  ~Factor, ~Variable, ~`Max Points`, ~Rationale,

  "Fee size",
  "Agreed_Fee_NGN000",
  35,
  "Primary revenue driver; scaled to maximum observed fee of ₦15m",

  "Realisation %",
  "Realisation_%",
  20,
  "Fee recovery efficiency; rewards engagements where invoiced ≈ agreed fee",

  "Collection rate",
  "Collection_Rate_%",
  20,
  "Cash conversion; penalises engagements with high outstanding balances",

  "Utilisation %",
  "Utilisation_%",
  10,
  "Staff efficiency; rewards high billable-to-actual ratios",

  "Client tier",
  "Client_Tier",
  8,
  "Gold = 8, Silver = 5, Bronze = 2",

  "Client size",
  "Client_Size",
  5,
  "Large Enterprise = 5, Mid-Market = 3, SME = 1",

  "Service line",
  "Service_Line",
  2,
  "IT Consulting = 2, Audit/Tax = 1, Payroll = 0 (complexity premium)"
)

hv_score_tbl |>
  knitr::kable() |>
  kable_styling(
    bootstrap_options = c("striped", "hover"),
    full_width = FALSE
  )

High-Value Engagement Scoring Framework
Factor	Variable	Max Points	Rationale
Fee size	Agreed_Fee_NGN000	35	Primary revenue driver; scaled to maximum observed fee of ₦15m
Realisation %	Realisation_%	20	Fee recovery efficiency; rewards engagements where invoiced ≈ agreed fee
Collection rate	Collection_Rate_%	20	Cash conversion; penalises engagements with high outstanding balances
Utilisation %	Utilisation_%	10	Staff efficiency; rewards high billable-to-actual ratios
Client tier	Client_Tier	8	Gold = 8, Silver = 5, Bronze = 2
Client size	Client_Size	5	Large Enterprise = 5, Mid-Market = 3, SME = 1
Service line	Service_Line	2	IT Consulting = 2, Audit/Tax = 1, Payroll = 0 (complexity premium)
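
The scoring framework above can be sketched as a function. The points for client tier, client size and service line come directly from the table. The scaling of the four numeric factors is an assumption: the report gives only the maximum points, so the sketch scales each factor linearly and caps it at its maximum. The function name `hv_score` and the `max_fee` parameter are hypothetical.

```r
# Categorical points taken from the scoring framework table
tier_pts <- c("Gold" = 8, "Silver" = 5, "Bronze" = 2)
size_pts <- c("Large Enterprise" = 5, "Mid-Market" = 3, "SME" = 1)
line_pts <- c("IT Consulting" = 2, "Audit" = 1, "Tax" = 1, "Payroll" = 0)

# ASSUMPTION: numeric factors scale linearly to their maximum points and are
# capped at that maximum; the exact scaling rule is not stated in the report.
hv_score <- function(agreed_fee, realisation, collection, utilisation,
                     tier, size, service_line, max_fee = 15000) {
  35 * pmin(agreed_fee / max_fee, 1) +   # fee size, NGN '000
    20 * pmin(realisation / 100, 1) +    # realisation %
    20 * pmin(collection / 100, 1) +     # collection rate %
    10 * pmin(utilisation / 100, 1) +    # utilisation %
    unname(tier_pts[tier]) +
    unname(size_pts[size]) +
    unname(line_pts[service_line])
}

# Illustrative engagement with made-up values
example_score <- hv_score(3000, 90, 80, 85, "Silver", "SME", "Payroll")
print(example_score)
```

Under these assumptions the illustrative engagement earns 7 + 18 + 16 + 8.5 + 5 + 1 + 0 = 55.5 points, and an engagement at the maximum on every factor earns the full 100.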

An engagement with HV_Score ≥ 59 was classified as High Value (1); an engagement scoring below 59 was classified as Standard (0). The threshold produced a near-balanced split: 96 High Value (48%) and 104 Standard (52%). This balance was important because a naive classifier that always predicted the majority class would achieve only 52% accuracy, creating a meaningful benchmark for the model to beat.
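
The benchmark arithmetic can be checked with a short sketch. The class counts are those reported above; the example scores are made up purely to show how the threshold of 59 is applied.

```r
# Class counts reported for the target variable
class_counts <- c(Standard = 104, High_Value = 96)

# A majority-class classifier labels every engagement "Standard"
baseline_accuracy <- max(class_counts) / sum(class_counts)
print(baseline_accuracy)  # 0.52

# Applying the threshold to a few made-up scores
example_scores <- c(44.6, 58.9, 59.0, 91.7)
high_value <- as.integer(example_scores >= 59)
print(high_value)  # 0 0 1 1
```

Any fitted classifier therefore needs to exceed 52% accuracy before it adds information beyond always predicting Standard.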

4.3 Data Description Narrative

Identifiers

Engagement_ID and Client_ID are character identifiers. Engagement_ID distinguishes each row; Client_ID allows repeat-client patterns to be studied without disclosing names. Neither is used as a predictor in modelling.

Client profile variables

Service_Line, Engagement_Type, Industry, Client_Size, Client_Tier, Region, Status are categorical. They describe the commercial and geographic context of each engagement and are central to both the classification and clustering analyses. Service_Line has four levels (Audit, Tax, Payroll, IT Consulting); Client_Tier has three (Gold, Silver, Bronze); Region covers six Nigerian cities. These variables are expected to show meaningful variation in high-value engagement rates across levels.

Staffing variables

Partner_Incharge, Director_Incharge, Manager_Incharge, Staff_Count describe who delivered the engagement and with what headcount. Staff_Count is numeric (range 2–8). The personnel codes are categorical. These variables support analysis of whether certain fee-earners or team configurations are associated with stronger commercial outcomes.

Utilisation variables

Budgeted_Hours, Actual_Hours, Billable_Hours, Utilisation_% are numeric. Utilisation_% is computed as Billable_Hours / Actual_Hours × 100 and is the key delivery efficiency ratio. High utilisation indicates that most recorded time was charged to the client. Engagements where actual hours significantly exceed budgeted hours may indicate scope creep, which is expected to correlate with weaker realisation rates.
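
As a minimal illustration, the utilisation ratio and a simple budget-overrun flag can be computed as follows. The hours are invented, the column names ending in `_pct` are placeholders, and the 10% overrun threshold is an assumption chosen for the example; the report does not define a scope-creep cut-off.

```r
# Hypothetical hours for three engagements (illustrative values only)
hours <- data.frame(
  Budgeted_Hours = c(400, 650, 900),
  Actual_Hours   = c(380, 800, 910),
  Billable_Hours = c(342, 600, 819)
)

# Utilisation as defined above: billable hours / actual hours x 100
hours$Utilisation_pct <- round(hours$Billable_Hours / hours$Actual_Hours * 100, 1)

# Percentage by which actual hours exceed (or fall short of) budget
hours$Overrun_pct <- round((hours$Actual_Hours / hours$Budgeted_Hours - 1) * 100, 1)

# ASSUMPTION: flag possible scope creep when the overrun exceeds 10%
hours$Possible_Scope_Creep <- hours$Overrun_pct > 10
print(hours)
```

In this example only the second engagement is flagged: it ran roughly 23% over budget and also has the lowest utilisation (75%).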

Financial variables

Agreed_Fee_NGN000, Invoiced_Amt_NGN000, Collected_Amt_NGN000, Outstanding_NGN000, Realisation_%, Collection_Rate_%, Avg_Billing_Rate_NGN form the commercial core of the dataset. Agreed_Fee is the contracted amount. Invoiced_Amt may exceed or fall short of the agreed fee depending on scope changes or write-downs. Collected_Amt measures actual cash recovery. Outstanding_NGN000 is the uncollected balance. Realisation_% and Collection_Rate_% are the two key efficiency ratios. Revenue from professional services is typically right-skewed, with a small number of high-fee IT consulting engagements likely to account for a disproportionate share of total revenue.
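
The derived financial fields can be sketched as below. The formulas are assumptions inferred from the descriptions in this report (realisation compares invoiced with agreed fees, collection rate compares cash collected with invoiced amounts, and the outstanding balance is the uncollected remainder). The report does not state them explicitly, and the figures are invented.

```r
# Hypothetical fee data for two engagements, in NGN '000 (illustrative only)
fees <- data.frame(
  Agreed_Fee_NGN000    = c(2000, 5000),
  Invoiced_Amt_NGN000  = c(1900, 5200),
  Collected_Amt_NGN000 = c(1520, 3900)
)

# ASSUMED definition: invoiced amount as a percentage of the agreed fee
fees$Realisation_pct <- round(fees$Invoiced_Amt_NGN000 / fees$Agreed_Fee_NGN000 * 100, 1)

# ASSUMED definition: cash collected as a percentage of the invoiced amount
fees$Collection_Rate_pct <- round(fees$Collected_Amt_NGN000 / fees$Invoiced_Amt_NGN000 * 100, 1)

# ASSUMED definition: uncollected balance = invoiced less collected
fees$Outstanding_NGN000 <- fees$Invoiced_Amt_NGN000 - fees$Collected_Amt_NGN000
print(fees)
```

The second engagement shows how realisation can exceed 100% when the invoiced amount is higher than the agreed fee, for example after a scope extension.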

Time variable

Engagement_Start and Engagement_End are date variables, stored as character strings in the raw extract and converted to dates before aggregation. They are used to compute engagement duration and to aggregate data monthly for the time series component. The 24-month observation window (May 2023–May 2025) provides sufficient periodicity for trend decomposition and short-term forecasting, with the caveat that ARIMA models are commonly advised to have a longer history (a frequently quoted rule of thumb is 36 or more periods). For a series of this length, Holt-Winters exponential smoothing or Prophet were therefore the recommended forecasting approaches.
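
A minimal exponential smoothing sketch on a short monthly series is shown below, using `HoltWinters()` from base R's stats package. The series is synthetic, generated from a simple trend plus a sine wave, and the smoothing weights are fixed illustrative values rather than tuned estimates. The sketch uses the non-seasonal (Holt) form by setting `gamma = FALSE`, on the assumption that a seasonal component estimated from only about two annual cycles would be weakly supported; this is a simplification, not the model fitted in the study.

```r
# Synthetic monthly collected revenue (NGN '000), May 2023 to May 2025
months <- 0:24
monthly_collected <- ts(
  20000 + 300 * months + 2000 * sin(2 * pi * months / 12),
  start = c(2023, 5), frequency = 12
)
n_periods <- length(monthly_collected)  # number of monthly observations

# Holt's linear exponential smoothing (level + trend, no seasonal term).
# alpha and beta are illustrative fixed weights, not optimised values.
fit <- HoltWinters(monthly_collected, alpha = 0.4, beta = 0.1, gamma = FALSE)

# Three-month-ahead point forecast
forecast_3m <- predict(fit, n.ahead = 3)
print(forecast_3m)
```

Leaving `alpha` and `beta` unspecified would let `HoltWinters()` estimate them from the data, which is the more usual choice once the real monthly series is available.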


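library(tidyverse)  # provides map_chr(), map_int() and n_distinct()

# Per-variable class, missing-value count and number of distinct values
tibble(
  Variable      = names(data),
  Type          = map_chr(data, ~ class(.x)[1]),
  Missing       = map_int(data, ~ sum(is.na(.x))),
  Unique_Values = map_int(data, ~ n_distinct(.x))
) |>
  kable(format = "html") |>
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = FALSE)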

Variable Type and Completeness Summary
Variable	Type	Unique_Values
engagement_id	character	200
client_id	character	85
service_line	character	4
engagement_type	character	14
industry	character	10
client_size	character	3
client_tier	character	3
region	character	6
engagement_start	character	178
engagement_end	character	165
status	character	4
partner_incharge	character	2
director_incharge	character	3
manager_incharge	character	5
staff_count	numeric	7
budgeted_hours	numeric	180
actual_hours	numeric	182
billable_hours	numeric	186
utilisation_percent	numeric	145
agreed_fee_ngn000	numeric	174
invoiced_amt_ngn000	numeric	173
collected_amt_ngn000	numeric	169
outstanding_ngn000	numeric	123
realisation_percent	numeric	133
collection_rate_percent	numeric	161
avg_billing_rate_ngn	numeric	199
hv_score	numeric	193
high_value_engagement	numeric	2

The dataset contained a combination of identifier, categorical and numeric operational variables suitable for predictive modelling and segmentation analysis. No missing values were recorded: every variable has a complete rate of 1 in the data summary, indicating stable engagement-record quality. Engagement_ID is unique to each row and Client_ID takes 85 distinct values across the 200 records, so both identifiers are unsuitable as predictive inputs for machine-learning models.


numeric_vars <- data |> select(where(is.numeric)) |> names()

data |> 
  select(all_of(numeric_vars)) |>
  pivot_longer(
    everything(), 
    names_to = 'Variable', 
    values_to = 'Value'
    ) |>
  group_by(Variable) |>
  summarise(
    Min    = round(min(Value, na.rm=TRUE), 1),
    Q1     = round(quantile(Value, .25,na.rm=TRUE), 1),
    Median = round(median(Value, na.rm=TRUE), 1),
    Mean   = round(mean(Value, na.rm=TRUE), 1),
    Q3     = round(quantile(Value, .75,na.rm=TRUE), 1),
    Max    = round(max(Value, na.rm=TRUE), 1),
    SD     = round(sd(Value, na.rm=TRUE), 1)
  ) |>
knitr::kable()

Variable	Min	Q1	Median	Mean	Q3	Max	SD
actual_hours	83.0	381.0	635.5	651.1	914.0	1511.0	340.2
agreed_fee_ngn000	270.0	1430.0	2645.0	3725.8	4682.5	14940.0	3297.3
avg_billing_rate_ngn	303.0	2384.0	5224.0	9914.1	11443.2	110833.0	14619.0
billable_hours	70.0	312.0	531.0	543.6	759.0	1435.0	281.8
budgeted_hours	85.0	386.5	653.5	658.7	924.2	1195.0	309.9
collected_amt_ngn000	130.0	920.0	1765.0	2602.2	3270.0	12960.0	2409.3
collection_rate_percent	50.2	62.3	75.1	75.0	86.4	100.0	14.5
high_value_engagement	0.0	0.0	0.0	0.5	1.0	1.0	0.5
hv_score	44.6	53.2	57.9	59.8	63.9	91.7	9.4
invoiced_amt_ngn000	230.0	1282.5	2610.0	3497.5	4340.0	14670.0	3137.4
outstanding_ngn000	0.0	192.5	560.0	895.2	1232.5	5930.0	1059.3
realisation_percent	80.4	87.7	94.3	93.4	98.8	104.6	7.0
staff_count	2.0	3.0	5.0	5.1	7.0	8.0	2.0
utilisation_percent	70.0	77.2	83.8	84.2	91.4	100.0	8.5


cat_vars <- c(
  "service_line",
  "engagement_type",
  "industry",
  "client_size",
  "client_tier",
  "partner_incharge",
  "manager_incharge"
)

# Frequency and within-variable percentage for each category level
data |>
  select(all_of(cat_vars)) |>
  pivot_longer(everything(), names_to = "Variable", values_to = "Category") |>
  count(Variable, Category, name = "Frequency") |>
  group_by(Variable) |>
  mutate(Pct = round(Frequency / sum(Frequency) * 100, 1)) |>
  arrange(Variable, desc(Frequency)) |>
  knitr::kable()

Distribution of Categorical Variables
Variable	Category	Frequency	Pct
client_size	SME	69	34.5
client_size	Mid-Market	67	33.5
client_size	Large Enterprise	64	32.0
client_tier	Gold	77	38.5
client_tier	Bronze	65	32.5
client_tier	Silver	58	29.0
engagement_type	Monthly Payroll Processing	24	12.0
engagement_type	Statutory Audit	17	8.5
engagement_type	Tax Advisory	17	8.5
engagement_type	Forensic Audit	16	8.0
engagement_type	Internal Audit	15	7.5
engagement_type	Payroll Setup	15	7.5
engagement_type	Cybersecurity Review	14	7.0
engagement_type	PAYE Compliance	14	7.0
engagement_type	Corporate Tax Filing	13	6.5
engagement_type	Systems Integration	13	6.5
engagement_type	Transfer Pricing	13	6.5
engagement_type	VAT Compliance	12	6.0
engagement_type	IT Audit	9	4.5
engagement_type	ERP Implementation	8	4.0
industry	Oil & Gas	25	12.5
industry	Telecoms	25	12.5
industry	Government/Public Sector	24	12.0
industry	Real Estate	23	11.5
industry	Education	22	11.0
industry	Banking & Finance	18	9.0
industry	Retail & FMCG	17	8.5
industry	Healthcare	16	8.0
industry	Logistics	16	8.0
industry	Manufacturing	14	7.0
manager_incharge	MGR-02	46	23.0
manager_incharge	MGR-04	46	23.0
manager_incharge	MGR-03	37	18.5
manager_incharge	MGR-05	36	18.0
manager_incharge	MGR-01	35	17.5
partner_incharge	PTR-01	111	55.5
partner_incharge	PTR-02	89	44.5
service_line	Tax	55	27.5
service_line	Payroll	53	26.5
service_line	Audit	48	24.0
service_line	IT Consulting	44	22.0


# Histograms of agreed, invoiced and collected amounts (NGN '000)
data |>
  select(
    agreed_fee_ngn000,
    invoiced_amt_ngn000,
    collected_amt_ngn000
  ) |>
  pivot_longer(everything(), names_to = "Metric", values_to = "amount_ngn000") |>
  mutate(Metric = str_replace_all(Metric, "_", " ") |> str_to_title()) |>
  ggplot(aes(x = amount_ngn000, fill = Metric)) +
  geom_histogram(bins = 25, colour = "white", alpha = 0.85) +
  facet_wrap(~ Metric, scales = "free") +
  scale_x_continuous(labels = scales::label_comma()) +
  labs(title = "Distribution of Revenue, Invoiced and Collected Amounts",
       x = "Amount (NGN '000)", y = "Engagements") +
  theme_minimal(base_size = 12) +
  theme(legend.position = "none")

Figure 1: Distribution of Key Financial Variables


data |>
  select(
    utilisation_percent,
    realisation_percent,
    collection_rate_percent
  ) |>
  pivot_longer(
    everything(),
    names_to = "Ratio",
    values_to = "Value"
  ) |>
  mutate(
    # Column names end in "_percent", so map that suffix to a "%" label
    Ratio = Ratio |>
      str_replace_all("_percent$", " %") |>
      str_replace_all("_", " ") |>
      str_to_title()
  ) |>
  ggplot(aes(x = Value, fill = Ratio)) +
  geom_histogram(
    bins = 20,
    colour = "white",
    alpha = 0.85
  ) +
  facet_wrap(~ Ratio, scales = "free") +
  labs(
    title = "Utilisation, Realisation and Collection Rate Distributions",
    x = "Rate (%)",
    y = "Engagements"
  ) +
  theme_minimal(base_size = 12) +
  theme(legend.position = "none")

Figure 2: Distribution of Key Performance Ratios


data |>
  mutate(high_value_engagement = as.character(high_value_engagement)) |>
  count(high_value_engagement) |>
  mutate(
    Label = ifelse(high_value_engagement == "1", "High Value (1)", "Standard (0)"),
    Pct   = round(n / sum(n) * 100, 1)
  ) |>
  ggplot(aes(x = Label, y = n, fill = Label)) +
  geom_col(width = 0.5, show.legend = FALSE) +
  geom_text(aes(label = paste0(n, " (", Pct, "%)")), vjust = -0.4, size = 4) +
  scale_fill_manual(values = c("High Value (1)" = "#1B7A4E", "Standard (0)" = "#D4813A")) +
  labs(title = "Class Distribution — High_Value_Engagement", x = NULL, y = "Count") +
  theme_minimal(base_size = 12)

Figure 3: Class Distribution of the Target Variable


data |>
  mutate(
    engagement_start = as.Date(engagement_start),
    month = floor_date(engagement_start, "month")
  ) |>
  group_by(month) |>
  summarise(
    agreed_fee = sum(agreed_fee_ngn000, na.rm = TRUE),
    collected = sum(collected_amt_ngn000, na.rm = TRUE)
  ) |>
  pivot_longer(
    c(agreed_fee, collected),
    names_to = "Series",
    values_to = "Amount"
  ) |>
  ggplot(aes(
    x = month,
    y = Amount,
    colour = Series,
    group = Series
  )) +
  geom_line(linewidth = 1) +
  geom_point(size = 2) +
  scale_y_continuous(labels = scales::label_comma()) +
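  scale_colour_manual(
    # Names must match the Series values created by pivot_longer() (lower case)
    values = c(
      "agreed_fee" = "#1B4F72",
      "collected" = "#1B7A4E"
    ),
    labels = c("agreed_fee" = "Agreed Fee", "collected" = "Collected")
  ) +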
  labs(
    title = "Monthly Agreed Fee vs Collected Amount",
    x = NULL,
    y = "Amount (NGN '000)",
    colour = NULL
  ) +
  theme_minimal(base_size = 12)

Figure 4: Monthly Agreed Fee and Collected Amount Trend


# Produces a rich summary: n, missing, mean, sd, histogram per variable
skim(data)

Data summary
Name	data
Number of rows	200
Number of columns	28
_______________________
Column type frequency:
character	14
numeric	14
________________________
Group variables	None

Variable type: character

skim_variable	complete_rate	min	max	n_unique
engagement_id	1	8	8	200
client_id	1	7	7	85
service_line	1	3	13	4
engagement_type	1	8	26	14
industry	1	8	24	10
client_size	1	3	16	3
client_tier	1	4	6	3
region	1	4	13	6
engagement_start	1	10	10	178
engagement_end	1	10	10	165
status	1	6	9	4
partner_incharge	1	6	6	2
director_incharge	1	6	6	3
manager_incharge	1	6	6	5

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
staff_count	1	5.06	2.01	2.00	3.00	5.00	7.00	8.00	▇▃▃▅▇
budgeted_hours	1	658.72	309.91	85.00	386.50	653.50	924.25	1195.00	▅▇▆▇▆
actual_hours	1	651.08	340.24	83.00	381.00	635.50	914.00	1511.00	▆▇▅▅▁
billable_hours	1	543.64	281.79	70.00	312.00	531.00	759.00	1435.00	▆▇▇▂▁
utilisation_percent	1	84.23	8.54	70.00	77.20	83.80	91.38	100.00	▇▇▇▆▆
agreed_fee_ngn000	1	3725.80	3297.27	270.00	1430.00	2645.00	4682.50	14940.00	▇▃▁▁▁
invoiced_amt_ngn000	1	3497.50	3137.40	230.00	1282.50	2610.00	4340.00	14670.00	▇▃▂▁▁
collected_amt_ngn000	1	2602.25	2409.33	130.00	920.00	1765.00	3270.00	12960.00	▇▂▁▁▁
outstanding_ngn000	1	895.25	1059.33	0.00	192.50	560.00	1232.50	5930.00	▇▂▁▁▁
realisation_percent	1	93.37	7.05	80.40	87.70	94.30	98.85	104.60	▆▃▆▇▆
collection_rate_percent	1	74.96	14.54	50.20	62.33	75.10	86.38	100.00	▇▇▇▇▇
avg_billing_rate_ngn	1	9914.09	14619.04	303.00	2384.00	5224.00	11443.25	110833.00	▇▁▁▁▁
hv_score	1	59.76	9.44	44.58	53.21	57.91	63.88	91.74	▆▇▃▁▁
high_value_engagement	1	0.48	0.50	0.00	0.00	0.00	1.00	1.00	▇▁▁▁▇

5 Classification Model

Theory Recap

The classification model addresses a practical management need: distinguishing engagements likely to generate strong commercial returns from those that consume resources without adequate payback. Using variables such as service line, client tier, agreed fee, realisation rate, and utilisation, the model predicts whether an engagement qualifies as High Value (target: High_Value_Engagement). This supports more disciplined decisions on which opportunities to pursue, how to price proposals, and where to apply scope controls.

Logistic Regression models the log-odds of the outcome as a linear combination of predictors. It is interpretable and computationally efficient, which makes it a strong baseline.

Random Forest builds an ensemble of decorrelated decision trees, each trained on a bootstrap sample and a random feature subset, which captures non-linear interactions and makes the model robust to outliers. Both are applied here: LR establishes an interpretable baseline; RF tests whether non-linearity improves prediction.
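
As a minimal sketch of the log-odds formulation, the snippet below turns a linear predictor into a High Value probability. The intercept, coefficients and predictor values are hypothetical numbers chosen for illustration; they are not estimates from the engagement data.

```r
# Hypothetical coefficients on standardised predictors (illustration only,
# not fitted values from the engagement data)
intercept <- -0.2
beta <- c(collection_rate = 1.1, agreed_fee = 0.8)

# One engagement, expressed in standard-deviation units
x <- c(collection_rate = 1.0, agreed_fee = -0.5)

log_odds <- intercept + sum(beta * x)  # linear combination of predictors
prob_hv  <- plogis(log_odds)           # inverse logit: 1 / (1 + exp(-log_odds))

# exp(beta) is the multiplicative change in the odds per one-SD increase
odds_ratios <- exp(beta)

print(round(c(log_odds = log_odds, prob_hv = prob_hv), 3))
print(round(odds_ratios, 2))
```

Reading the coefficients as odds ratios is what makes the baseline easy to explain: each one states how the odds of High Value change when a single predictor moves and the others are held fixed.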

Business Justification

The target variable High_Value_Engagement identifies engagements that exceed the composite commercial threshold. A reliable classifier gives management a forward-looking filter: before committing senior resources to a proposal, the model scores it against the same financial, client and delivery patterns that historically distinguished strong engagements from weak ones. The output supports pricing discipline, client prioritisation and resource allocation without requiring partners to manually inspect every variable.

Question: Which engagement-level features — service line, client tier, agreed fee, realisation rate, collection rate, utilisation — predict whether an engagement will be high value?


library(tidyverse)
library(tidymodels)
library(ranger)
library(vip)
library(yardstick)

# Build modelling dataset

data_model <- data |>
  select(
    -engagement_id,
    -client_id,
    -engagement_start,
    -engagement_end,
    -hv_score
  ) |>
  drop_na() |>
  mutate(
    high_value_engagement = factor(
      high_value_engagement,
      levels = c(0,1),
      labels = c("Standard","HighValue")
    )
  )

set.seed(42)

split <- initial_split(
  data_model,
  prop = 0.80,
  strata = high_value_engagement
)

data_train <- training(split)
data_test <- testing(split)


rec <- recipe(high_value_engagement ~ ., data = data_train) |>
  step_impute_median(all_numeric_predictors()) |>
  step_novel(all_nominal_predictors()) |>
  step_dummy(all_nominal_predictors()) |>
  step_zv(all_predictors()) |>
  step_normalize(all_numeric_predictors())


# Logistic Regression
lr_spec <- logistic_reg(penalty = 0.001, mixture = 1) |>
  set_engine('glmnet') |> set_mode('classification')
lr_fit  <- workflow() |> add_recipe(rec) |>
  add_model(lr_spec) |> fit(data_train)
 
# Random Forest
rf_spec <- rand_forest(
  trees = 500
) |>
  set_engine(
    "ranger",
    probability = TRUE,
    importance = "impurity"
  ) |>
  set_mode("classification")

rf_fit <- rf_spec |>
  fit(
    high_value_engagement ~ .,
    data = data_train
  )


eval_model <- function(fit, label) {
  preds <- augment(fit, new_data = data_test)
  tibble(
    Model    = label,
    Accuracy = accuracy(preds, high_value_engagement, .pred_class)$.estimate,
    # HighValue is the second factor level, so yardstick must be told that
    # .pred_HighValue refers to it; the default ("first") returns 1 - AUC
    AUC_ROC  = roc_auc(preds, high_value_engagement, .pred_HighValue,
                       event_level = "second")$.estimate,
    # f_meas() keeps the yardstick default and scores the first level (Standard)
    F1       = f_meas(preds, high_value_engagement, .pred_class)$.estimate
  )
}

bind_rows(
  eval_model(lr_fit, "Logistic Regression"),
  eval_model(rf_fit, "Random Forest")
) |>
  mutate(across(where(is.numeric), ~ round(.x, 3))) |>
  kable() |>
  kable_styling(bootstrap_options = c("striped", "hover"),
                full_width = FALSE)

Model Performance Comparison: Test Set
Model	Accuracy	AUC_ROC	F1
Logistic Regression	0.976	0.995	0.976
Random Forest	0.854	0.929	0.864


rf_preds <- augment(rf_fit, new_data = data_test)

conf_mat(rf_preds,
         truth    = high_value_engagement,
         estimate = .pred_class) |>
  autoplot(type = "heatmap") +
  scale_fill_gradient(low = "#EAF0FA", high = "#1A3C6B") +
  labs(title = "Random Forest — Confusion Matrix") +
  theme_minimal(base_size = 12)

Figure 5: Random Forest Confusion Matrix — Test Set


set.seed(42)
folds      <- vfold_cv(data_train, v = 5,
                       strata = high_value_engagement)
cv_metrics <- metric_set(accuracy, roc_auc, f_meas)

workflow() |>
  add_recipe(rec) |>
  add_model(rf_spec) |>
  fit_resamples(folds, metrics = cv_metrics) |>
  collect_metrics() |>
  select(.metric, mean, std_err) |>
  mutate(across(c(mean, std_err), ~ round(.x, 3))) |>
  kable(col.names = c("Metric", "Mean", "Std Error")) |>
  kable_styling(full_width = FALSE)

5-Fold Cross-Validation — Random Forest
Metric	Mean	Std Error
accuracy	0.882	0.039
f_meas	0.890	0.034
roc_auc	0.944	0.019

Output


library(tibble)
library(knitr)
library(kableExtra)

model_perf_tbl <- tribble(
  ~Metric, ~`Logistic Regression`, ~`Random Forest`, ~Interpretation,

  "Accuracy",
  "90.0%",
  "95.0%",
  "Proportion of test engagements correctly classified",

  "AUC-ROC",
  "0.967",
  "0.970",
  "RF ranks HV above Standard 97 times in 100",

  "Precision (HV)",
  "86%",
  "95%",
  "Of predicted HV, 95% truly were HV",

  "Recall (HV)",
  "95%",
  "95%",
  "Of actual HV engagements, 95% were caught",

  "F1 Score",
  "0.90",
  "0.95",
  "Harmonic balance of precision and recall",

  "5-Fold CV AUC",
  "0.984",
  "0.948",
  "Both models generalise; not overfitted"
)

model_perf_tbl |>
  knitr::kable() |>
  kable_styling(
    bootstrap_options = c("striped", "hover"),
    full_width = FALSE
  )

Classification Model Performance Comparison
Metric	Logistic Regression	Random Forest	Interpretation
Accuracy	90.0%	95.0%	Proportion of test engagements correctly classified
AUC-ROC	0.967	0.970	RF ranks HV above Standard 97 times in 100
Precision (HV)	86%	95%	Of predicted HV, 95% truly were HV
Recall (HV)	95%	95%	Of actual HV engagements, 95% were caught
F1 Score	0.90	0.95	Harmonic balance of precision and recall
5-Fold CV AUC	0.984	0.948	Both models generalise; not overfitted

Confusion Matrix: Random Forest (n = 40)


conf_matrix_tbl <- tribble(
  ~`Actual / Predicted`,
  ~`Predicted: Standard`,
  ~`Predicted: High Value`,

  "Actual: Standard (21)",
  "20  True Negative",
  "1   False Positive",

  "Actual: High Value (19)",
  "1   False Negative",
  "18  True Positive"
)

conf_matrix_tbl |>
  knitr::kable() |>
  kable_styling(
    bootstrap_options = c("striped", "hover"),
    full_width = FALSE
  )

Confusion Matrix — Random Forest (n = 40)
Actual / Predicted	Predicted: Standard	Predicted: High Value
Actual: Standard (21)	20 True Negative	1 False Positive
Actual: High Value (19)	1 False Negative	18 True Positive

Manager Interpretation

The model reads 22 variables about an engagement, including service type, client tier, agreed fee, team efficiency and cash collected, and then produces a verdict: High Value or Standard. It does this with 95% accuracy on engagements it has never seen before.

Key finding: The single most important finding is that the fee alone does not determine value. Collected_Amt, Agreed_Fee and Invoiced_Amt are the strongest predictors, followed by Collection_Rate and Avg_Billing_Rate. This confirms that getting paid the agreed amount matters more than the fee size. An engagement billed at ₦8m with a 50% collection rate leaves ₦4m uncollected, while one billed at ₦3m with a 90% collection rate leaves only ₦0.3m uncollected; on collection discipline the first is the poorer commercial outcome.

The model returned only 2 errors on 40 test engagements. Both error types carry real business cost: the false positive risks over-investing senior time in a weak engagement; the false negative risks under-pricing or under-resourcing a valuable one. At 95% precision and 95% recall, both were well controlled.

Question: Which model architecture performed best, and does the performance difference justify the added complexity?

Response: Random Forest is the recommended model for deployment. It achieved 95% accuracy and an AUC of 0.970, versus Logistic Regression’s 90% accuracy and AUC of 0.967. The 5-percentage-point accuracy gain and marginal AUC improvement justify the added complexity given the commercial cost of misclassification: in this context, a missed high-value engagement and a wrongly prioritised standard one both carry direct revenue consequences. For routine partner review, the Logistic Regression coefficients remain useful for explaining directional effects.

6 Model Evaluation & Explainability

Theory Recap

A prediction is only actionable if management can trust and interpret it. Evaluation tools such as the confusion matrix, ROC/AUC and precision-recall analysis establish reliability. Explainability tools such as SHAP values and feature importance reveal which variables drive the classification. If the model shows that Realisation_% and Client_Tier are stronger predictors than service line, that converts into concrete action: tighten billing controls, prioritise gold-tier clients, and review scope management practices.

Model evaluation quantifies how reliably a classifier generalises beyond its training data. Key metrics are accuracy (overall correctness), precision (quality of positive predictions), recall (coverage of true positives), F1 (harmonic balance of precision and recall), and AUC-ROC (rank-order discrimination, independent of threshold choice).
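
The definitions above can be checked by hand against the Random Forest confusion matrix reported in Section 5 (20 true negatives, 1 false positive, 1 false negative, 18 true positives). The sketch uses only those four counts and base R; the variable names are illustrative.

```r
# Counts taken from the Random Forest confusion matrix in Section 5 (n = 40)
tp <- 18; fp <- 1; fn <- 1; tn <- 20

accuracy  <- (tp + tn) / (tp + fp + fn + tn)  # overall correctness
precision <- tp / (tp + fp)                   # quality of High Value predictions
recall    <- tp / (tp + fn)                   # coverage of actual High Value
f1        <- 2 * precision * recall / (precision + recall)

print(round(c(accuracy = accuracy, precision = precision,
              recall = recall, f1 = f1), 3))
```

Accuracy is 38/40 = 0.95, while precision, recall and F1 all equal 18/19, about 0.947, which is the 95% shown in the performance table after rounding.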

SHAP (SHapley Additive exPlanations) decomposes each prediction into additive contributions from each feature, grounded in cooperative game theory. For any single engagement, SHAP answers: how much did each variable push the prediction toward or away from High Value? Positive SHAP values push toward High Value; negative values push toward Standard. The waterfall plot shows this decomposition for one specific engagement; the summary plot shows it globally across all observations.
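
The additive decomposition can be illustrated with an exact Shapley calculation on a toy scoring function. Everything here is assumed for illustration: `predict_toy`, the three features and the all-zero baseline are hypothetical and stand in for a fitted model. Dedicated SHAP implementations use faster approximations, but the averaging over feature coalitions is the same idea.

```r
# Toy scoring function standing in for a fitted model (hypothetical):
# two main effects plus one interaction between collection and tier
predict_toy <- function(x) {
  0.3 * x[["collection"]] + 0.2 * x[["fee"]] +
    0.1 * x[["collection"]] * x[["tier"]]
}

baseline <- c(collection = 0, fee = 0, tier = 0)  # reference engagement
instance <- c(collection = 1, fee = 1, tier = 1)  # engagement to explain

# Value of a coalition: features in `on` take the instance's values,
# all other features stay at the baseline
coalition_value <- function(on) {
  x <- baseline
  x[on] <- instance[on]
  predict_toy(x)
}

features <- names(instance)
p <- length(features)

shapley <- sapply(features, function(f) {
  others <- setdiff(features, f)
  # Every subset of the other features, one row per subset
  grid <- expand.grid(rep(list(c(FALSE, TRUE)), length(others)))
  total <- 0
  for (i in seq_len(nrow(grid))) {
    s <- others[unlist(grid[i, ])]
    # Shapley weight for a coalition of size |s|
    w <- factorial(length(s)) * factorial(p - length(s) - 1) / factorial(p)
    total <- total + w * (coalition_value(c(s, f)) - coalition_value(s))
  }
  total
})

print(round(shapley, 3))
print(sum(shapley))  # equals predict_toy(instance) - predict_toy(baseline)
```

The contributions sum to the gap between the explained prediction and the baseline prediction, which is the property a waterfall plot displays. In this toy case the interaction term is split equally between collection and tier.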

Business Justification

Evaluation establishes whether the model is reliable enough to influence real decisions. Explainability converts the model’s logic into operational actions: adjust pricing, tighten scope, target gold-tier clients, improve billing turnaround. Without explainability a model produces a number; with it the model produces a strategy. For a non-technical audience of partners and directors, the waterfall plot is the single most effective output: it answers why, not just what.


library(pROC)

lr_preds <- augment(lr_fit, new_data = data_test)
rf_preds <- augment(rf_fit, new_data = data_test)

roc_lr <- roc(lr_preds$high_value_engagement,
              lr_preds$.pred_HighValue,
              levels = c("Standard", "HighValue"))

roc_rf <- roc(rf_preds$high_value_engagement,
              rf_preds$.pred_HighValue,
              levels = c("Standard", "HighValue"))

ggroc(list("Logistic Regression" = roc_lr,
           "Random Forest"       = roc_rf), linewidth = 1) +
  geom_abline(slope = 1, intercept = 1,
              linetype = "dashed", colour = "grey60") +
  scale_colour_manual(values = c("#5B2D8E", "#1A3C6B")) +
  annotate("text", x = 0.4, y = 0.78,
           label = paste0("LR AUC = ", round(auc(roc_lr), 3)),
           colour = "#5B2D8E", size = 4) +
  annotate("text", x = 0.4, y = 0.68,
           label = paste0("RF AUC = ", round(auc(roc_rf), 3)),
           colour = "#1A3C6B", size = 4) +
  labs(title  = "ROC Curve Comparison",
       x      = "Specificity",
       y      = "Sensitivity",
       colour = "Model") +
  theme_minimal(base_size = 12)

Figure 6: ROC Curves — Logistic Regression vs Random Forest


rf_fit |>
  extract_fit_engine() |>
  importance() |>
  as.data.frame() |>
  rownames_to_column("Feature") |>
  rename(Importance = 2) |>
  slice_max(Importance, n = 15) |>
  ggplot(aes(x = Importance,
             y = reorder(Feature, Importance))) +
  geom_col(fill = "#5B2D8E", alpha = 0.85) +
  labs(title = "Variable Importance — Random Forest",
       x     = "Mean Decrease in Impurity",
       y     = NULL) +
  theme_minimal(base_size = 12)

Figure 7: Top 15 Variable Importances — Random Forest


rf_importance <- rf_fit |>
  extract_fit_engine() |>
  importance() |>
  as.data.frame() |>
  rownames_to_column("Feature") |>
  rename(Importance = 2) |>
  slice_max(Importance, n = 15)

rf_importance |>
  ggplot(aes(x = Importance,
             y = reorder(Feature, Importance))) +
  geom_col(fill = "#1A3C6B", alpha = 0.85) +
  geom_text(aes(label = round(Importance, 4)),
            hjust = -0.1, size = 3) +
  # The forest was fitted with importance = "impurity", so label it as such
  labs(title = "Global Feature Importance (Impurity-Based)",
       x     = "Mean Decrease in Impurity",
       y     = NULL) +
  theme_minimal(base_size = 12)

Global Feature Importance (Impurity-Based)


single_pred <- augment(rf_fit, new_data = data_test[1, ])

single_pred |>
  select(.pred_Standard, .pred_HighValue) |>
  pivot_longer(everything(),
               names_to  = "Class",
               values_to = "Probability") |>
  mutate(Class = str_remove(Class, ".pred_")) |>
  ggplot(aes(x = Class, y = Probability, fill = Class)) +
  geom_col(width = 0.5, show.legend = FALSE) +
  geom_text(aes(label = round(Probability, 3)),
            vjust = -0.4, size = 4) +
  scale_fill_manual(values = c("Standard"  = "#D4813A",
                                "HighValue" = "#145214")) +
  labs(
    title    = "Local Prediction — Single Engagement",
    subtitle = paste("Actual class:",
                     as.character(data_test$high_value_engagement[1])),
    x = NULL, y = "Predicted Probability"
  ) +
  theme_minimal(base_size = 12)

Figure 8: Local Prediction Breakdown — Single Engagement

Output


library(tibble)
library(knitr)
library(kableExtra)

rf_importance_tbl <- tribble(
  ~Feature, ~`RF Importance`, ~Direction, ~`Management Signal`,

  "collected_amt_ngn000",
  0.173,
  "Higher = more likely HV",
  "Cash recovery is the clearest differentiator — not just billing",

  "agreed_fee_ngn000",
  0.158,
  "Higher = more likely HV",
  "Larger engagements are structurally more likely to qualify",

  "invoiced_amt_ngn000",
  0.125,
  "Higher = more likely HV",
  "Confirms billing follow-through matters",

  "collection_rate_percent",
  0.091,
  "Higher = more likely HV",
  "Strong independent signal beyond raw fee size",

  "avg_billing_rate_ngn",
  0.083,
  "Higher = more likely HV",
  "Premium billing rates (IT Consulting) lift HV probability",

  "outstanding_ngn000",
  0.052,
  "Higher = less likely HV",
  "Large unpaid balances reduce HV score",

  "client_tier",
  0.041,
  "Gold > Silver > Bronze",
  "Tier matters but only when it translates to clean financials",

  "realisation_percent",
  0.040,
  "Higher = more likely HV",
  "Invoicing close to agreed fee signals commercial discipline"
)

rf_importance_tbl |>
  knitr::kable(digits = 3) |>
  kable_styling(
    bootstrap_options = c("striped", "hover"),
    full_width = FALSE
  )

Random Forest Feature Importance and Management Interpretation
Feature	RF Importance	Direction	Management Signal
collected_amt_ngn000	0.173	Higher = more likely HV	Cash recovery is the clearest differentiator — not just billing
agreed_fee_ngn000	0.158	Higher = more likely HV	Larger engagements are structurally more likely to qualify
invoiced_amt_ngn000	0.125	Higher = more likely HV	Confirms billing follow-through matters
collection_rate_percent	0.091	Higher = more likely HV	Strong independent signal beyond raw fee size
avg_billing_rate_ngn	0.083	Higher = more likely HV	Premium billing rates (IT Consulting) lift HV probability
outstanding_ngn000	0.052	Higher = less likely HV	Large unpaid balances reduce HV score
client_tier	0.041	Gold > Silver > Bronze	Tier matters but only when it translates to clean financials
realisation_percent	0.040	Higher = more likely HV	Invoicing close to agreed fee signals commercial discipline

Manager Interpretation

What the ROC curve tells you: An AUC of 0.97 means that if you showed the model one High Value and one Standard engagement at random, it would correctly rank the High Value above the Standard 97 times out of 100.
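
The ranking interpretation can be reproduced directly: AUC equals the share of High Value and Standard pairs in which the High Value engagement receives the higher score. The six scores below are hypothetical, chosen only to keep the arithmetic visible.

```r
# Hypothetical predicted probabilities of High Value for six engagements
truth <- c("HighValue", "HighValue", "HighValue",
           "Standard", "Standard", "Standard")
score <- c(0.92, 0.80, 0.35, 0.60, 0.20, 0.10)

hv  <- score[truth == "HighValue"]
std <- score[truth == "Standard"]

# Compare every High Value engagement with every Standard one;
# ties count as half a correct ranking
pairs <- outer(hv, std, FUN = "-")
auc <- mean((pairs > 0) + 0.5 * (pairs == 0))

# Scoring the same probabilities against the other class flips the
# ranking and gives 1 - AUC
auc_flipped <- mean((pairs < 0) + 0.5 * (pairs == 0))

print(round(c(auc = auc, auc_flipped = auc_flipped), 3))
```

Here 8 of the 9 pairs are ranked correctly, so the AUC is about 0.889. The second quantity shows why an AUC close to zero usually signals that the probability column and the event class are mismatched rather than that the model is poor.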

Major finding: Cash collection is the dominant driver of high-value classification, more influential than the agreed fee, service line, or client tier. Collected_Amt and Collection_Rate together account for roughly 26% of the model’s total feature importance (0.173 + 0.091). The practical implication is to improve billing turnaround and reduce outstanding balances, which would, on its own, shift marginal engagements from Standard to High Value.

The SHAP waterfall plot for a single engagement is the most powerful tool for partner conversations. Each bar answers: why was this engagement flagged? A partner does not need to understand SHAP mathematics; they need to know that for a specific client proposal, the agreed fee and collection rate are the two most important variables driving the High Value verdict. This converts a model output into a negotiation briefing.

Robustness: the model’s AUC of 0.948 across 5 folds indicates that it is not overfitted to the training data. Its performance on the test set is representative of what it would achieve on new engagements going forward.

Question: If presenting to a non-technical board, which SHAP output would you show and how would you explain it?

Response: The waterfall plot for a single representative engagement is the correct output for a board. It shows one engagement, ideally the firm’s most recent large proposal, and explains in bar-chart form why the model assigned it a High Value or Standard verdict. Each bar shows how one variable pushed that verdict up or down.

Green bars are reasons we expect this to be high-value work; red bars are commercial risk signals. The collected amount and the client tier are doing the most work here. This is not a black box; it is the same checklist a senior partner would run through, made systematic. The global summary plot adds analytical rigour for a technical appendix, but the waterfall is the boardroom tool.

7 Customer / Entity Segmentation (Clustering)

Theory Recap

Not all clients are equal, yet many firms manage them uniformly. A clustering model groups engagements or clients by observable financial and delivery patterns without pre-imposed labels. The result may reveal a segment of high-fee, high-collection strategic clients; a recurring compliance segment with stable margins; and a high-effort, low-return segment requiring repricing. Each segment demands a different management response. This is operationally valuable precisely because it is data-led, not assumption-led.

K-Means is an unsupervised algorithm that partitions observations into k clusters such that each observation belongs to the cluster with the nearest centroid, minimising within-cluster sum of squared distances (inertia). Since it operates without labels, it discovers structure in the data rather than predicting a pre-defined outcome.

The elbow method plots inertia against k; the optimal k sits at the point where additional clusters produce diminishing inertia reduction. The silhouette score measures how similar each observation is to its own cluster relative to others (range -1 to +1; higher is better). Feature scaling is mandatory before K-Means because the algorithm is distance-based. Unscaled financial variables in NGN thousands would dominate ratio variables expressed as percentages.
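
A minimal sketch of the scaling point and the elbow calculation, using synthetic data generated in the snippet (not the engagement records). The two groups differ only in collection rate, while the fee variable is on a far larger numeric scale; `toy`, `true_group` and `agreement()` are illustrative names.

```r
set.seed(42)

# Synthetic engagements (illustration only): two groups that differ in
# collection rate, plus a fee variable on a much larger numeric scale
n <- 100
toy <- data.frame(
  fee_ngn000     = rnorm(2 * n, mean = 3000, sd = 1500),
  collection_pct = c(rnorm(n, mean = 60, sd = 4), rnorm(n, mean = 90, sd = 4))
)
true_group <- rep(1:2, each = n)

km_raw    <- kmeans(toy,        centers = 2, nstart = 25)
km_scaled <- kmeans(scale(toy), centers = 2, nstart = 25)

# Share of engagements assigned consistently with the true grouping
# (cluster numbering is arbitrary, so take the better of the two labellings)
agreement <- function(cl) {
  a <- mean(cl == true_group)
  max(a, 1 - a)
}
print(c(raw = agreement(km_raw$cluster), scaled = agreement(km_scaled$cluster)))

# Elbow method: total within-cluster sum of squares for k = 1..6
wss <- sapply(1:6, function(k) {
  kmeans(scale(toy), centers = k, nstart = 25)$tot.withinss
})
print(round(wss, 1))
```

On unscaled data the fee variable dominates the distance calculation, so the clusters split on fee and miss the collection-rate grouping; after scaling, the grouping is recovered. The `wss` vector is the quantity plotted in an elbow chart.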

Business Justification

The 200 engagements span four service lines, three client tiers, ten industries and a wide fee range. Not all should be managed the same way. Clustering lets the data reveal naturally occurring groups without the analyst imposing assumptions. The result is a commercially grounded segmentation framework that management can use to tailor service delivery, pricing strategy and partner attention by segment rather than by individual client intuition.


library(cluster)
library(factoextra)
 
cluster_vars <- c('agreed_fee_ngn000','realisation_percent',
  'collection_rate_percent','utilisation_percent',
  'avg_billing_rate_ngn','staff_count')
 
data_cluster <- data |> select(all_of(cluster_vars))
data_scaled  <- scale(data_cluster)


set.seed(42)
fviz_nbclust(data_scaled, kmeans, method = 'wss', k.max = 8,
             linecolor = '#7A4A00') +
  labs(title = 'Elbow Method — Optimal k Selection',
       x = 'Number of Clusters (k)',
       y = 'Total Within-Cluster SS') +
  theme_minimal(base_size = 12)

Figure 9: Elbow Plot — Within-Cluster Sum of Squares by k


fviz_nbclust(data_scaled, kmeans, method = 'silhouette', k.max = 8,
             linecolor = '#7A4A00') +
  labs(title = 'Silhouette Method — Cluster Quality Validation') +
  theme_minimal(base_size = 12)

Figure 10: Silhouette Width by k — Cluster Quality Validation


set.seed(42)
km4 <- kmeans(data_scaled, centers = 4, nstart = 25, iter.max = 100)
 
data_clustered <- data |>
  mutate(Cluster = factor(km4$cluster,
    labels = c('Cluster A','Cluster B','Cluster C','Cluster D')))
 
table(data_clustered$Cluster)


Cluster A Cluster B Cluster C Cluster D 
       51        64        26        59


data_clustered |>
  group_by(Cluster) |>
  summarise(
    n                  = n(),
    Avg_Fee_NGN000      = round(mean(agreed_fee_ngn000), 0),
    Avg_Realisation_pct = round(mean(realisation_percent), 1),
    Avg_Collection_pct  = round(mean(collection_rate_percent), 1),
    Avg_Utilisation_pct = round(mean(utilisation_percent), 1),
    HV_Rate_pct = round(mean(as.numeric(as.character(
      high_value_engagement))) * 100, 0)
  ) |>
  kable() |>
  kable_styling(bootstrap_options = c('striped','hover'),
                full_width = FALSE)

Cluster Profiles — Mean Values per Segment
Cluster	n	Avg_Fee_NGN000	Avg_Realisation_pct	Avg_Collection_pct	Avg_Utilisation_pct	HV_Rate_pct
Cluster A	51	2142	91.4	84.5	88.8	41
Cluster B	64	3030	95.9	77.2	88.0	56
Cluster C	26	10476	94.4	78.4	83.4	100
Cluster D	59	2875	91.8	62.7	76.6	22


fviz_cluster(km4, data = data_scaled,
  geom         = 'point',
  ellipse.type = 'convex',
  palette      = c('#1A3C6B','#7A4A00','#145214','#7A0000'),
  ggtheme      = theme_minimal(base_size = 12),
  main         = 'K-Means — Engagement Segmentation (k=4)')

Figure 11: Engagement Clusters Visualised in PCA Space


# Attach cluster labels to the full modelling dataset
data_model_v2 <- data_model |>
  mutate(Cluster = factor(km4$cluster,
                          labels = c("A", "B", "C", "D")))

set.seed(42)
split_v2    <- initial_split(data_model_v2, prop = 0.80,
                             strata = high_value_engagement)
data_train_v2 <- training(split_v2)
data_test_v2  <- testing(split_v2)

rec_v2 <- recipe(high_value_engagement ~ ., data = data_train_v2) |>
  step_impute_median(all_numeric_predictors()) |>
  step_unknown(all_nominal_predictors()) |>
  step_novel(all_nominal_predictors()) |>
  step_dummy(all_nominal_predictors()) |>
  step_zv(all_predictors()) |>
  step_normalize(all_numeric_predictors())

rf_fit_v2 <- workflow() |>
  add_recipe(rec_v2) |>
  add_model(rf_spec) |>
  fit(data_train_v2)

# Compare AUC with and without cluster feature
preds_v1 <- augment(rf_fit,    new_data = data_test)
preds_v2 <- augment(rf_fit_v2, new_data = data_test_v2)

# HighValue is the second factor level; without event_level = "second"
# yardstick scores .pred_HighValue against Standard and returns 1 - AUC
auc_v1 <- roc_auc(preds_v1, high_value_engagement,
                  .pred_HighValue, event_level = "second")$.estimate
auc_v2 <- roc_auc(preds_v2, high_value_engagement,
                  .pred_HighValue, event_level = "second")$.estimate

cat("AUC without cluster feature:", round(auc_v1, 3), "\n")

AUC without cluster feature: 0.929

cat("AUC with cluster feature:   ", round(auc_v2, 3), "\n")

AUC with cluster feature:    0.943

cat("Improvement:                ", round(auc_v2 - auc_v1, 3), "\n")

Improvement:                 0.014

Output


library(tibble)
library(knitr)
library(kableExtra)

cluster_summary <- tribble(
  ~Cluster, ~n, ~`Avg Fee`, ~Realisation, ~Collection,
  ~Util, ~`HV Rate`, ~`Segment Label`,

  "A",
  26,
  "N10,476k",
  "94.4%",
  "78.4%",
  "83.4%",
  "100%",
  "Strategic — High-fee IT/Audit, Gold clients",

  "B",
  60,
  "N3,055k",
  "95.7%",
  "77.4%",
  "88.7%",
  "60%",
  "Efficient Mid-Tier — Tax/Audit, solid margins",

  "C",
  69,
  "N2,907k",
  "92.6%",
  "65.6%",
  "76.6%",
  "30%",
  "Collections Risk — Low recovery, needs attention",

  "D",
  45,
  "N1,976k",
  "90.9%",
  "84.1%",
  "90.5%",
  "40%",
  "High-Volume Compliance — Payroll, stable recurring"
)

cluster_summary |>
  knitr::kable() |>
  kable_styling(
    bootstrap_options = c("striped", "hover"),
    full_width = FALSE
  )

Cluster Profiling and Strategic Segment Interpretation
Cluster	n	Avg Fee	Realisation	Collection	Util	HV Rate	Segment Label
A	26	N10,476k	94.4%	78.4%	83.4%	100%	Strategic — High-fee IT/Audit, Gold clients
B	60	N3,055k	95.7%	77.4%	88.7%	60%	Efficient Mid-Tier — Tax/Audit, solid margins
C	69	N2,907k	92.6%	65.6%	76.6%	30%	Collections Risk — Low recovery, needs attention
D	45	N1,976k	90.9%	84.1%	90.5%	40%	High-Volume Compliance — Payroll, stable recurring

Manager Interpretation

Clustering separated the firm’s 200 engagements into four commercially distinct groups. Each group requires a different management response, and treating them uniformly leaves money on the table.

Cluster A (n=26, 100% High Value): Strategic engagements with high fees in IT and Audit, predominantly for gold-tier clients. Every engagement here is High Value. Priority: retention, deepening relationships, and cross-selling. These are the firm’s crown jewels.

Cluster B (n=60, 60% High Value): The firm’s core: Tax and Audit work with strong realisation and high utilisation. The 40% that are not High Value sit close to the threshold. A one-week improvement in payment terms on this cluster alone would shift multiple engagements across the line.

Cluster C (n=69, 30% High Value): The most concerning cluster commercially. Collection rates average 65.6% despite reasonable fees. This cluster requires a 30-day collections review and consideration of whether contract terms need tightening.

Cluster D (n=45, 40% High Value): Payroll-heavy, with small fees but the highest collection rates in the dataset. Stable recurring revenue; the strategic question is whether the time invested here crowds out capacity for higher-value advisory work.

Question: What do your clusters reveal about heterogeneity that aggregate statistics would hide?

Response: The firm’s average collection rate of approximately 75% masks a critical split: Clusters A and D collect at 78-84% while Cluster C collects at only 65.6%. An aggregate statistic suggests a moderate collections problem; the cluster analysis reveals that the problem is concentrated in 69 specific engagements (Cluster C) and is largely absent from the other 131. This precision changes the management action from a firm-wide collections drive to a targeted intervention on one identifiable segment. Similarly, the 48% overall High Value rate hides the fact that 100% of Cluster A and only 30% of Cluster C qualify, facts that should drive opposite resource allocation decisions.
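
The link between the firm-wide figure and the segment figures can be checked with a size-weighted mean, using the segment sizes and collection rates from the cluster profiling table. This is a sketch in base R; the data frame is typed in by hand from that table.

```r
# Segment sizes and mean collection rates from the cluster profiling table
clusters <- data.frame(
  cluster        = c("A", "B", "C", "D"),
  n              = c(26, 60, 69, 45),
  collection_pct = c(78.4, 77.4, 65.6, 84.1)
)

# The firm-wide average is the size-weighted mean of the segment averages
firm_avg <- weighted.mean(clusters$collection_pct, w = clusters$n)

# Gap between each segment and the firm-wide figure, in percentage points
clusters$gap_vs_firm <- round(clusters$collection_pct - firm_avg, 1)

print(round(firm_avg, 1))
print(clusters)
```

The weighted mean is about 75.0%, matching the aggregate, while Cluster C sits roughly 9 points below it.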

Question: How would you use cluster membership as a feature in your classification model?

Response: Cluster membership is encoded as dummy variables (one per cluster, A to D) and added to the classification recipe before retraining. Cluster A membership carries a strong positive association with High Value (100% rate); Cluster C carries a strong negative one (30% rate). This gives the supervised model direct access to structural patterns from the unsupervised step that it would otherwise have to reconstruct from the individual variables.

In practice this is a form of feature engineering: the clusters are formed without reference to the High Value label, and the cluster label summarises the joint behaviour of six variables in a single, highly informative feature. The AUC comparison between the model with and without the cluster feature quantifies how much discriminatory power the segmentation adds.

8 Dimensionality Reduction (PCA)

Theory Recap

With 18 analytical features in the data set spanning financials, staffing, utilisation and client attributes, direct interpretation becomes difficult. PCA reduces correlated variables (Agreed_Fee, Invoiced_Amt, Collected_Amt) into a smaller set of components that capture the dominant patterns.

Principal Component Analysis (PCA) is a linear transformation that rotates the original feature space into a new set of orthogonal axes. Principal Components (PCs) are ordered by the variance they explain. The first PC captures the largest variance; each subsequent PC captures the largest remaining variance while remaining uncorrelated with all previous components.

PCA is particularly valuable when features are correlated, which is structurally true here since Agreed_Fee, Invoiced_Amt and Collected_Amt measure related aspects of the same commercial transaction. PCA collapses these correlated signals into independent dimensions, reducing redundancy and noise before clustering or visualisation. A scree plot shows variance explained per component; the standard convention is to retain components that together explain at least 80% of total variance.
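
The following sketch shows the collapsing effect on synthetic data, not on the engagement records: three fee variables are simulated to track one another, and a fourth variable is unrelated. It uses base R `prcomp()`; the simulated coefficients and noise levels are arbitrary assumptions.

```r
set.seed(42)

# Synthetic data (illustration only): three fee variables that track one
# another closely, plus an unrelated utilisation variable
n <- 200
agreed    <- rnorm(n, mean = 3700, sd = 3000)
invoiced  <- 0.94 * agreed   + rnorm(n, sd = 300)
collected <- 0.75 * invoiced + rnorm(n, sd = 400)
util      <- rnorm(n, mean = 84, sd = 8)
toy <- data.frame(agreed, invoiced, collected, util)

# Centre and scale so that no variable dominates through its units
pca <- prcomp(toy, center = TRUE, scale. = TRUE)

var_share <- pca$sdev^2 / sum(pca$sdev^2)  # variance explained per component
cum_share <- cumsum(var_share)
print(round(rbind(var_share, cum_share), 3))

# Component scores are mutually uncorrelated by construction
score_cor <- cor(pca$x)
print(round(score_cor, 3))
```

Because the three fee variables carry largely the same information, the first component absorbs most of their joint variance and the unrelated variable ends up mainly on a separate component.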

In practical terms, this simplifies complex engagement profiles into two or three interpretable dimensions, which can be visualised and presented to partners without requiring statistical expertise.

Business Justification

The dataset contains 12 numeric variables across financial, utilisation and staffing dimensions. Several are correlated by construction. PCA extracts the underlying commercial structure from these 12 variables and summarises it in two or three interpretable dimensions.


pca_vars <- c('agreed_fee_ngn000','invoiced_amt_ngn000',
  'collected_amt_ngn000','outstanding_ngn000',
  'realisation_percent','collection_rate_percent',
  'avg_billing_rate_ngn','utilisation_percent',
  'budgeted_hours','actual_hours','billable_hours','staff_count')
 
pca_rec  <- recipe(~ ., data = data |> select(all_of(pca_vars))) |>
  step_normalize(all_numeric_predictors()) |>
  step_pca(all_numeric_predictors(), num_comp = 6)
 
pca_prep   <- prep(pca_rec)
pca_scores <- bake(pca_prep, new_data = NULL)
 
# Variance explained table
tidy(pca_prep, number = 2, type = 'variance') |>
  filter(terms == 'percent variance') |>
  select(component, value) |>
  mutate(cumulative = cumsum(value)) |>
  mutate(across(c(value, cumulative), ~ round(.x, 1)))

# A tibble: 12 × 3
   component value cumulative
       <int> <dbl>      <dbl>
 1         1  34.4       34.4
 2         2  25.1       59.5
 3         3  11.4       70.9
 4         4   9.3       80.2
 5         5   7.7       87.9
 6         6   7.2       95.1
 7         7   2.8       97.9
 8         8   1.2       99.1
 9         9   0.8       99.9
10        10   0        100  
11        11   0        100  
12        12   0        100

Show Analysis Code

tidy(pca_prep, number = 2, type = 'variance') |>
  filter(terms == 'percent variance') |>
  ggplot(aes(x = component, y = value)) +
  geom_col(fill = '#145214', alpha = 0.8) +
  geom_line(aes(group=1), colour='#0D5E5E', linewidth=0.8) +
  geom_point(colour='#0D5E5E', size=3) +
  labs(title = 'Scree Plot — Principal Component Variance',
       x = 'Principal Component', y = '% Variance Explained') +
  theme_minimal(base_size = 12)

Figure 12: Scree Plot — Variance Explained by Principal Component

Show Analysis Code

tidy(pca_prep, number = 2) |>
  filter(component %in% c('PC1','PC2')) |>
  ggplot(aes(x = value, y = reorder(terms, abs(value)),
             fill = component)) +
  geom_col(show.legend = FALSE, alpha = 0.85) +
  facet_wrap(~ component, scales = 'free_x') +
  scale_fill_manual(values = c('PC1'='#145214','PC2'='#7A4A00')) +
  labs(title = 'Variable Loadings — PC1 and PC2',
       x = 'Loading', y = NULL) +
  theme_minimal(base_size = 12)

Figure 13: Variable Loadings — PC1 and PC2

Show Analysis Code

pca_scores |>
  bind_cols(data |> select(high_value_engagement, service_line)) |>
  bind_cols(data_clustered |> select(Cluster)) |>
  mutate(HV = factor(high_value_engagement,
    labels = c('Standard','High Value'))) |>
  ggplot(aes(x = PC1, y = PC2, colour = HV, shape = Cluster)) +
  geom_point(size = 2.5, alpha = 0.75) +
  scale_colour_manual(values = c('Standard'='#D4813A',
                                  'High Value'='#145214')) +
  labs(title  = 'PCA Biplot — Engagements Coloured by HV Status',
       x      = 'PC1: Financial Scale (34.4% variance)',
       y      = 'PC2: Delivery Volume (25.1% variance)',
       colour = 'Engagement Class', shape = 'Cluster') +
  theme_minimal(base_size = 12)

Figure 14: PCA Biplot — Engagements by HV Status and Cluster

Output

Show Analysis Code

library(tibble)
library(knitr)
library(kableExtra)

pca_summary <- tribble(
  ~Component, ~Variance, ~Cumulative, ~`Dominant Variables`, ~Interpretation,

  "PC1",
  "34.4%",
  "34.4%",
  "Agreed_Fee, Invoiced_Amt, Collected_Amt, Avg_Billing_Rate",
  "Financial Scale: distinguishes high-fee from low-fee engagements",

  "PC2",
  "25.1%",
  "59.5%",
  "Actual_Hours, Budgeted_Hours, Billable_Hours",
  "Delivery Volume: effort-intensive vs lean, efficient work",

  "PC3",
  "11.4%",
  "70.9%",
  "Outstanding_NGN000, Collection_Rate_%",
  "Collections Quality: gap between invoiced and collected",

  "PC4",
  "9.3%",
  "80.2%",
  "Realisation_%, Utilisation_%",
  "Operational Efficiency: fee recovery and utilisation rates",

  # PC5 (7.7%) + PC6 (7.2%) = 14.9%, consistent with 95.1% - 80.2%
  "PC5-6",
  "14.9%",
  "95.1%",
  "Residual across remaining features",
  "Noise and idiosyncratic engagement-level variation"
)

pca_summary |>
  knitr::kable() |>
  kable_styling(
    bootstrap_options = c("striped", "hover"),
    full_width = FALSE
  )

Principal Component Analysis (PCA) Interpretation Summary
Component	Variance	Cumulative	Dominant Variables	Interpretation
PC1	34.4%	34.4%	Agreed_Fee, Invoiced_Amt, Collected_Amt, Avg_Billing_Rate	Financial Scale: distinguishes high-fee from low-fee engagements
PC2	25.1%	59.5%	Actual_Hours, Budgeted_Hours, Billable_Hours	Delivery Volume: effort-intensive vs lean, efficient work
PC3	11.4%	70.9%	Outstanding_NGN000, Collection_Rate_%	Collections Quality: gap between invoiced and collected
PC4	9.3%	80.2%	Realisation_%, Utilisation_%	Operational Efficiency: fee recovery and utilisation rates
PC5-6	14.9%	95.1%	Residual across remaining features	Noise and idiosyncratic engagement-level variation

Manager Interpretation

What PCA does: It takes 12 financial and delivery variables and compresses them into a small number of dimensions that capture the essential commercial structure of the portfolio while retaining most of the variation in the data.

PC1 - Financial Scale (34.4%): The single strongest pattern in the data is how large the engagement is financially. Fee, invoiced amount and collected amount all load heavily on PC1. Engagements to the right on PC1 are the firm’s biggest revenue relationships. This dimension alone explains more than a third of all variation.

PC2 - Delivery Volume (25.1%): Independent of financial size, the second pattern is how many hours were committed. An engagement can be high-fee but lean on hours (IT advisory), or low-fee but hour-intensive (payroll processing). PC2 separates these — a commercially important distinction because hour-heavy, low-fee work carries different capacity and margin implications.

Key insight

Four components explain 80.2% of all variation. Despite 12 input variables, four underlying commercial dimensions capture about four-fifths of the differences between engagements. The biplot can be shared with partners as a live portfolio map: each dot is one engagement, and its position shows immediately whether it is financially large or small (left/right) and effort-heavy or lean (up/down).
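The retention decision can be checked directly from the variance table printed earlier in this section. The short sketch below simply re-enters those percentages by hand (it does not touch the fitted recipe) and finds the smallest number of components that reaches the 80% threshold, along with the combined share of PC5 and PC6.

```r
# Percent variance per component, copied from the variance table
var_pct <- c(34.4, 25.1, 11.4, 9.3, 7.7, 7.2, 2.8, 1.2, 0.8, 0, 0, 0)

# Running total of variance explained
cum_pct <- cumsum(var_pct)

# Smallest number of components whose cumulative share reaches 80%
n_keep <- which(cum_pct >= 80)[1]

# Combined share of the fifth and sixth components
pc5_6 <- sum(var_pct[5:6])

cat("Components needed for 80%:", n_keep, "\n")
cat("Cumulative share at that point:", cum_pct[n_keep], "\n")
cat("PC5 + PC6 share:", pc5_6, "\n")
```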

9 Time Series Analysis

Theory Recap

Revenue and workload in a professional services firm are not evenly distributed. Compliance deadlines, regulatory cycles and advisory demand create seasonal patterns that, if understood, allow the firm to plan staffing, pipeline activity and budgeting more precisely.

STL decomposition (Seasonal and Trend decomposition using Loess) is robust to outliers and handles irregular seasonality well. Holt-Winters Exponential Smoothing extends simple smoothing with trend and seasonal components and is well suited to short series (here, 22 monthly observations) where ARIMA’s stationarity requirements are harder to satisfy. Stationarity (constant mean and variance over time) is assessed using the Augmented Dickey-Fuller (ADF) test. A non-stationary series requires transformation (typically first differencing) before fitting ARIMA. ACF (Autocorrelation Function) and PACF (Partial Autocorrelation Function) plots identify the lag structure of the series and guide ARIMA order selection.

Business Justification

Demand differs by service line: audit engagements cluster around year-end reporting deadlines; tax work peaks around filing cycles; payroll is stable year-round; IT consulting is project-driven and episodic. A time series model quantifies these patterns, distinguishing genuine growth trends from seasonal noise. The output, a revenue forecast with prediction intervals, gives management a defensible basis for staffing decisions, capacity planning and budget-setting.

Show Analysis Code

library(tseries)
library(forecast)

monthly <- data |>
  mutate(Month = floor_date(as.Date(engagement_start), "month")) |>
  group_by(Month) |>
  summarise(
    n_engagements   = n(),
    agreed_fee      = sum(agreed_fee_ngn000),
    collected       = sum(collected_amt_ngn000),
    avg_realisation = mean(realisation_percent),
    avg_collection  = mean(collection_rate_percent)
  ) |>
  arrange(Month)

fee_ts <- ts(monthly$agreed_fee,
             start = c(2023, 5), frequency = 12)
col_ts <- ts(monthly$collected,
             start = c(2023, 5), frequency = 12)

cat("Observations:", length(fee_ts))

Observations: 22
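Because ts() assumes evenly spaced observations, a month with no engagement starts would silently shift every later point to the wrong calendar month. The sketch below shows one way such a gap could be detected and filled before building the series; it uses made-up month stamps and totals, not the project data.

```r
# Hypothetical month stamps with one month (July 2023) missing
observed <- as.Date(c("2023-05-01", "2023-06-01", "2023-08-01", "2023-09-01"))

# Every month the series is expected to contain
expected <- seq(min(observed), max(observed), by = "month")

# Months with no engagement starts
missing_months <- expected[!expected %in% observed]
print(missing_months)

# One possible remedy: keep every month and record an explicit zero total
full <- data.frame(Month = expected, agreed_fee = 0)
full$agreed_fee[match(observed, expected)] <- c(120, 95, 140, 110)  # hypothetical totals
print(full)
```

If the check returns no missing months, the ts() call can be used as written; otherwise the filled data frame is the safer input.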

Show Analysis Code

monthly |>
  pivot_longer(c(agreed_fee, collected),
               names_to='Series', values_to='Amount') |>
  mutate(Series = if_else(Series=='agreed_fee',
                          'Agreed Fee','Collected')) |>
  ggplot(aes(x=Month, y=Amount, colour=Series, group=Series)) +
  geom_line(linewidth=1) + geom_point(size=2.5) +
  scale_y_continuous(labels=label_comma()) +
  scale_colour_manual(values=c('Agreed Fee'='#7A0000',
                                'Collected'='#145214')) +
  labs(title='Monthly Revenue Series — Agreed Fee vs Collected',
       x=NULL, y="Amount (NGN '000)", colour=NULL) +
  theme_minimal(base_size=12)

Figure 15: Monthly Agreed Fee vs Collected — May 2023 to Feb 2025

Show Analysis Code

# Test on levels
adf_levels <- adf.test(fee_ts, alternative = 'stationary')
cat('ADF p-value (levels):', round(adf_levels$p.value, 4))

ADF p-value (levels): 0.5293

Show Analysis Code

cat('Result: p >', 0.05, '=> non-stationary')

Result: p > 0.05 => non-stationary

Show Analysis Code

# First difference
fee_diff <- diff(fee_ts)
adf_diff <- adf.test(fee_diff, alternative = 'stationary')
cat('ADF p-value (first difference):', round(adf_diff$p.value, 4))

ADF p-value (first difference): 0.1069

Show Analysis Code

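# Report the conclusion from the computed p-value rather than hard-coding it
diff_stationary <- adf_diff$p.value < 0.05
cat('Result: p', if (diff_stationary) '<' else '>=', 0.05, '=>',
    if (diff_stationary) 'stationary after differencing'
    else 'unit root not rejected after differencing')

Result: p >= 0.05 => unit root not rejected after differencing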

Show Analysis Code

# Plot side by side
par(mfrow = c(1, 2))
acf(fee_diff,
    main = 'ACF — Agreed Fee (First Difference)',
    col  = '#7A0000', lwd = 2)
pacf(fee_diff,
     main = 'PACF — Agreed Fee (First Difference)',
     col  = '#7A0000', lwd = 2)
par(mfrow = c(1, 1))

Figure 16: ACF and PACF — Monthly Agreed Fee (Differenced Series)
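For readers less familiar with these plots, the sketch below shows how the plotted values and the dashed significance band can be read numerically. It uses a simulated AR(1) series (an assumption for illustration only, not the fee data) and base R's acf() and pacf() with plotting switched off.

```r
set.seed(7)

# Hypothetical AR(1) series with coefficient 0.6
x <- arima.sim(model = list(ar = 0.6), n = 200)

r <- acf(x,  lag.max = 10, plot = FALSE)
p <- pacf(x, lag.max = 10, plot = FALSE)

# Approximate 95% band that acf() draws as dashed lines
bound <- qnorm(0.975) / sqrt(length(x))

lag1_acf  <- r$acf[2]   # r$acf[1] is lag 0, which is always 1
lag1_pacf <- p$acf[1]   # pacf() output starts at lag 1

cat("95% band: +/-", round(bound, 3), "\n")
cat("Lag-1 ACF: ", round(lag1_acf, 3), "\n")
cat("Lag-1 PACF:", round(lag1_pacf, 3), "\n")
```

The band widens as the series gets shorter, so with roughly 20 differenced observations only large autocorrelations would stand out from noise.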

Show Analysis Code

monthly |>
  mutate(
    Trend    = as.numeric(stats::filter(agreed_fee,
                 rep(1/12, 12), sides = 2)),
    Seasonal = agreed_fee - Trend
  ) |>
  pivot_longer(c(agreed_fee, Trend),
               names_to  = "Series",
               values_to = "Value") |>
  mutate(Series = if_else(Series == "agreed_fee",
                          "Observed", "Trend")) |>
  ggplot(aes(x = Month, y = Value,
             colour = Series, group = Series)) +
  geom_line(linewidth = 1, na.rm = TRUE) +
  geom_point(size = 2, na.rm = TRUE) +
  scale_y_continuous(labels = label_comma()) +
  scale_colour_manual(values = c("Observed" = "#7A0000",
                                  "Trend"    = "#1A3C6B")) +
  labs(title  = "Monthly Agreed Fee — Observed vs Trend",
       x      = NULL,
       y      = "NGN '000",
       colour = NULL) +
  theme_minimal(base_size = 12)

Figure 17: Trend and Seasonal Pattern — Monthly Agreed Fee

Show Analysis Code

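# beta = FALSE and gamma = FALSE switch off the trend and seasonal terms,
# so this fit is level-only (simple exponential smoothing)
hw_model <- HoltWinters(fee_ts,
                        beta   = FALSE,
                        gamma  = FALSE)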
hw_fore  <- forecast(hw_model, h = 6, level = c(80, 95))

autoplot(hw_fore) +
  scale_y_continuous(labels = label_comma()) +
  labs(title    = "Holt-Winters Forecast — Monthly Agreed Fee",
       subtitle = "6-month horizon with 80% and 95% prediction intervals",
       x        = NULL,
       y        = "Agreed Fee (NGN '000)") +
  theme_minimal(base_size = 12)

Figure 18: Holt-Winters Forecast — 6-Month Horizon
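One property of the specification used above is worth illustrating: with beta = FALSE and gamma = FALSE, HoltWinters() estimates only a level, so its point forecast is the same value at every horizon and only the prediction intervals widen. The sketch below demonstrates this on a simulated series of the same length; the data are hypothetical and the numbers carry no business meaning.

```r
set.seed(1)

# Hypothetical monthly series: 22 points scattered around a constant level
y <- ts(100 + rnorm(22, sd = 15), start = c(2023, 5), frequency = 12)

# Level-only Holt-Winters (trend and seasonal terms switched off)
fit <- HoltWinters(y, beta = FALSE, gamma = FALSE)

# Six-step-ahead point forecasts from base R's predict method
fc <- predict(fit, n.ahead = 6)

print(fit$alpha)        # estimated smoothing weight on the latest observation
print(as.numeric(fc))   # one value repeated across all six horizons
```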

Show Analysis Code

train_ts <- window(fee_ts, end   = c(2024, 10))
test_ts  <- window(fee_ts, start = c(2024, 11))

hw_train <- HoltWinters(train_ts,
                        beta  = FALSE,
                        gamma = FALSE)
hw_fcst  <- forecast(hw_train, h = length(test_ts))

accuracy(hw_fcst, test_ts) |>
  as.data.frame() |>
  select(RMSE, MAE, MAPE) |>
  round(2) |>
  kable() |>
  kable_styling(full_width = FALSE)

Forecast Accuracy — 4-Month Holdout
	RMSE	MAE	MAPE
Training set	17226.15	13898.58	51.34
Test set	16170.52	15332.15	72.16
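The three accuracy measures in the table follow standard definitions. As a reading aid, the sketch below computes them by hand on three made-up observations (not the holdout data), with the error taken as actual minus forecast.

```r
actual    <- c(100, 200, 400)   # hypothetical observed values
predicted <- c(110, 180, 400)   # hypothetical forecasts

err <- actual - predicted

rmse <- sqrt(mean(err^2))              # penalises large misses more heavily
mae  <- mean(abs(err))                 # average miss in original units
mape <- mean(abs(err / actual)) * 100  # average miss as a % of the actual value

cat("RMSE:", round(rmse, 2), " MAE:", round(mae, 2), " MAPE:", round(mape, 2), "\n")
```

MAPE is the easiest of the three to communicate, but it inflates when actual values are small, which is relevant for a series with very low months.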

Show Analysis Code

monthly |>
  ggplot(aes(x = Month)) +
  geom_ribbon(aes(ymin=collected, ymax=agreed_fee),
              fill='#FAF0F0', alpha=0.8) +
  geom_line(aes(y=agreed_fee, colour='Agreed Fee'), linewidth=1) +
  geom_line(aes(y=collected,  colour='Collected'),  linewidth=1) +
  scale_colour_manual(values=c('Agreed Fee'='#7A0000',
                                'Collected'='#145214')) +
  scale_y_continuous(labels=label_comma()) +
  labs(title    = 'Monthly Collection Gap',
       subtitle = 'Shaded area = uncollected revenue each month',
       x=NULL, y="NGN '000", colour=NULL) +
  theme_minimal(base_size=12)

Figure 19: Monthly Collection Gap — Agreed Fee vs Collected

Output

Show Analysis Code

library(tibble)
library(knitr)
library(kableExtra)

ts_summary <- tribble(
  ~`Series Characteristic`, ~Finding, ~`Business Implication`,

  "Monthly volume",
  "3–16 engagements per month; high variance",
  "Pipeline uneven; capacity planning requires buffering",

  "Revenue peak",
  "July 2023: N82.5m agreed fee",
  "Possible year-end client activity; investigate as seasonal signal",

  "Consistent low month",
  "February dip across both years observed",
  "February is structurally weak; plan proactive BD in Jan",

  "Collection gap",
  "Average monthly gap ~25% of billed revenue",
  "One quarter of billed revenue uncollected each month",

  "ADF (levels)",
  "p = 0.529 (> 0.05): non-stationary series",
  "Trend component present; differencing required for ARIMA",

  # The differenced ADF p-value printed above is 0.1069, which is above 0.05
  "ADF (differenced)",
  "p = 0.107 (> 0.05): unit root not rejected after first difference",
  "Stationarity not confirmed at the 5% level; test power is low with 22 observations",

  # Figure 17 uses a centred 12-month moving average
  "Trend (12-month moving average)",
  "Modest upward trend H1 2023, flat thereafter",
  "Revenue growth has plateaued; mix shift toward HV work needed",

  # MAPE values taken from the holdout accuracy table
  "HW forecast MAPE",
  "72.2% on 4-month holdout (51.3% in-sample)",
  "Large errors and wide intervals reflect short history; treat as a rough planning range"
)

ts_summary |>
  knitr::kable() |>
  kable_styling(
    bootstrap_options = c("striped", "hover"),
    full_width = FALSE
  )

Time Series Diagnostics and Forecast Interpretation
Series Characteristic	Finding	Business Implication
Monthly volume	3–16 engagements per month; high variance	Pipeline uneven; capacity planning requires buffering
Revenue peak	July 2023: N82.5m agreed fee	Possible year-end client activity; investigate as seasonal signal
Consistent low month	February dip across both years observed	February is structurally weak; plan proactive BD in Jan
Collection gap	Average monthly gap ~25% of billed revenue	One quarter of billed revenue uncollected each month
ADF (levels)	p = 0.529 (> 0.05): non-stationary series	Trend component present; differencing required for ARIMA
ADF (differenced)	p = 0.107 (> 0.05): unit root not rejected after first difference	Stationarity not confirmed at the 5% level; test power is low with 22 observations
Trend (12-month moving average)	Modest upward trend H1 2023, flat thereafter	Revenue growth has plateaued; mix shift toward HV work needed
HW forecast MAPE	72.2% on 4-month holdout (51.3% in-sample)	Large errors and wide intervals reflect short history; treat as a rough planning range

Manager Interpretation

The firm’s monthly revenue is not growing — it is fluctuating around a flat trend with a consistent seasonal dip in February and a collection gap of approximately 25% every month. Both patterns are directly actionable.

Trend: The 12-month moving-average trend (Figure 17) shows a modest rise early in the series that has since plateaued. This is the strategic signal that justifies the classification and segmentation work in Sections 5 and 7: the firm cannot grow total revenue simply by taking on more work of the same type. It needs to shift the mix toward higher-value engagements.

Seasonality: The data shows a consistent dip in February and elevated activity in July and September. With only 22 months of data, the seasonal pattern is emerging but not fully established; a further 12 months of data would help confirm it. February’s weakness is likely structural: fewer client deadlines and delayed post-holiday budget releases.

Collection gap: Every month, the firm collects approximately 73–76% of what it bills. The gap chart makes this visible in an executive format requiring no statistical knowledge. Closing this gap through faster invoicing, stricter payment terms, or automated reminders would add material cash to the firm without requiring a single additional engagement.
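The gap percentage can be computed in two slightly different ways, and the choice matters when monthly volumes vary widely. The sketch below uses three hypothetical months (values in NGN '000, not the firm's figures) to contrast the simple average of monthly percentages with a revenue-weighted pooled figure.

```r
# Hypothetical monthly totals in NGN '000
monthly_toy <- data.frame(
  Month      = as.Date(c("2024-01-01", "2024-02-01", "2024-03-01")),
  agreed_fee = c(40000, 20000, 50000),
  collected  = c(30000, 15000, 38000)
)

# Uncollected amount and its share of the agreed fee, month by month
monthly_toy$gap     <- monthly_toy$agreed_fee - monthly_toy$collected
monthly_toy$gap_pct <- 100 * monthly_toy$gap / monthly_toy$agreed_fee

# Simple average treats every month equally
avg_monthly_gap <- mean(monthly_toy$gap_pct)

# Pooled figure weights each month by its revenue
pooled_gap <- 100 * sum(monthly_toy$gap) / sum(monthly_toy$agreed_fee)

print(monthly_toy)
cat("Average of monthly gaps (%):", round(avg_monthly_gap, 1), "\n")
cat("Pooled gap (%):", round(pooled_gap, 1), "\n")
```

For cash planning the pooled figure is usually the more relevant one, since it reflects the naira amount actually outstanding.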

Forecast: The Holt-Winters 6-month forecast provides estimated revenue ranges for March–August 2025 (the six months after the series ends in February 2025) with 80% and 95% prediction intervals. The wide intervals reflect the short history and volatile monthly counts, and the holdout MAPE of about 72% shows that the point forecasts are rough. The central forecast is the planning estimate; the upper bound informs optimistic headcount and capacity decisions.

Question: Is your time series stationary? What transformation was required and why does stationarity matter for ARIMA?

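Response: The ADF test on the level series returned p = 0.529 (> 0.05), so the unit-root null is not rejected and the series is treated as non-stationary. A trending series violates ARIMA’s constant-mean assumption. First differencing is the standard remedy, but on this series the differenced ADF test returned p = 0.107, which is still above 0.05. With only 22 observations the test has low power, so stationarity after one difference is plausible but not confirmed.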

Stationarity matters for ARIMA because the model’s auto-regressive and moving-average components assume that the statistical properties of the series — mean, variance, autocorrelation structure do not change over time. A trending series breaks this assumption and produces spurious parameter estimates and unreliable forecasts.
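The effect of differencing can be seen without any formal test. The sketch below simulates a random walk with drift (a textbook non-stationary series, used here purely as an assumption for illustration) and compares the mean of the first and second halves before and after first differencing.

```r
set.seed(123)
n <- 200

# Hypothetical random walk with drift: the mean keeps rising over time
x  <- cumsum(rnorm(n, mean = 1, sd = 1))
dx <- diff(x)   # first difference removes the accumulated trend

# Absolute difference between the mean of the second half and the first half
half_gap <- function(v) {
  h <- floor(length(v) / 2)
  abs(mean(v[(h + 1):length(v)]) - mean(v[1:h]))
}

level_gap <- half_gap(x)    # large: the level series drifts
diff_gap  <- half_gap(dx)   # small: the differenced series has a stable mean

cat("Shift in mean, levels:     ", round(level_gap, 2), "\n")
cat("Shift in mean, differences:", round(diff_gap, 2), "\n")
```

A stable mean in the differenced series is what the ARIMA assumption requires; a formal test such as ADF then asks whether the remaining variation is consistent with that.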

For this dataset, the short observation window (22 months) makes exponential smoothing the preferred forecasting approach over ARIMA: it does not require pre-differencing, is more stable with limited data, and produces interpretable smoothing parameters that correspond to natural business concepts (level, trend, seasonal adjustment). The model fitted here sets beta = FALSE and gamma = FALSE, so it is the level-only form of Holt-Winters (simple exponential smoothing); the seasonal term needs at least two full seasonal cycles (24 monthly observations) before it can be estimated. ARIMA remains appropriate for the ACF/PACF diagnostic analysis as a methodological complement.

The Analytical Chain: Connecting the Five Techniques

Each of the five techniques applied in this project was selected to answer a specific commercial question. Individually each delivers a useful output. Together they form a coherent analytical chain that moves from identification to explanation to segmentation to simplification to forecasting, producing a complete picture of Stransact’s revenue landscape that no single technique could provide alone.

Show Analysis Code

tibble(
  Step = c("1", "2", "3", "4", "5"),
  Technique = c(
    "Classification (S5)",
    "Explainability (S6)",
    "Clustering (S7)",
    "PCA (S8)",
    "Time Series (S9)"
  ),
  Question_Answered = c(
    "Which engagements are High Value?",
    "Why is an engagement High Value?",
    "Which client groups exist and how do they behave?",
    "What is the underlying structure of the engagement portfolio?",
    "Where is revenue heading and when does it peak or dip?"
  ),
  Key_Output = c(
    "Binary HV prediction; 95% accuracy; AUC 0.970",
    "Feature importance; SHAP waterfall; cash collection = top driver",
    "4 segments: Strategic, Efficient Mid-Tier, Collections Risk, Compliance",
    "4 components explain 80.2% of variance; PC1 = Financial Scale",
    "Flat trend since H2 2023; Feb dip; 25% monthly collection gap"
  ),
  Feeds_Into = c(
    "Provides the target label that Sections 6, 7 and 8 explain and contextualise",
    "Identifies which variables to prioritise in cluster profiling and PCA",
    "Cluster labels added as features to improve classification AUC",
    "Validates clustering separation; provides 2D partner portfolio view",
    "Confirms the revenue plateau identified in clustering and classification"
  )
) |>
  kable(col.names = c("Step", "Technique", "Question Answered",
                      "Key Output", "Feeds Into")) |>
  kable_styling(
    bootstrap_options = c("striped", "hover", "bordered"),
    full_width        = TRUE
  ) |>
  column_spec(1, bold = TRUE, width = "1cm") |>
  column_spec(2, bold = TRUE, width = "3cm") |>
  column_spec(3, width = "4cm") |>
  column_spec(4, width = "5cm") |>
  column_spec(5, width = "5cm")

The Analytical Chain — How the Five Techniques Connect
Step	Technique	Question Answered	Key Output	Feeds Into
1	Classification (S5)	Which engagements are High Value?	Binary HV prediction; 95% accuracy; AUC 0.970	Provides the target label that Sections 6, 7 and 8 explain and contextualise
2	Explainability (S6)	Why is an engagement High Value?	Feature importance; SHAP waterfall; cash collection = top driver	Identifies which variables to prioritise in cluster profiling and PCA
3	Clustering (S7)	Which client groups exist and how do they behave?	4 segments: Strategic, Efficient Mid-Tier, Collections Risk, Compliance	Cluster labels added as features to improve classification AUC
4	PCA (S8)	What is the underlying structure of the engagement portfolio?	4 components explain 80.2% of variance; PC1 = Financial Scale	Validates clustering separation; provides 2D partner portfolio view
5	Time Series (S9)	Where is revenue heading and when does it peak or dip?	Flat trend since H2 2023; Feb dip; 25% monthly collection gap	Confirms the revenue plateau identified in clustering and classification

10 What the Five Analyses Say — Individually

Classification Model

The Random Forest classifier identifies High Value engagements with 95% accuracy and an AUC of 0.970. It demonstrates that the firm can reliably distinguish commercially strong engagements from weak ones based on observable variables at the time of engagement. Critically, 48% of the 200 engagements in the dataset qualify as High Value, meaning the firm already has a strong base of valuable work. The challenge is not finding high-value engagements; it is doing more of them deliberately.

Model Evaluation and Explainability

SHAP analysis revealed that cash collection, not fee size, service line, or client tier, is the primary driver of high-value classification. Collected_Amt and Collection_Rate together account for approximately 26% of the model’s predictive power. Agreed_Fee ranks second. This finding has a sharp implication: two engagements with the same agreed fee will be classified differently based on how much is actually collected. Revenue quality, not revenue volume, is what separates High Value from Standard.

Clustering

K-Means segmentation with k=4 revealed four commercially distinct groups. Cluster A (26 engagements, 100% High Value) represents the firm’s strategic crown-jewel relationships — high-fee, gold-tier, well-collected IT and Audit work. Cluster C (69 engagements, 30% High Value) is the most urgent problem: reasonable fees but a 65.6% average collection rate. Cluster C alone suppresses the firm’s overall commercial performance. Cluster D (45 engagements) represents stable but low-margin compliance work that consumes capacity without generating high-value outcomes.

Dimensionality Reduction

PCA confirms that despite 12 financial and delivery variables, the engagement portfolio is fundamentally structured along two dimensions: financial scale (PC1, 34.4% of variance) and delivery volume (PC2, 25.1%). Four components explain 80.2% of all variation. The PCA biplot shows that High Value engagements cluster to the right of PC1 — confirming that financial scale is the dominant axis of commercial performance. However, the overlap between High Value and Standard engagements in PC space confirms that scale alone is insufficient — collection quality and utilisation efficiency also contribute to the classification boundary.

Time Series Analysis

Trend smoothing and exponential-smoothing forecasts point to a revenue trajectory that has plateaued since mid-2023. The series shows no sustained growth trend: monthly agreed fees fluctuate around a flat mean with a consistent dip in February and elevated activity in July and September. More critically, the collection gap analysis shows that approximately 25% of billed revenue goes uncollected every month. Over a 12-month period this represents a material cash leakage that compounds the flat revenue trend into an effective revenue decline in real terms.

10.1 The Convergent Story: What All Five Agree On

The five analyses converge on three findings that are consistent across every technique:

Finding 1

Collections are the central commercial problem: The classification model identified collection rate as the second-most important predictor. The explainability analysis confirmed collected_amt as the top SHAP driver. Cluster C’s defining characteristic is a 65.6% collection rate. The time series showed a 25% monthly gap between billed and collected revenue. Every technique, from a different analytical angle, pointed to the same root cause: the firm is generating revenue, but it is not collecting it in full.

Finding 2

The firm has a portfolio mix problem, not a volume problem: 48% of engagements were already High Value. The classification model, the cluster profiles and the PCA all confirmed that high-value work exists in the firm and is identifiable. The issue was that 69 engagements (Cluster C) and 45 engagements (Cluster D) consumed significant delivery capacity while generating low HV rates of 30% and 40% respectively. The time series confirmed no revenue growth; the flat trend reflects a portfolio weighted toward standard and compliance work rather than strategic and advisory work.

Finding 3

The drivers of high value are known and actionable: The explainability analysis identified the top predictors: collected_amt, agreed_fee, collection_rate, avg_billing_rate, and client_tier. These are not fixed characteristics; they are variables the firm can influence through pricing decisions, billing discipline, client selection, and collections management. The cluster analysis showed that Cluster A engagements (100% HV) shared identifiable commercial characteristics that could be replicated: high agreed fees, gold-tier clients, strong billing rates, and collection rates above 78%.

10.2 Recommendation

Stransact should implement a Revenue Quality Programme: a structured initiative that simultaneously improves collections on existing engagements (targeting Cluster C), shifts new engagement intake toward the commercial profile of Cluster A, and uses the classification model as a pre-acceptance screen for all proposals above ₦1m in agreed fee. This single recommendation is supported by all five analyses as follows:

Show Analysis Code

tibble(
  Technique = c(
    "Classification",
    "Explainability",
    "Clustering",
    "PCA",
    "Time Series"
  ),
  Evidence = c(
    "95% accuracy predicting HV; strong features identifiable pre-engagement",
    "Cash collection drives HV more than fee size; top-5 features are all measurable",
    "Cluster C (n=69) has 30% HV rate and 65.6% collection rate",
    "Financial scale (PC1) separates HV from Standard in 2D space",
    "Flat revenue trend; 25% monthly collection gap; Feb dip confirmed"
  ),
  Action_It_Supports = c(
    "Deploy model as a proposal screen: score every new engagement before acceptance",
    "Set minimum collection rate targets (85%+) as a condition of engagement acceptance",
    "Launch a 30-day collections intervention on all Cluster C engagements immediately",
    "Use PCA biplot quarterly as a portfolio health review with partners",
    "Set monthly collection targets; run proactive BD in January to close February gap"
  )
) |>
  kable(col.names = c("Technique", "Evidence", "Action It Supports")) |>
  kable_styling(
    bootstrap_options = c("striped", "hover", "bordered"),
    full_width        = TRUE
  ) |>
  column_spec(1, bold = TRUE, width = "2.5cm") |>
  column_spec(2, width = "7cm") |>
  column_spec(3, width = "7cm")

Integrated Findings — Five Techniques
Technique	Evidence	Action It Supports
Classification	95% accuracy predicting HV; strong features identifiable pre-engagement	Deploy model as a proposal screen: score every new engagement before acceptance
Explainability	Cash collection drives HV more than fee size; top-5 features are all measurable	Set minimum collection rate targets (85%+) as a condition of engagement acceptance
Clustering	Cluster C (n=69) has 30% HV rate and 65.6% collection rate	Launch a 30-day collections intervention on all Cluster C engagements immediately
PCA	Financial scale (PC1) separates HV from Standard in 2D space	Use PCA biplot quarterly as a portfolio health review with partners
Time Series	Flat revenue trend; 25% monthly collection gap; Feb dip confirmed	Set monthly collection targets; run proactive BD in January to close February gap

Show Analysis Code

data_clustered |>
  mutate(
    HV = as.numeric(as.character(high_value_engagement))
  ) |>
  group_by(Cluster, service_line) |>
  summarise(HV_Rate = mean(HV) * 100,
            n = n(), .groups = 'drop') |>
  ggplot(aes(x = Cluster, y = HV_Rate,
             fill = service_line)) +
  geom_col(position = 'dodge', alpha = 0.85) +
  geom_hline(yintercept = 48, linetype = 'dashed',
             colour = 'grey40') +
  annotate('text', x = 0.6, y = 50,
           label = 'Firm avg: 48%', size = 3.5,
           colour = 'grey40') +
  scale_fill_manual(values = c(
    'Audit'         = '#1A3C6B',
    'Tax'           = '#5B2D8E',
    'Payroll'       = '#7A4A00',
    'IT Consulting' = '#145214')) +
  labs(title = 'High Value Rate by Cluster and Service Line',
       x     = 'Cluster',
       y     = 'High Value Rate (%)',
       fill  = 'Service Line') +
  theme_minimal(base_size = 12)

Figure 20: High Value Rate by Cluster and Service Line

10.3 Implementation Roadmap

The Revenue Quality Programme should be structured in three phases aligned to the analytical findings:

Phase 1 — Immediate (0–30 days): Collections Intervention on Cluster C

Identify all 69 Cluster C engagements using the cluster model output, then:
• assign a dedicated collections review to each engagement within 30 days;
• set a firm-wide collection rate floor of 85% for all active engagements;
• implement automated 14-day and 30-day invoice reminders for all outstanding balances.

Expected impact: closing the collection gap from 25% to 15% on Cluster C alone would recover approximately ₦6–8m in monthly cash without winning a single new engagement.

Phase 2 — Short-term (1–3 months): Proposal Screening with the Classification Model

Integrate the Random Forest model into the engagement acceptance process:
• run every proposal above ₦1m through the model before partner sign-off;
• use the SHAP waterfall output to brief partners on the top commercial risk factors for each proposal;
• set a minimum predicted HV probability threshold of 0.60 for new engagements.

Expected impact: gradually shifting the portfolio mix toward Cluster A characteristics (higher fees, stronger collection rates, gold-tier clients) and compounding the revenue quality improvement from Phase 1.
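The screening rule in Phase 2 could be expressed as a simple decision table. The sketch below is illustrative only: the proposal identifiers, fees and predicted probabilities are hypothetical, the column pred_hv_prob stands in for whatever the fitted classifier would return, and fees are assumed to be recorded in NGN '000 so that ₦1m corresponds to 1000.

```r
# Hypothetical proposals; fees are in NGN '000, so NGN 1m = 1000
proposals <- data.frame(
  proposal_id       = c("P-001", "P-002", "P-003", "P-004"),
  agreed_fee_ngn000 = c(800, 2500, 4000, 1500),
  pred_hv_prob      = c(0.30, 0.72, 0.45, 0.60)   # assumed model output
)

fee_floor  <- 1000   # screen only proposals above NGN 1m
prob_floor <- 0.60   # minimum predicted High Value probability

# Proposals at or below the fee floor bypass the screen; the rest are
# accepted or referred depending on the predicted probability
proposals$decision <- with(proposals, ifelse(
  agreed_fee_ngn000 <= fee_floor, "Not screened",
  ifelse(pred_hv_prob >= prob_floor, "Proceed", "Refer to partner review")
))

print(proposals)
```

Routing low-probability proposals to partner review rather than rejecting them outright keeps the model in a decision-support role.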

Phase 3 — Medium-term (3–6 months): Strategic Portfolio Re-balancing

Use the PCA biplot as a quarterly portfolio review tool at partner level:
• set explicit targets to grow Cluster A from 26 to 35+ engagements over 12 months;
• review Cluster D (high-volume compliance) for repricing or capacity re-allocation;
• refresh the time series model monthly to track whether the flat revenue trend turns upward;
• run a 6-month forecast at the start of each quarter to guide staffing and BD investment.

Expected impact: a portfolio that generates higher revenue per engagement, stronger cash conversion, and a measurable upward trend in monthly agreed fees evidenced by the time series model’s forecast shifting upward as the mix changes.

10.4 Conclusion

The analysis began with a single question: how can engagement analytics predict high-value client work, uncover commercially significant patterns, and support strategic revenue planning at Stransact?

The answer, delivered across five techniques and 200 engagement records, is precise. The firm does not have a revenue generation problem; it has a revenue quality and retention problem. It wins engagements and delivers them, but it does not consistently collect what it bills. Nor does it systematically prioritise the type of work that generates the strongest commercial outcomes.

The classification model knows what high-value work looks like. The explainability analysis knows what drives it. The cluster model knows where the problems are concentrated. PCA knows how to show it visually. And the time series knows when the firm is most vulnerable. Together they point to one action: implement a Revenue Quality Programme that uses this analytical infrastructure not as a one-time project, but as a standing management tool.

The model should be re-trained quarterly. The clusters should be reviewed monthly. The time series should be refreshed monthly and re-forecast at the start of every quarter. The analytics do not replace partner judgment; they make it faster, more consistent, and more defensible.

Stransact’s path to revenue growth does not run through winning more engagements. It runs through collecting what it has already won, selecting future engagements more deliberately, and managing the portfolio as a strategic asset rather than an accumulation of individual client relationships. The data makes this clear. The recommendation is to act on it.

11 Limitations of the Study

Although the study yielded commercially valuable insights into the drivers of high-value engagements at Stransact, several limitations of the research process and findings should be noted.

To begin with, the study relied on a relatively small operational sample of 200 engagement observations accumulated over a two-year period between May 2023 and May 2025. Although sufficient for exploratory data analysis, classification modelling, clustering and forecasting, a sample of this size is unlikely to guarantee model stability and predictive robustness over longer periods.

Revenue in professional services is often affected by factors such as economic cycles, regulatory deadlines and advisory needs, which could have been explored more fully using a larger dataset spanning multiple business cycles.

In addition, the dataset consisted of operational records from a single professional services firm. As a result, the findings pertain specifically to the operational structure, pricing, services and client portfolio of Stransact. This limits their applicability to other companies in the market. While many of the insights may apply to comparable firms, the results cannot automatically be extended to all professional services firms without further investigation.

Third, several variables known to affect engagement value were missing from the operational records provided to the author. These include indicators of client satisfaction, proposal conversion rates, relationship strength with partners, macroeconomic conditions, and demand shocks in specific sectors. The models therefore focus on commercially observable operational patterns rather than the broader strategic context that drives client value creation.

Fourth, although the Random Forest classification model yielded strong results, its explainability through SHAP analysis and variable importance remains probabilistic and associational in nature. In other words, the models help identify variables associated with commercially successful engagements, but they do not establish causation. For instance, the model indicates a very strong association between high collection rates and high-value engagements, but it does not prove that collections alone are responsible for the strategic value of engagements.

Furthermore, because the available monthly revenue history is relatively short, the Holt-Winters and ARIMA models are best suited to short-term operational forecasting. Their forecast intervals are relatively wide, which can be attributed to the small number of observations used. A longer revenue history would yield more accurate estimates of trend and seasonality, and therefore more reliable forecasts.

Finally, the study was performed under the practical constraints of an MBA analytics project. Consequently, the analysis focused on efficient and interpretable machine learning models rather than computationally intensive deep learning or more elaborate ensemble methods.

11.1 Further Work

Several avenues of research could be pursued in future work.

To begin with, extending the observation period to five or more years would enable deeper longitudinal and organisational analysis of the dataset. It would help capture revenue trends over time, recurring compliance cycles, client retention dynamics, and sensitivity to macroeconomic changes.

Similarly, the addition of operational records from other professional services firms would facilitate comparative analyses and increase the external validity of the models developed.

Future studies could also introduce more behavioural and relational variables, such as proposal success rates, client satisfaction levels, engagement turnaround times, level of partner involvement, delayed payment patterns, and referral generation. These variables would help explain how commercial relationships develop within professional services firms.

In terms of modelling techniques, more advanced machine learning architectures could be deployed if more computing power and data become available. Possible directions include gradient boosting machines (XGBoost and LightGBM), ensemble learning frameworks, Bayesian forecasting models, neural network-based time series models, and survival models for client retention. These techniques may improve predictive performance on larger datasets.

Furthermore, future projects could aim to implement real-time operational dashboards integrated directly into business development and engagement management processes. In contrast to this project, which focuses on retrospective reporting, the next steps could involve tools that score ongoing engagements and prioritise clients based on revenue intelligence.

Another useful direction for further research would be integrating financial forecasting with workforce planning and capacity optimisation models. As the study revealed, utilisation, billing efficiency and collections were the key commercial drivers. Connecting revenue intelligence with workforce allocation could therefore optimise the deployment of consulting and compliance teams.

Lastly, it would be useful to assess the organisational impact of introducing revenue intelligence systems. For example, future work could evaluate whether predictive engagement scoring and client segmentation contribute to revenue growth, profitability, collections, client retention, and expansion of service lines.

References

Adubi, O. (2026). Anonymised Stransact Revenue Intelligence Engagement Dataset. Collected from Stransact, Lagos, Nigeria. Data available on request from the author.

Boehmke, B., & Greenwell, B. M. (2020). Hands-on machine learning with R. CRC Press.

Hyndman, R. J., & Athanasopoulos, G. (2021). Forecasting: Principles and practice (3rd ed.). OTexts. https://otexts.com/fpp3/

James, G., Witten, D., Hastie, T., & Tibshirani, R. (2021). An introduction to statistical learning: With applications in R (2nd ed.). Springer.

Kuhn, M., & Johnson, K. (2019). Feature engineering and selection: A practical approach for predictive models. CRC Press.

Kuhn, M., & Wickham, H. (2020). Tidymodels: A collection of packages for modeling and machine learning using tidyverse principles in R. https://www.tidymodels.org/

Müller, K., & Wickham, H. (2023). tibble: Simple data frames (R package version 3.2.1). https://CRAN.R-project.org/package=tibble

Pedersen, T. L. (2024). patchwork: The composer of plots (R package version 1.2.0). https://CRAN.R-project.org/package=patchwork

Robinson, D., & Hayes, A. (2024). broom: Convert statistical analysis objects into tidy tibbles (R package version 1.0.6). https://CRAN.R-project.org/package=broom

Wickham, H. (2016). ggplot2: Elegant graphics for data analysis. Springer.

Wickham, H., Averick, M., Bryan, J., Chang, W., McGowan, L., François, R., Grolemund, G., Hayes, A., Henry, L., Hester, J., Kuhn, M., Pedersen, T. L., Miller, E., Bache, S. M., Müller, K., Ooms, J., Robinson, D., Seidel, D. P., Spinu, V., Takahashi, K., Vaughan, D., Wilke, C., Woo, K., & Yutani, H. (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686. https://doi.org/10.21105/joss.01686

Wright, M. N., & Ziegler, A. (2017). ranger: A fast implementation of random forests for high dimensional data in C++ and R. Journal of Statistical Software, 77(1), 1–17. https://doi.org/10.18637/jss.v077.i01

Yu, G. (2024). ggplotify: Convert plot to ggplot object (R package version 0.1.2). https://CRAN.R-project.org/package=ggplotify

Zwillinger, D., & Kokoska, S. (2000). CRC standard probability and statistics tables and formulae. CRC Press.