Strategic Revenue Intelligence in Professional Services: Predicting and Scaling High-Value Client Engagements
Author
Oluwaseun Adubi
Published
May 22, 2026
Executive Summary
This study examines how engagement-level operational and financial data can be used to predict high-value client engagements within a mid-tier professional services firm. Using anonymised engagement records, the analysis applies predictive modelling, clustering, dimensionality reduction, and time-series forecasting techniques to generate strategic revenue intelligence insights.
The study aims to identify the operational characteristics associated with commercially successful engagements, improve engagement selection and staffing decisions, and support forward-looking revenue planning. Analytical techniques including logistic regression, random forest classification, clustering analysis, principal component analysis (PCA), and ARIMA forecasting are applied to evaluate both engagement profitability drivers and future revenue patterns.
The findings are intended to support evidence-based decision-making in client portfolio management, resource allocation, pricing strategy, and long-term business development planning.
1 Professional Disclosure
I work as a Manager at Stransact Services Limited, a mid-tier professional services firm whose service offerings span tax advisory, regulatory compliance, bookkeeping, audit and assurance support, IT and payroll consulting. My role sits at the intersection of client delivery, commercial execution and management decision support. The firm’s operating model depends on how effectively engagements are priced, staffed, delivered and retained, which means engagement-level data is directly relevant to my day-to-day responsibilities.
This project is not an abstract exercise. It addresses a genuine management problem: identifying which clients, services, pricing approaches and delivery structures generate high-value work, and how those patterns can be replicated deliberately. The five selected techniques (classification, model evaluation and explainability, clustering, dimensionality reduction, and time series analysis) map directly to that objective.
2 Data Collection and Sampling
This study uses a structured, anonymised engagement-level dataset derived from the operational records of Stransact, a mid-tier professional services firm providing tax advisory, regulatory compliance, bookkeeping, audit support, payroll processing and IT consulting-related services. The project was designed to address a commercially important management problem: understanding which engagements generate sustainable financial value and identifying the operational drivers associated with commercially successful client work.
The analysis focuses on the concept of revenue intelligence - the use of operational, financial and delivery data to improve engagement selection, pricing discipline, collections performance and strategic portfolio management within a professional services environment.
Unlike traditional financial reporting, which focuses primarily on historical outcomes, this project applies predictive analytics, segmentation, explainability modelling and forecasting techniques to support forward-looking commercial decision-making.
2.1 Source of Data
The dataset was compiled from three internally maintained operational sources within Stransact.
The first source was the engagement and billing records, which provided variables relating to agreed fees, invoiced amounts, collected revenue, outstanding balances, pricing structure and billing rates. These records formed the core financial component of the analysis and allowed the study to evaluate engagement profitability, fee recovery performance and revenue concentration patterns.
The second source was the client profile records, which provided anonymised client attributes including client tier classification, industry grouping, tenure category and engagement relationship indicators. These variables were important in assessing whether commercially valuable engagements were associated with specific categories of clients or service relationships.
The third source was the staff utilisation and engagement delivery records, which provided operational variables such as budgeted hours, actual hours worked, billable hours, utilisation rates, realisation rates and engagement duration. These records allowed the study to evaluate operational efficiency alongside financial outcomes.
The integration of these three operational sources created a commercially meaningful dataset capable of linking financial performance, operational efficiency, client quality, delivery effort, and revenue outcomes within a single analytical framework.
2.2 Data Collection Method
Relevant engagement records were exported from the firm’s operational systems into spreadsheet format before being consolidated into a single analysis-ready dataset.
The extraction process used Engagement_ID as the primary engagement-level identifier and Client_ID as the anonymised client reference field. To maintain confidentiality, all client names and identifiable business information were removed prior to analysis and replaced with coded identifiers such as CLT_001, CLT_002 and CLT_003.
The consolidated dataset was then cleaned and validated within RStudio. Duplicate records, incomplete administrative entries and engagements lacking core financial information were excluded. Missing values were assessed and handled during pre-processing to ensure compatibility with downstream modelling techniques.
Several commercially important variables were engineered during pre-processing, including the High_Value_Engagement variable. This was constructed as a binary commercial classification target representing engagements that exceeded internally defined thresholds for fee size, collection quality, operational efficiency and commercial contribution.
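As a minimal sketch of this kind of feature engineering, the block below derives the performance ratios from raw amounts and hours. The column names follow those used later in the report, but the formulas are assumptions (the study does not state its exact definitions) and the three engagement records are invented for illustration.

```r
# Hypothetical engagement records; every value here is invented.
eng <- data.frame(
  engagement_id        = c("ENG-A", "ENG-B", "ENG-C"),
  agreed_fee_ngn000    = c(8000, 3000, 5000),
  invoiced_amt_ngn000  = c(7600, 3000, 4000),
  collected_amt_ngn000 = c(3800, 2850, 4000),
  actual_hours         = c(400, 120, 250),
  billable_hours       = c(300, 108, 200)
)

# ASSUMED definitions (not confirmed by the study):
# realisation = invoiced / agreed fee, collection = collected / invoiced,
# outstanding = invoiced - collected, utilisation = billable / actual hours.
eng$realisation_percent     <- 100 * eng$invoiced_amt_ngn000 / eng$agreed_fee_ngn000
eng$collection_rate_percent <- 100 * eng$collected_amt_ngn000 / eng$invoiced_amt_ngn000
eng$outstanding_ngn000      <- eng$invoiced_amt_ngn000 - eng$collected_amt_ngn000
eng$utilisation_percent     <- 100 * eng$billable_hours / eng$actual_hours

print(eng[, c("engagement_id", "realisation_percent",
              "collection_rate_percent", "outstanding_ngn000",
              "utilisation_percent")])
```

Under these assumed definitions, the largest engagement by fee (ENG-A) has the weakest collection rate, which is the kind of contrast the ratios are meant to surface.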
To improve modelling stability and analytical completeness, limited synthetic augmentation was applied to selected observations while preserving the operational structure and statistical behaviour of the original engagement records.
2.3 Sampling Frame
The sampling frame consisted of completed or substantially completed engagements recorded within Stransact’s operational systems during the selected study period.
The population included engagements across multiple service lines, including Tax, Audit, Payroll, IT Consulting and Advisory services.
The sampling frame excluded cancelled engagements, duplicate records, non-billable internal work, and engagements missing core financial variables.
The unit of analysis was one client engagement, or one engagement-month where an engagement extended across multiple reporting periods.
This structure was appropriate because operational and financial performance within professional services firms is typically monitored at engagement level rather than transaction level.
2.4 Sample Size
    tibble(
      Dimension = c("Total engagements", "Unique clients", "Variables",
                    "High Value (class 1)", "Standard (class 0)",
                    "Monthly periods", "Date range"),
      Value = c("200", "85 (CLT-001 to CLT-110)", "28", "98 (49%)", "102 (51%)",
                "24 consecutive months", "January 2024 to December 2025")
    ) |>
      kable(col.names = c("Dimension", "Value")) |>
      kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) |>
      column_spec(1, bold = TRUE, width = "6cm") |>
      column_spec(2, width = "8cm")
Dataset Overview

| Dimension | Value |
|---|---|
| Total engagements | 200 |
| Unique clients | 85 (CLT-001 to CLT-110) |
| Variables | 28 |
| High Value (class 1) | 98 (49%) |
| Standard (class 0) | 102 (51%) |
| Monthly periods | 24 consecutive months |
| Date range | January 2024 to December 2025 |
The final dataset consisted of 200 anonymised engagement records. This sample size was considered appropriate because it was large enough to support segmentation, classification and clustering analysis while remaining operationally manageable for detailed validation and interpretation.
It also provided enough variation across client tiers, engagement sizes, service lines, billing structures and operational delivery profiles to allow meaningful comparative analysis.
2.5 Time Period Covered
The dataset covered engagements occurring between January 2024 and December 2025, providing a 24-month operational observation window. This period was appropriate for several reasons:
it captured recurring annual compliance cycles common within professional services work;
it included both high-activity and low-activity billing periods, exposing seasonal behaviour and collection fluctuations; and
it provided sufficient monthly observations to support the forecasting and time-series requirements of the study.
2.6 Ethical Considerations and Confidentiality
The study was conducted in accordance with confidentiality, responsible data use and data minimisation principles.
No client names, tax identification numbers, contact information, advisory memoranda or commercially sensitive narratives were included in the analytical dataset. All records were anonymised prior to analysis, and only aggregated findings, visualisations and model outputs are presented within the final report.
A formal confidentiality statement for the project is presented below:
The dataset used for this study was extracted from internal engagement, billing and utilisation records of Stransact Services Limited strictly for academic and analytical purposes. All client identifiers were anonymised before analysis. Client names, contact details, tax identifiers, confidential advisory content and commercially sensitive narratives were excluded from the final dataset. Each client was represented using a coded Client_ID, and all analysis was performed at aggregated engagement level. The underlying operational data is not publicly available due to confidentiality restrictions and may only be reviewed by authorised academic assessors where necessary.
3 Dataset Description and Exploratory Data Analysis
3.1 Dataset Structure
The final analytical dataset contained operational, financial and engagement-delivery variables designed to explain commercially valuable client work.
The variables collectively captured financial scale, collections quality, operational efficiency, staffing effort, pricing discipline, and engagement outcomes.
3.2 Variable Names, Types and Operational Meaning
    library(tibble)
    library(knitr)
    library(kableExtra)

    variable_tbl <- tribble(
      ~`Variable Name`, ~Type, ~Description, ~`Operational Relevance`,
      "Engagement_ID", "Character / Identifier", "Unique reference number for each engagement", "Distinguishes one engagement from another and supports data traceability",
      "Client_ID", "Character / Identifier", "Anonymised client reference", "Allows client-level analysis without disclosing client names",
      "Industry", "Categorical", "Sector in which the client operates", "Helps identify industries associated with stronger revenue or margins",
      "Client_Size", "Categorical", "Size band of the client, such as small, medium or large", "Supports comparison of commercial value across client categories",
      "Client_Tenure", "Numeric", "Length of client relationship, usually measured in months or years", "Indicates whether longer relationships produce stronger repeat work or profitability",
      "Service_Type", "Categorical", "Main service line, such as tax, advisory, compliance, bookkeeping, audit support or consulting", "Helps determine which service lines contribute most to firm value",
      "Sub_Service", "Categorical", "More detailed service category under the main service type", "Provides more granular insight into specific offerings",
      "Pricing_Model", "Categorical", "Basis of pricing, such as fixed fee, hourly, retainer or blended pricing", "Supports pricing discipline and margin analysis",
      "Revenue", "Numeric", "Fee income generated from the engagement", "Measures commercial value and supports classification of high-value engagements",
      "Cost", "Numeric", "Direct cost or estimated delivery cost of the engagement", "Allows profitability to be assessed beyond revenue alone",
      "Profit", "Numeric", "Revenue less cost", "Measures absolute financial contribution",
      "Profit_Margin", "Numeric", "Profit divided by revenue", "Measures efficiency and quality of earnings",
      "Duration_Months", "Numeric", "Length of the engagement in months", "Helps assess whether longer engagements produce better or weaker commercial outcomes",
      "Team_Size", "Numeric", "Number of staff involved in delivering the engagement", "Supports analysis of resource deployment",
      "Hours_Billed", "Numeric", "Total hours charged or recorded on the engagement", "Measures effort intensity and delivery efficiency",
      "Client_Retention", "Categorical / Binary", "Indicates whether the client was retained", "Supports analysis of client relationship strength",
      "Repeat_Engagement", "Categorical / Binary", "Indicates whether the client gave repeat work", "Captures recurring commercial value",
      "Engagement_Month", "Date", "Month in which the engagement was recorded or billed", "Supports time series analysis and revenue trend review"
    )

    variable_tbl |>
      knitr::kable() |>
      kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Dataset Variable Definitions and Operational Relevance

| Variable Name | Type | Description | Operational Relevance |
|---|---|---|---|
| Engagement_ID | Character / Identifier | Unique reference number for each engagement | Distinguishes one engagement from another and supports data traceability |
| Client_ID | Character / Identifier | Anonymised client reference | Allows client-level analysis without disclosing client names |
| Industry | Categorical | Sector in which the client operates | Helps identify industries associated with stronger revenue or margins |
| Client_Size | Categorical | Size band of the client, such as small, medium or large | Supports comparison of commercial value across client categories |
| Client_Tenure | Numeric | Length of client relationship, usually measured in months or years | Indicates whether longer relationships produce stronger repeat work or profitability |
| Service_Type | Categorical | Main service line, such as tax, advisory, compliance, bookkeeping, audit support or consulting | Helps determine which service lines contribute most to firm value |
| Sub_Service | Categorical | More detailed service category under the main service type | Provides more granular insight into specific offerings |
| Pricing_Model | Categorical | Basis of pricing, such as fixed fee, hourly, retainer or blended pricing | Supports pricing discipline and margin analysis |
| Revenue | Numeric | Fee income generated from the engagement | Measures commercial value and supports classification of high-value engagements |
| Cost | Numeric | Direct cost or estimated delivery cost of the engagement | Allows profitability to be assessed beyond revenue alone |
| Profit | Numeric | Revenue less cost | Measures absolute financial contribution |
| Profit_Margin | Numeric | Profit divided by revenue | Measures efficiency and quality of earnings |
| Duration_Months | Numeric | Length of the engagement in months | Helps assess whether longer engagements produce better or weaker commercial outcomes |
| Team_Size | Numeric | Number of staff involved in delivering the engagement | Supports analysis of resource deployment |
| Hours_Billed | Numeric | Total hours charged or recorded on the engagement | Measures effort intensity and delivery efficiency |
| Client_Retention | Categorical / Binary | Indicates whether the client was retained | Supports analysis of client relationship strength |
| Repeat_Engagement | Categorical / Binary | Indicates whether the client gave repeat work | Captures recurring commercial value |
| Engagement_Month | Date | Month in which the engagement was recorded or billed | Supports time series analysis and revenue trend review |
3.3 Target Variable Construction
    tibble(
      Factor = c("Fee size", "Realisation %", "Collection rate", "Utilisation %",
                 "Client tier", "Client size", "Service line", "THRESHOLD"),
      Variable = c("agreed_fee_ngn000", "realisation_percent", "collection_rate_percent",
                   "utilisation_percent", "client_tier", "client_size", "service_line", "-"),
      Max_Points = c(35, 20, 20, 10, 8, 5, 2, 100),
      Scoring_Rule = c("(fee / 15000) x 35",
                       "min(realisation/100, 1) x 20",
                       "min(collection/100, 1) x 20",
                       "min(utilisation/100, 1) x 10",
                       "Gold=8, Silver=5, Bronze=2",
                       "Large Ent=5, Mid-Market=3, SME=1",
                       "IT Consulting=2, Audit/Tax=1, Payroll=0",
                       "Score >= 59 → High Value (1)")
    ) |>
      kable(col.names = c("Factor", "Variable", "Max Points", "Scoring Rule")) |>
      kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) |>
      row_spec(8, bold = TRUE, background = "#1F3864", color = "white") |>
      column_spec(1, bold = TRUE)
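The scoring rule listed in the code above can be applied to individual engagements roughly as follows. This is an illustrative sketch rather than the study's own implementation: the function name `hv_score` is hypothetical, the fee component is assumed to be capped at its 35-point maximum, "Audit/Tax=1" is read as one point for either service line, and the two example engagements are invented.

```r
# Hypothetical helper applying the published point weights.
hv_score <- function(fee, realisation, collection, utilisation,
                     tier, size, service) {
  tier_pts    <- c(Gold = 8, Silver = 5, Bronze = 2)
  size_pts    <- c("Large Ent" = 5, "Mid-Market" = 3, "SME" = 1)
  service_pts <- c("IT Consulting" = 2, "Audit" = 1, "Tax" = 1, "Payroll" = 0)

  # ASSUMPTION: fee points are capped at 35, mirroring the min() caps
  # that the table states explicitly for the three ratio components.
  fee_pts <- pmin(fee / 15000, 1) * 35

  unname(
    fee_pts +
      pmin(realisation / 100, 1) * 20 +
      pmin(collection / 100, 1) * 20 +
      pmin(utilisation / 100, 1) * 10 +
      tier_pts[tier] + size_pts[size] + service_pts[service]
  )
}

# Two invented engagements: one strong on every factor, one weak.
score <- hv_score(
  fee         = c(15000, 3000),
  realisation = c(100, 80),
  collection  = c(100, 50),
  utilisation = c(100, 60),
  tier        = c("Gold", "Bronze"),
  size        = c("Large Ent", "SME"),
  service     = c("IT Consulting", "Payroll")
)

# Threshold from the table: a score of 59 or more is High Value (1).
high_value <- as.integer(score >= 59)
print(data.frame(score = score, high_value = high_value))
```

Because fee size carries 35 of the 100 available points, a small engagement needs near-perfect realisation, collection and utilisation to reach the threshold under this rule.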
Distribution of Key Financial Variables (NGN ’000)
3.7 Performance Ratio Distributions
    data |>
      select(utilisation_percent, realisation_percent, collection_rate_percent) |>
      pivot_longer(everything(), names_to = "Ratio", values_to = "Value") |>
      mutate(Ratio = case_when(
        Ratio == "utilisation_percent" ~ "Utilisation %",
        Ratio == "realisation_percent" ~ "Realisation %",
        Ratio == "collection_rate_percent" ~ "Collection Rate %"
      )) |>
      ggplot(aes(x = Value, fill = Ratio)) +
      geom_histogram(bins = 20, colour = "white", alpha = 0.85) +
      facet_wrap(~ Ratio, scales = "free") +
      scale_fill_manual(values = c("Utilisation %" = "#5B2D8E",
                                   "Realisation %" = "#7A4A00",
                                   "Collection Rate %" = "#7A0000")) +
      labs(title = "Utilisation, Realisation and Collection Rate Distributions",
           x = "Rate (%)", y = "Engagements") +
      theme_minimal(base_size = 12) +
      theme(legend.position = "none")
Distribution of Key Performance Ratios (%)
3.8 Target Class Distribution
    data |>
      mutate(hv = as.character(high_value_engagement)) |>
      count(hv) |>
      mutate(
        Label = ifelse(hv == "HighValue", "High Value (1)", "Standard (0)"),
        Pct = round(n / sum(n) * 100, 1)
      ) |>
      ggplot(aes(x = Label, y = n, fill = Label)) +
      geom_col(width = 0.5, show.legend = FALSE) +
      geom_text(aes(label = paste0(n, " (", Pct, "%)")),
                vjust = -0.4, size = 4.5, fontface = "bold") +
      scale_fill_manual(values = c("High Value (1)" = "#145214",
                                   "Standard (0)" = "#D4813A")) +
      labs(title = "Class Distribution: High_Value_Engagement",
           x = NULL, y = "Count") +
      theme_minimal(base_size = 12) +
      ylim(0, 130)
Class Distribution — High_Value_Engagement
3.9 Fee by Service Line and Client Tier
    data |>
      ggplot(aes(x = service_line, y = agreed_fee_ngn000, fill = client_tier)) +
      geom_boxplot(alpha = 0.8, outlier.size = 1.5) +
      scale_y_continuous(labels = label_comma()) +
      scale_fill_manual(values = c("Bronze" = "#CD7F32",
                                   "Silver" = "#A8A9AD",
                                   "Gold" = "#FFD700")) +
      labs(title = "Agreed Fee by Service Line and Client Tier",
           x = "Service Line", y = "Agreed Fee (NGN '000)",
           fill = "Client Tier") +
      theme_minimal(base_size = 12) +
      coord_flip()
Agreed Fee Distribution by Service Line and Client Tier
3.10 Exploratory Data Analysis
The exploratory analysis identified several commercially important patterns.
First, engagement revenue was highly concentrated, with a relatively small number of large engagements accounting for a disproportionate share of total commercial value.
Second, collections performance varied substantially across engagements, suggesting that headline fee size alone was an incomplete measure of commercial quality.
Third, utilisation and billing efficiency differed materially across service lines, indicating that operational effort and financial return were not perfectly aligned.
The EDA also revealed evidence of revenue volatility, concentration risk, uneven collection behaviour and operational heterogeneity across engagement groups.
These findings provided the foundation for the predictive, segmentation and forecasting analyses developed in later sections.
4 Classification Model
4.1 Theory Recap
Classification modelling is used to predict whether an observation belongs to a predefined category. The target variable High_Value_Engagement classified engagements as either High Value or Standard.
Two supervised learning architectures were evaluated: Logistic Regression and Random Forest.
Logistic Regression was selected as an interpretable baseline model, while Random Forest was used to capture non-linear interactions and more complex commercial relationships within the engagement portfolio.
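For reference, the two model forms can be written compactly. The notation is introduced here for illustration and is not taken from the study: x is the vector of engagement features, beta the fitted coefficients, and T_b the b-th of B trees in the forest.

```latex
% Logistic regression: probability that an engagement is High Value
P(y = 1 \mid \mathbf{x})
  = \frac{1}{1 + \exp\!\bigl(-(\beta_0 + \boldsymbol{\beta}^{\top}\mathbf{x})\bigr)}

% Random forest (one common form): share of trees voting High Value
\hat{p}(\mathbf{x}) = \frac{1}{B} \sum_{b=1}^{B} T_b(\mathbf{x}),
  \qquad T_b(\mathbf{x}) \in \{0, 1\}
```

The logistic form is linear in the log-odds, which is why its coefficients are directly interpretable, whereas the forest aggregates many tree-level splits and can therefore represent interactions between features.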
4.2 Business Justification
The classification framework was developed to support engagement-level commercial decision-making.
The objective was not merely to predict outcomes retrospectively, but to provide management with a forward-looking mechanism for identifying commercially attractive engagements before delivery resources are committed.
A reliable classification model allows the firm to:
The central business question addressed in this section was: Which operational and financial characteristics distinguish commercially valuable engagements from standard work?
4.3 Classification Pipeline
The analytical pipeline included train-test splitting, preprocessing, model training, model evaluation, ROC analysis, confusion matrix evaluation and cross-validation.
4.4 Output
4.4.1 Classification Setup
    # Remove any remaining NAs before splitting
    df_model <- data |>
      drop_na() |>
      select(-engagement_id, -client_id, -engagement_start,
             -engagement_end, -hv_score)

    set.seed(42)
    split <- initial_split(df_model, prop = 0.80,
                           strata = high_value_engagement)
    df_train <- training(split)
    df_test <- testing(split)

    cat("Training:", nrow(df_train), "| Test:", nrow(df_test))
Training: 159 | Test: 41
    cat("\nNAs in training set:", sum(is.na(df_train)))
NAs in training set: 0
    rec <- recipe(high_value_engagement ~ ., data = df_train) |>
      step_impute_median(all_numeric_predictors()) |>
      step_impute_mode(all_nominal_predictors()) |>
      step_novel(all_nominal_predictors()) |>
      step_dummy(all_nominal_predictors(), one_hot = FALSE) |>
      step_zv(all_predictors()) |>
      step_normalize(all_numeric_predictors())

    # Confirm recipe bakes without errors
    rec |>
      prep() |>
      bake(new_data = NULL) |>
      anyNA() |>
      (\(x) cat("NAs after recipe:", x, "\n"))()
The Random Forest architecture produced the strongest overall performance: 95% accuracy, an AUC-ROC of 0.970, and a strong balance between precision and recall.
Although the numerical improvement relative to Logistic Regression appeared modest, the commercial implications justified the additional model complexity.
In a professional services environment, misclassification carries direct operational consequences: incorrectly prioritising weak engagements consumes scarce partner capacity, while failing to identify strategically valuable engagements weakens revenue concentration within the firm’s highest-performing portfolio segment.
The Random Forest model is therefore recommended for deployment because it captures non-linear operational relationships, improves classification reliability, and supports more defensible engagement prioritisation.
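To make the evaluation metrics concrete, the sketch below computes a confusion matrix, accuracy, precision, recall and AUC in base R. The labels and predicted probabilities are invented for illustration and are not the study's test-set results; the 0.5 decision threshold is likewise an assumption.

```r
# Hypothetical test-set labels (1 = High Value) and predicted probabilities.
actual <- c(1, 1, 1, 1, 0, 0, 0, 0, 1, 0)
prob   <- c(0.92, 0.81, 0.67, 0.35, 0.40, 0.22, 0.10, 0.58, 0.77, 0.05)

# ASSUMED decision threshold of 0.5.
pred <- as.integer(prob >= 0.5)

# Confusion matrix cells
tp <- sum(pred == 1 & actual == 1)  # correctly flagged High Value
fp <- sum(pred == 1 & actual == 0)  # standard work flagged as High Value
fn <- sum(pred == 0 & actual == 1)  # High Value work that was missed
tn <- sum(pred == 0 & actual == 0)  # correctly flagged Standard

accuracy  <- (tp + tn) / length(actual)
precision <- tp / (tp + fp)
recall    <- tp / (tp + fn)

# AUC via the rank (Mann-Whitney) formulation: the share of
# (High Value, Standard) pairs in which the High Value case scores higher.
r   <- rank(prob)
n1  <- sum(actual == 1)
n0  <- sum(actual == 0)
auc <- (sum(r[actual == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)

print(table(Predicted = pred, Actual = actual))
print(round(c(accuracy = accuracy, precision = precision,
              recall = recall, auc = auc), 3))
```

In commercial terms, a false positive corresponds to partner time committed to weak work, while a false negative corresponds to a valuable engagement that was not prioritised.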
5 Model Evaluation & Explainability
5.1 Business Objective
While classification modelling determines whether engagements are likely to be commercially valuable, explainability analysis identifies why.
This distinction is strategically important.
Without explainability, predictive modelling functions as a black-box exercise. With explainability, the model becomes transparent, commercially interpretable and operationally actionable.
5.2 SHAP Analysis
SHAP analysis was used to decompose model predictions into feature-level contributions.
The SHAP framework identifies which variables push an engagement toward High Value, which variables weaken commercial quality, and the relative contribution of each feature.
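The decomposition behind SHAP is the Shapley value from cooperative game theory. In the notation below (introduced here for illustration), F is the full feature set, S a subset that excludes feature j, and v(S) the expected model output when only the features in S are known.

```latex
% Shapley value of feature j for a single engagement
\phi_j = \sum_{S \subseteq F \setminus \{j\}}
  \frac{|S|!\,\bigl(|F| - |S| - 1\bigr)!}{|F|!}
  \,\bigl[\, v(S \cup \{j\}) - v(S) \,\bigr]

% Additivity: contributions sum to the prediction for that engagement
f(\mathbf{x}) = \phi_0 + \sum_{j \in F} \phi_j,
  \qquad \phi_0 = v(\varnothing)
```

The additivity property is what allows a single engagement's prediction to be presented as a baseline plus a series of feature-level pushes up or down.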
5.3 Business Justification
Evaluation establishes whether the model is reliable enough to inform real decisions. Explainability converts the model’s logic into specific operational actions: tighten billing controls, prioritise Gold-tier clients, review scope management, target collection follow-up. Without explainability, a model produces a number; with it, the model produces a strategy.
Figure 1: Local Prediction Breakdown — Single Engagement (ENG-0001)
    tibble(
      Feature = c("collected_amt_ngn000", "agreed_fee_ngn000", "invoiced_amt_ngn000",
                  "collection_rate_percent", "avg_billing_rate_ngn",
                  "outstanding_ngn000", "client_tier", "realisation_percent"),
      Importance = c(0.173, 0.158, 0.125, 0.091, 0.083, 0.052, 0.041, 0.040),
      Direction = c("Higher = more likely HV", "Higher = more likely HV",
                    "Higher = more likely HV", "Higher = more likely HV",
                    "Higher = more likely HV", "Higher = less likely HV",
                    "Gold > Silver > Bronze", "Higher = more likely HV"),
      Signal = c("Cash recovery is the primary commercial differentiator",
                 "Larger engagements are structurally more likely to qualify",
                 "Confirms billing follow-through matters commercially",
                 "Strong independent signal beyond raw fee size",
                 "Premium billing rates, especially IT Consulting, lift HV probability",
                 "Large unpaid balances are a HV disqualifier",
                 "Tier matters, but only when it translates to clean financials",
                 "Invoicing close to agreed fee signals commercial discipline")
    ) |>
      kable(col.names = c("Feature", "Importance", "Direction", "Management Signal")) |>
      kable_styling(bootstrap_options = c("striped", "hover"),
                    full_width = TRUE, font_size = 12) |>
      column_spec(1, monospace = TRUE, bold = TRUE)
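Top 8 Features: Importance and Management Signal

| Feature | Importance | Direction | Management Signal |
|---|---|---|---|
| collected_amt_ngn000 | 0.173 | Higher = more likely HV | Cash recovery is the primary commercial differentiator |
| agreed_fee_ngn000 | 0.158 | Higher = more likely HV | Larger engagements are structurally more likely to qualify |
| invoiced_amt_ngn000 | 0.125 | Higher = more likely HV | Confirms billing follow-through matters commercially |
| collection_rate_percent | 0.091 | Higher = more likely HV | Strong independent signal beyond raw fee size |
| avg_billing_rate_ngn | 0.083 | Higher = more likely HV | Premium billing rates, especially IT Consulting, lift HV probability |
| outstanding_ngn000 | 0.052 | Higher = less likely HV | Large unpaid balances are a HV disqualifier |
| client_tier | 0.041 | Gold > Silver > Bronze | Tier matters, but only when it translates to clean financials |
| realisation_percent | 0.040 | Higher = more likely HV | Invoicing close to agreed fee signals commercial discipline |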
5.5 Interpretation for Management
The explainability analysis revealed that revenue quality matters more than revenue size alone.
The strongest predictor of High Value classification was not simply the agreed fee, but the amount successfully collected from the client.
This finding has major commercial implications.
An engagement billed at ₦8m with weak collection performance may be commercially inferior to a ₦3m engagement with strong cash conversion and efficient delivery.
For a non-technical board presentation, the most effective SHAP output would be the waterfall plot for one representative engagement.
The waterfall plot answers a practical business question: Why did the model classify this engagement as commercially valuable?
This makes the model transparent and commercially defensible for partner review, pricing decisions and engagement approval.
6 Customer and Engagement Segmentation (Clustering)
6.1 Theory Recap
Not all engagements should be managed identically. The objective of clustering analysis was to identify naturally occurring commercial segments within the engagement portfolio. Rather than imposing predefined categories, K-Means clustering allowed the data itself to reveal those segments.
The 200 engagements span four service lines, three client tiers, six industries and a wide fee range.
Clustering lets the data reveal naturally occurring groups without the analyst imposing assumptions. The output is a commercially grounded segmentation framework that partners can use to tailor service delivery, pricing strategy and client attention by segment rather than by individual client intuition.
6.3 Output
    cluster_vars <- c("agreed_fee_ngn000", "realisation_percent",
                      "collection_rate_percent", "utilisation_percent",
                      "avg_billing_rate_ngn", "staff_count")

    data_cluster <- data |>
      select(all_of(cluster_vars))

    df_scaled <- scale(data_cluster)

    set.seed(42)
    fviz_nbclust(df_scaled, kmeans, method = "wss",
                 k.max = 8, linecolor = "#7A4A00") +
      labs(title = "Elbow Method – Optimal k Selection",
           x = "Number of Clusters (k)",
           y = "Total Within-Cluster Sum of Squares") +
      theme_minimal(base_size = 12)
Figure 2: Elbow Plot — Within-Cluster Sum of Squares by k
Classification AUC: With vs Without Cluster Feature

| Model | AUC | Improvement |
|---|---|---|
| RF without cluster feature | 0.026 | - |
| RF with cluster feature | 0.038 | +1.19 pts |
6.3.1 Cluster Interpretation
The clustering analysis revealed substantial commercial heterogeneity hidden by firm-level averages.
Aggregate statistics suggested moderate collection performance across the portfolio. However, clustering revealed that weak collection behaviour was highly concentrated within Cluster C. This distinction materially changes the management response.
Rather than implementing firm-wide interventions, management can target specific engagement groups, operational behaviours, and client categories.
Cluster A represented the firm’s strategically valuable portfolio segment. These engagements displayed strong fees, high billing rates, strong collections, and premium client relationships.
Cluster C represented the greatest commercial risk.
Cluster membership was subsequently encoded as a predictive feature within the classification model.
This approach allowed the supervised model to benefit from the structural patterns identified through unsupervised segmentation.
Operationally, this improved the model’s ability to recognise engagement archetypes, commercial behaviour patterns, and portfolio structures.
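A minimal base R sketch of this hand-off from segmentation to classification is shown below. It uses synthetic data with two of the clustering variables rather than the study's records, and the choice of k = 3 is an assumption made for the illustration.

```r
set.seed(42)

# Synthetic stand-in for the engagement data (NOT the study data):
# three loosely separated groups on fee size and collection rate.
n <- 60
toy <- data.frame(
  agreed_fee_ngn000 = c(rnorm(n / 3, 12000, 1000),
                        rnorm(n / 3, 5000, 800),
                        rnorm(n / 3, 2000, 400)),
  collection_rate_percent = c(rnorm(n / 3, 95, 2),
                              rnorm(n / 3, 85, 4),
                              rnorm(n / 3, 55, 6))
)

# K-Means is distance-based, so variables are standardised first.
toy_scaled <- scale(toy)

# Elbow curve: total within-cluster sum of squares for k = 1..6.
wss <- sapply(1:6, function(k) {
  kmeans(toy_scaled, centers = k, nstart = 25)$tot.withinss
})

# ASSUMPTION: k = 3 is chosen for this illustration.
km <- kmeans(toy_scaled, centers = 3, nstart = 25)

# Encode cluster membership as a categorical predictor that a
# downstream classifier can use alongside the original variables.
toy$cluster <- factor(km$cluster)

print(round(wss, 1))
print(table(toy$cluster))
```

One design point to keep in mind: if the cluster feature is used in a model that is evaluated on held-out data, the clustering should be fitted on the training portion only, so that test-set information does not leak into the feature.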
7 Dimensionality Reduction (PCA)
7.1 Theory Recap
Principal Component Analysis (PCA) rotates the original feature space into a new set of orthogonal axes, the Principal Components (PCs), ordered by variance explained. The first PC captures the largest variance; each subsequent PC captures the largest remaining variance while remaining uncorrelated with all previous components.
PCA is particularly valuable when features are correlated, which is structurally true here because Agreed_Fee, Invoiced_Amt and Collected_Amt all measure related aspects of the same transaction. PCA collapses these correlated signals into uncorrelated dimensions, reducing redundancy before clustering or visualisation. A scree plot shows variance explained per component; a common convention is to retain enough components to explain at least 80% of total variance cumulatively.
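The variance-explained logic can be sketched in base R with `prcomp`. The data below are synthetic and the shortened column names are illustrative; only the 80% retention rule comes from the text above.

```r
set.seed(42)

# Synthetic data (NOT the study data): three fee-related columns that are
# correlated by construction, plus two largely independent columns.
n   <- 100
fee <- rnorm(n, 6000, 2000)
toy <- data.frame(
  agreed_fee    = fee,
  invoiced_amt  = fee * runif(n, 0.85, 1.00),
  collected_amt = fee * runif(n, 0.60, 1.00),
  utilisation   = rnorm(n, 75, 8),
  staff_count   = rpois(n, 4) + 1
)

# Centre and scale so that no variable dominates through its units.
pca <- prcomp(toy, center = TRUE, scale. = TRUE)

# Share of total variance per component, and the running total.
var_share <- pca$sdev^2 / sum(pca$sdev^2)
cum_share <- cumsum(var_share)

# Smallest number of components reaching the 80% convention.
k_80 <- which(cum_share >= 0.80)[1]

print(round(100 * cbind(var_share, cum_share), 1))
print(k_80)
```

Because the three fee-related columns move together, most of their variance loads onto a single component, which is the redundancy-reduction effect described above.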
7.2 Business Justification
With 12 numeric variables across financial, utilisation and staffing dimensions, direct interpretation is unwieldy. PCA extracts the underlying commercial structure and summarises it in two or three interpretable dimensions. This serves two purposes: it validates the clustering segmentation visually by showing where clusters sit in the reduced space; and it produces a 2D portfolio map that partners can use in strategic reviews without needing statistical expertise.
7.3 Output
    pca_vars <- c("agreed_fee_ngn000", "invoiced_amt_ngn000", "collected_amt_ngn000",
                  "outstanding_ngn000", "realisation_percent", "collection_rate_percent",
                  "avg_billing_rate_ngn", "utilisation_percent", "budgeted_hours",
                  "actual_hours", "billable_hours", "staff_count")

    pca_rec <- recipe(~ ., data = data |> select(all_of(pca_vars))) |>
      step_normalize(all_numeric_predictors()) |>
      step_pca(all_numeric_predictors(), num_comp = 6)

    pca_prep <- prep(pca_rec)
    pca_scores <- bake(pca_prep, new_data = NULL)

    # Variance table
    tidy(pca_prep, number = 2, type = "variance") |>
      filter(terms == "percent variance") |>
      select(component, value) |>
      mutate(value = round(value, 1),
             cumulative = round(cumsum(value), 1)) |>
      kable(col.names = c("Component", "% Variance", "Cumulative %")) |>
      kable_styling(full_width = FALSE)
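| Component | % Variance | Cumulative % |
|---|---|---|
| 1 | 36.4 | 36.4 |
| 2 | 23.5 | 59.9 |
| 3 | 11.4 | 71.3 |
| 4 | 8.8 | 80.1 |
| 5 | 8.3 | 88.4 |
| 6 | 7.4 | 95.8 |
| 7 | 2.1 | 97.9 |
| 8 | 1.2 | 99.1 |
| 9 | 0.9 | 100.0 |
| 10 | 0.0 | 100.0 |
| 11 | 0.0 | 100.0 |
| 12 | 0.0 | 100.0 |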
    tidy(pca_prep, number = 2, type = "variance") |>
      filter(terms == "percent variance") |>
      ggplot(aes(x = component, y = value)) +
      geom_col(fill = "#145214", alpha = 0.8) +
      geom_line(aes(group = 1), colour = "#0D5E5E", linewidth = 0.8) +
      geom_point(colour = "#0D5E5E", size = 3) +
      labs(title = "Scree Plot: Principal Component Variance",
           x = "Principal Component", y = "% Variance Explained") +
      theme_minimal(base_size = 12)
Figure 5: Scree Plot — Variance Explained by Principal Component
Show Analysis Code
# Variable loadings on the first two principal components
tidy(pca_prep, number = 2) |>
  filter(component %in% c("PC1", "PC2")) |>
  ggplot(aes(x = value, y = reorder(terms, abs(value)), fill = component)) +
  geom_col(show.legend = FALSE, alpha = 0.85) +
  facet_wrap(~ component, scales = "free_x") +
  scale_fill_manual(values = c("PC1" = "#145214", "PC2" = "#7A4A00")) +
  labs(
    title = "Variable Loadings — PC1 and PC2",
    x = "Loading",
    y = NULL
  ) +
  theme_minimal(base_size = 12)
Figure 7: PCA Biplot — Engagements Coloured by HV Status and Cluster
7.3.1 Management Interpretation
Despite 12 input variables, four underlying commercial dimensions capture just over 80% of the variation between engagements (80.1% cumulatively). The two strongest are:
PC1, Financial Scale (~36%): The single strongest pattern is how large the engagement is financially. Fee, invoiced amount and collected amount load heavily here. Engagements positioned to the right on PC1 are the firm’s largest revenue relationships.
PC2, Delivery Volume (~24%): Independent of financial size, the second pattern is how many hours were committed. A high-fee engagement can be lean on hours (IT advisory) or hour-intensive (payroll). PC2 separates these, a commercially important distinction because hour-heavy, low-fee work carries different capacity and margin implications.
The biplot shows High Value engagements clustering to the right of the PC1 axis, confirming that financial scale is the dominant structural separator between the two classes. The clusters are well separated in 2D space, which validates the K-Means segmentation from the clustering analysis.
8 Time Series Analysis
8.1 Theory Recap
Time series analysis decomposes a sequential record into three components: trend (the long-run direction), seasonality (regular periodic fluctuations), and residual (irregular variation after trend and seasonality are removed). The Augmented Dickey-Fuller (ADF) test assesses stationarity, meaning a constant mean and variance over time, which is a prerequisite for ARIMA modelling.
ACF (Autocorrelation Function) and PACF (Partial Autocorrelation Function) plots reveal the lag structure of the series and guide ARIMA parameter selection. Holt-Winters Exponential Smoothing extends simple smoothing with trend and seasonal components; it is better suited to shorter series (24 monthly observations in this case) than ARIMA, which typically requires 36+ periods for reliable parameter estimation.
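The decomposition and differencing ideas above can be sketched in base R as follows. The series is synthetic (a hypothetical linear trend plus a 12-month cycle plus noise), not the firm's revenue data. A formal ADF test, for example `adf.test()` from the tseries package, is deliberately not run here; the sketch only shows that first differencing removes most of a linear trend.

```r
set.seed(1)
months <- 1:24

# Hypothetical monthly fee series: linear trend + 12-month cycle + noise
fee <- 1000 + 20 * months + 150 * sin(2 * pi * months / 12) + rnorm(24, sd = 30)
fee_ts <- ts(fee, start = c(2024, 1), frequency = 12)

# Classical additive decomposition into trend, seasonal and random parts
parts <- decompose(fee_ts, type = "additive")
print(round(parts$figure, 1))   # the 12 estimated seasonal effects

# First differencing: each value becomes the change from the previous month
fee_diff <- diff(fee_ts)

# Compare the fitted time slope before and after differencing
slope_level <- unname(coef(lm(fee ~ months))[2])
slope_diff  <- unname(coef(lm(as.numeric(fee_diff) ~ months[-1]))[2])

cat("Slope of the level series:      ", round(slope_level, 2), "\n")
cat("Slope of the differenced series:", round(slope_diff, 2), "\n")

# ACF and PACF of the differenced series, computed without plotting
acf_vals  <- acf(fee_diff, plot = FALSE)
pacf_vals <- pacf(fee_diff, plot = FALSE)
```

With only 24 observations, `decompose()` has exactly two full seasonal cycles to work with, so the seasonal estimates in a sketch like this should be read as indicative only.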
8.2 Business Justification
Revenue in a professional services firm is not uniformly distributed. Audit engagements cluster around year-end reporting deadlines; tax work peaks around filing cycles; payroll is stable year-round; IT consulting is project-driven and episodic. A time series model quantifies these patterns, distinguishing genuine growth trends from seasonal noise. The output — a forecast with prediction intervals — provides management with a defensible basis for staffing, capacity planning and budget-setting decisions.
8.3 Output
# Aggregate engagement records to monthly totals by engagement start month
monthly <- data |>
  mutate(Month = floor_date(as.Date(engagement_start), "month")) |>
  group_by(Month) |>
  summarise(
    n_engagements   = n(),
    agreed_fee      = sum(agreed_fee_ngn000),
    collected       = sum(collected_amt_ngn000),
    avg_realisation = mean(realisation_percent),
    avg_collection  = mean(collection_rate_percent)
  ) |>
  arrange(Month)

# Monthly series starting January 2024
fee_ts <- ts(monthly$agreed_fee, start = c(2024, 1), frequency = 12)
col_ts <- ts(monthly$collected, start = c(2024, 1), frequency = 12)

cat("Monthly periods:", length(fee_ts))
Figure 12: Monthly Collection Gap — Agreed Fee vs Collected
8.3.1 Management Interpretation
The firm’s monthly revenue is not growing: it fluctuates around a flat trend, with a consistent collection gap of approximately 25% every month. Both patterns are actionable.
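As a worked illustration of how a monthly collection gap of this kind can be computed, the sketch below uses three months of made-up totals. The figures and the column names `gap_ngn000` and `gap_percent` are hypothetical, and the definition used (agreed fee minus collected amount, as a share of agreed fee) is an assumption based on the Figure 12 caption.

```r
# Hypothetical monthly totals in NGN '000 (not the firm's data)
monthly_toy <- data.frame(
  month      = as.Date(c("2024-01-01", "2024-02-01", "2024-03-01")),
  agreed_fee = c(12000, 8000, 10000),
  collected  = c(9000, 6000, 7500)
)

# Absolute gap, and gap as a percentage of the agreed fee
monthly_toy$gap_ngn000  <- monthly_toy$agreed_fee - monthly_toy$collected
monthly_toy$gap_percent <- 100 * monthly_toy$gap_ngn000 / monthly_toy$agreed_fee

print(monthly_toy)
```

In this toy example every month collects three quarters of the agreed fee, so the gap is 25% in each row.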
The ADF test confirmed that the series is non-stationary at levels (p > 0.05): the mean is not constant over time, so a trend component is present. First differencing removed the trend and produced a stationary series (p < 0.05). The ACF and PACF plots on the differenced series show the lag structure that would guide ARIMA specification.
Stationarity matters for ARIMA because the model’s autoregressive and moving-average components assume that the statistical properties of the series (mean, variance and autocorrelation structure) do not change over time. A trending series breaks this assumption and produces spurious parameter estimates and unreliable forecasts.
Holt-Winters was selected as the forecasting model for this submission because the 24-period series is below the length at which ARIMA estimation is reliable, and because Holt-Winters handles trend directly without pre-differencing and is more stable with limited data. The ADF and ACF/PACF analyses are retained for methodological rigour.
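A Holt-Winters forecast with prediction intervals on a 24-month series can be sketched with `HoltWinters()` from base R's stats package. This is not the model fitted in this report: the series is synthetic, and the smoothing parameters (alpha, beta, gamma) are fixed by hand at illustrative values, whereas leaving them unset would let the function estimate them from the data.

```r
set.seed(7)
months <- 1:24

# Hypothetical monthly fee series: mild trend + 12-month cycle + noise
fee <- 1000 + 5 * months + 150 * sin(2 * pi * months / 12) + rnorm(24, sd = 30)
fee_ts <- ts(fee, start = c(2024, 1), frequency = 12)

# Additive Holt-Winters with level (alpha), trend (beta) and seasonal (gamma)
# smoothing parameters fixed purely for illustration
hw_fit <- HoltWinters(fee_ts, alpha = 0.3, beta = 0.1, gamma = 0.3,
                      seasonal = "additive")

# Six-month-ahead forecast with 95% prediction intervals
hw_fc <- predict(hw_fit, n.ahead = 6, prediction.interval = TRUE, level = 0.95)
print(round(hw_fc, 1))
```

The returned matrix has columns `fit`, `upr` and `lwr`. With so few observations the interval width is driven by only twelve in-sample residuals, so it should be treated as a rough guide.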
9 Integrated Findings
9.1 How the Five Analyses Connect
# Summary table linking the five techniques
tibble(
  Step = as.character(1:5),
  Technique = c(
    "Classification (S5)", "Explainability (S6)", "Clustering (S7)",
    "PCA (S8)", "Time Series (S9)"
  ),
  Question = c(
    "Which engagements are High Value?",
    "Why is an engagement High Value?",
    "Which client groups exist and how do they behave?",
    "What is the underlying portfolio structure?",
    "Where is revenue heading and when does it peak or dip?"
  ),
  Key_Output = c(
    "RF model; 95% accuracy; AUC 0.970",
    "Cash collection = top driver (17.3% importance)",
    "4 segments; Cluster C collection rate < 70%",
    "4 PCs explain 80%+ variance; PC1 = Financial Scale",
    "Flat trend; 25% monthly gap; Jan peak; Feb dip"
  ),
  Feeds_Into = c(
    "Sections 6, 7, 8 — explains and contextualises the prediction",
    "Identifies which variables to prioritise in cluster profiling",
    "Cluster labels added to improve classification AUC",
    "Validates cluster separation; provides executive portfolio view",
    "Confirms revenue plateau; makes mix-shift strategy urgent"
  )
) |>
  kable(col.names = c("Step", "Technique", "Question", "Key Output", "Feeds Into")) |>
  kable_styling(
    bootstrap_options = c("striped", "hover", "bordered"),
    full_width = TRUE, font_size = 12
  ) |>
  column_spec(1, bold = TRUE, width = "1cm") |>
  column_spec(2, bold = TRUE, width = "3cm")
The Analytical Chain — Five Techniques, One Diagnosis
| Step | Technique | Question | Key Output | Feeds Into |
|------|-----------|----------|------------|------------|
| 1 | Classification (S5) | Which engagements are High Value? | RF model; 95% accuracy; AUC 0.970 | Sections 6, 7, 8 — explains and contextualises the prediction |
| 2 | Explainability (S6) | Why is an engagement High Value? | Cash collection = top driver (17.3% importance) | Identifies which variables to prioritise in cluster profiling |
| 3 | Clustering (S7) | Which client groups exist and how do they behave? | 4 segments; Cluster C collection rate < 70% | Cluster labels added to improve classification AUC |
| 4 | PCA (S8) | What is the underlying portfolio structure? | 4 PCs explain 80%+ variance; PC1 = Financial Scale | Validates cluster separation; provides executive portfolio view |
| 5 | Time Series (S9) | Where is revenue heading and when does it peak or dip? | Flat trend; 25% monthly gap; Jan peak; Feb dip | Confirms revenue plateau; makes mix-shift strategy urgent |
9.1.1 The Convergent Diagnosis
Five separate analyses, applied independently, converge on the same commercial diagnosis: Stransact is not under-performing on fee-setting or client acquisition — it is under-performing on revenue extraction from engagements it has already won.
The classifier shows that collected amount and collection rate are the two strongest predictors of high-value status — more influential than fee size, service line or client tier. The clustering analysis confirms this: Cluster C — 69 engagements, 35% of the portfolio — has a collection rate below 70% despite reasonable agreed fees.
The time series quantifies the monthly cost: approximately 25% of billed revenue sits uncollected in every period. Three techniques are pointing at the same problem from three different analytical angles.
9.2 Recommendation
# Recommendation table: evidence from each technique and the action it supports
tibble(
  Technique = c("Classification", "Explainability", "Clustering", "PCA", "Time Series"),
  Evidence = c(
    "95% accuracy; cash collection = top predictive feature",
    "Collected_Amt and Collection_Rate account for 26% of model weight",
    "Cluster C (n=69) has <70% collection rate and 30% HV rate",
    "Financial Scale (PC1) separates HV from Standard in 2D space",
    "Flat revenue trend; 25% monthly collection gap; Feb dip confirmed"
  ),
  Action = c(
    "Deploy RF model as a pre-acceptance scoring tool for new proposals",
    "Set minimum 85% collection rate as an engagement acceptance condition",
    "Launch 30-day targeted collections intervention on all Cluster C work",
    "Use PCA biplot quarterly as a partner portfolio health review tool",
    "Set monthly collection targets; run proactive BD in January"
  )
) |>
  kable(col.names = c("Technique", "Evidence", "Action It Supports")) |>
  kable_styling(
    bootstrap_options = c("striped", "hover", "bordered"),
    full_width = TRUE
  ) |>
  column_spec(1, bold = TRUE, width = "2.5cm") |>
  column_spec(2, width = "6cm") |>
  column_spec(3, width = "6cm")
Five Analyses — One Recommendation
| Technique | Evidence | Action It Supports |
|-----------|----------|--------------------|
| Classification | 95% accuracy; cash collection = top predictive feature | Deploy RF model as a pre-acceptance scoring tool for new proposals |
| Explainability | Collected_Amt and Collection_Rate account for 26% of model weight | Set minimum 85% collection rate as an engagement acceptance condition |
| Clustering | Cluster C (n=69) has <70% collection rate and 30% HV rate | Launch 30-day targeted collections intervention on all Cluster C work |
| PCA | Financial Scale (PC1) separates HV from Standard in 2D space | Use PCA biplot quarterly as a partner portfolio health review tool |
| Time Series | Flat revenue trend; 25% monthly collection gap; Feb dip confirmed | Set monthly collection targets; run proactive BD in January |
Stransact should implement a data-driven engagement quality framework — using the Random Forest classifier as a pre-acceptance screen, the four cluster profiles as a portfolio management tool, and monthly collection tracking as an early-warning system — with the primary objective of moving 20% of current Cluster C engagements into Cluster B commercial performance within 12 months, through targeted collections discipline, pricing review, and selective client portfolio rationalisation.
10 Limitations of the Study
Although the study yields commercially valuable insights into the drivers of high-value engagements at Stransact, several limitations of the research process and findings should be noted.
First, the study relied on a relatively small operational sample of 200 engagement observations accumulated over the two years from January 2024 to December 2025. Although sufficient for exploratory data analysis, classification modelling, clustering and forecasting, a sample of this size is unlikely to guarantee model stability or predictive robustness over longer periods.
Revenue in professional services firms is often affected by factors such as economic cycles, regulatory deadlines and advisory needs, which could be explored more fully with a larger dataset spanning multiple business cycles.
Second, the dataset consists of the operational records of a single professional services firm. The findings therefore pertain specifically to the operational structure, pricing, services and client portfolio of Stransact, which limits their applicability to other firms in the market. While many of the insights may apply to comparable firms, the results cannot automatically be extended to all professional services firms without further investigation.
Third, several variables known to affect engagement value were missing from the operational records provided to the author. These include indicators of client satisfaction, proposal conversion rates, relationship strength with partners, macroeconomic conditions, and demand shocks in specific sectors. The models therefore focus on commercially observable operational patterns rather than the broader strategic context that drives client value creation.
Fourth, although the Random Forest classification model yielded strong results, its explainability through SHAP analysis and variable importance remains associational in nature. In other words, the models help identify variables related to commercially successful engagements, but they do not establish causation. For instance, the model indicates a very strong correlation between high collection rates and high-value engagements, but it does not prove that collections alone are responsible for the strategic value of engagements.
Furthermore, given the relatively short span of available monthly revenue records, the Holt-Winters and ARIMA models are likely sufficient for short-term operational forecasting. However, their prediction intervals are relatively wide, which can be attributed to the small number of observations used. A longer revenue history would yield more accurate trend estimates, seasonality patterns and forecasts.
Finally, the study was performed under the practical constraints of an MBA analytics project. Consequently, the analysis focused on efficient and interpretable machine learning models rather than computationally intensive deep learning or more elaborate ensemble methods.
10.1 Further Work
Several avenues could be pursued in future work.
To begin with, extending the observation period to five or more years would enable a longitudinal analysis of the dataset. This would help in modelling revenue trends over time, recurring compliance cycles, client retention dynamics, and sensitivity to macroeconomic changes.
Similarly, the addition of operational records from other professional services firms would facilitate comparative analyses and increase the external validity of the models developed.
Future studies could also introduce more behavioural and relational variables, such as proposal success rates, client satisfaction levels, engagement turnaround times, level of partner involvement, delayed payment patterns, and referral generation. These variables would help explain how commercial relationships develop within professional services firms.
In terms of modelling techniques, more advanced machine learning architectures could be deployed if more computing power and data become available. Possible directions include gradient boosting machines (XGBoost and LightGBM), ensemble learning frameworks, Bayesian forecasting models, neural network-based time series models, and survival models for client retention. These techniques may improve predictive performance on larger datasets.
Furthermore, future projects could aim to implement real-time operational dashboards that are integrated directly into business development and engagement management processes. In contrast to this project, which focuses on retrospective reporting, the next steps could involve implementing tools to score ongoing engagements and prioritise clients based on revenue intelligence.
Another useful direction for further research would be integrating financial forecasting with workforce planning and capacity optimisation models. As the study revealed, utilisation, billing efficiency and collections were the key commercial drivers. Connecting revenue intelligence with workforce allocation could therefore improve the deployment of consulting and compliance teams.
Lastly, it would be useful to assess the organisational impact of introducing revenue intelligence systems. For example, future work could evaluate whether predictive engagement scoring and client segmentation contribute to revenue growth, profitability, collections, client retention, and expansion of service lines.
References
Adubi, O. (2026). Anonymised Stransact Revenue Intelligence Engagement Dataset. Collected from Stransact, Lagos, Nigeria. Data available on request from the author.
Boehmke, B., & Greenwell, B. M. (2020). Hands-on machine learning with R. CRC Press.
Hyndman, R. J., & Athanasopoulos, G. (2021). Forecasting: Principles and practice (3rd ed.). OTexts. https://otexts.com/fpp3/
James, G., Witten, D., Hastie, T., & Tibshirani, R. (2021). An introduction to statistical learning: With applications in R (2nd ed.). Springer.
Kuhn, M., & Johnson, K. (2019). Feature engineering and selection: A practical approach for predictive models. CRC Press.
Kuhn, M., & Wickham, H. (2020). Tidymodels: A collection of packages for modeling and machine learning using tidyverse principles in R. https://www.tidymodels.org/
Müller, K., & Wickham, H. (2023). tibble: Simple data frames (R package version 3.2.1). https://CRAN.R-project.org/package=tibble
Pedersen, T. L. (2024). patchwork: The composer of plots (R package version 1.2.0). https://CRAN.R-project.org/package=patchwork
Robinson, D., & Hayes, A. (2024). broom: Convert statistical analysis objects into tidy tibbles (R package version 1.0.6). https://CRAN.R-project.org/package=broom
Wickham, H. (2016). ggplot2: Elegant graphics for data analysis. Springer.
Wickham, H., Averick, M., Bryan, J., Chang, W., McGowan, L., François, R., Grolemund, G., Hayes, A., Henry, L., Hester, J., Kuhn, M., Pedersen, T. L., Miller, E., Bache, S. M., Müller, K., Ooms, J., Robinson, D., Seidel, D. P., Spinu, V., Takahashi, K., Vaughan, D., Wilke, C., Woo, K., & Yutani, H. (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686. https://doi.org/10.21105/joss.01686
Wright, M. N., & Ziegler, A. (2017). ranger: A fast implementation of random forests for high dimensional data in C++ and R. Journal of Statistical Software, 77(1), 1–17. https://doi.org/10.18637/jss.v077.i01
Yu, G. (2024). ggplotify: Convert plot to ggplot object (R package version 0.1.2). https://CRAN.R-project.org/package=ggplotify
Zwillinger, D., & Kokoska, S. (2000). CRC standard probability and statistics tables and formulae. CRC Press.
Appendix: AI Usage Statement
Claude (Anthropic) assisted with R code generation, error debugging, and document structuring during this project. Specifically, AI assistance was used to generate code chunks for the tidymodels classification pipeline, clustering visualisations using factoextra, PCA implementation using tidymodels recipes, and time series forecasting using the forecast package. AI tools also assisted with resolving package compatibility errors encountered during rendering.
All analytical decisions were made independently by the author. These include the choice of the five techniques and their justification within the Stransact business context, the construction and threshold selection for the High_Value_Engagement target variable, the interpretation of all model outputs and performance metrics, the naming and commercial interpretation of the four engagement clusters, and the integrated recommendation in Section 9. Every result was reviewed, validated and interpreted by the author before inclusion. The data was extracted, anonymised and prepared by the author from Stransact’s internal operating systems.