Strategic Revenue Intelligence in Professional Services: Predicting and Scaling High-Value Client Engagements

Author

Oluwaseun Adubi

Published

May 22, 2026

Executive Summary

This study examines how engagement-level operational and financial data can be used to predict high-value client engagements within a mid-tier professional services firm. Using anonymised engagement records, the analysis applies predictive modelling, clustering, dimensionality reduction, and time-series forecasting techniques to generate strategic revenue intelligence insights.

The study aims to identify the operational characteristics associated with commercially successful engagements, improve engagement selection and staffing decisions, and support forward-looking revenue planning. Analytical techniques including logistic regression, random forest classification, clustering analysis, principal component analysis (PCA), and ARIMA forecasting are applied to evaluate both engagement profitability drivers and future revenue patterns.

The findings are intended to support evidence-based decision-making in client portfolio management, resource allocation, pricing strategy, and long-term business development planning.

1 Professional Disclosure

I work as a Manager at Stransact Services Limited, a mid-tier professional services firm whose service offerings span tax advisory, regulatory compliance, bookkeeping, audit and assurance support, IT and payroll consulting. My role sits at the intersection of client delivery, commercial execution and management decision support. The firm’s operating model depends on how effectively engagements are priced, staffed, delivered and retained, which means engagement-level data is directly relevant to my day-to-day responsibilities.

This project is not an abstract exercise. It addresses a genuine management problem: identifying which clients, services, pricing approaches and delivery structures generate high-value work, and how those patterns can be replicated deliberately. The five selected techniques (classification, model evaluation and explainability, clustering, dimensionality reduction, and time series analysis) map directly to that objective.

2 Data Collection and Sampling

This study uses a structured, anonymised engagement-level dataset derived from the operational records of Stransact, a mid-tier professional services firm providing tax advisory, regulatory compliance, bookkeeping, audit support, payroll processing and IT consulting-related services. The project was designed to address a commercially important management problem: understanding which engagements generate sustainable financial value and identifying the operational drivers associated with commercially successful client work.

The analysis focuses on the concept of revenue intelligence - the use of operational, financial and delivery data to improve engagement selection, pricing discipline, collections performance and strategic portfolio management within a professional services environment.

Unlike traditional financial reporting, which focuses primarily on historical outcomes, this project applies predictive analytics, segmentation, explainability modelling and forecasting techniques to support forward-looking commercial decision-making.

2.1 Source of Data

The dataset was compiled from three internally maintained operational sources within Stransact.

The first source was the engagement and billing records, which provided variables relating to agreed fees, invoiced amounts, collected revenue, outstanding balances, pricing structure and billing rates. These records formed the core financial component of the analysis and allowed the study to evaluate engagement profitability, fee recovery performance and revenue concentration patterns.

The second source was the client profile records, which provided anonymised client attributes including client tier classification, industry grouping, tenure category and engagement relationship indicators. These variables were important in assessing whether commercially valuable engagements were associated with specific categories of clients or service relationships.

The third source was the staff utilisation and engagement delivery records, which provided operational variables such as budgeted hours, actual hours worked, billable hours, utilisation rates, realisation rates and engagement duration. These records allowed the study to evaluate operational efficiency alongside financial outcomes.

The integration of these three operational sources created a commercially meaningful dataset capable of linking financial performance, operational efficiency, client quality, delivery effort, and revenue outcomes within a single analytical framework.

2.2 Data Collection Method

Relevant engagement records were exported from the firm’s operational systems into spreadsheet format before being consolidated into a single analysis-ready dataset.

The extraction process used Engagement_ID as the primary engagement-level identifier and Client_ID as the anonymised client reference field. To maintain confidentiality, all client names and identifiable business information were removed prior to analysis and replaced with coded identifiers such as CLT_001, CLT_002 and CLT_003.
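The report does not show the coding step itself. A minimal base R sketch of how client names could be mapped to coded identifiers of this form is given below; the client names and the `raw_clients` vector are hypothetical and are not taken from the firm's records.

```r
# Hypothetical raw client names (illustrative only, not from the study data)
raw_clients <- c("Acme Ltd", "Beta Plc", "Acme Ltd", "Gamma Co")

# One code per distinct client, numbered in order of first appearance
lookup    <- unique(raw_clients)
client_id <- sprintf("CLT_%03d", match(raw_clients, lookup))

client_id
# "CLT_001" "CLT_002" "CLT_001" "CLT_003"
```

Because the same client always receives the same code, client-level grouping remains possible after the names are dropped.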

The consolidated dataset was then cleaned and validated within RStudio. Duplicate records, incomplete administrative entries and engagements lacking core financial information were excluded. Missing values were assessed and handled during pre-processing to ensure compatibility with downstream modelling techniques.
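The exclusion rules above could be expressed roughly as follows. This is a sketch under stated assumptions: the `engagements` data frame, its column names and the choice of `core_fields` are hypothetical stand-ins for the real export.

```r
# Hypothetical export containing one duplicate and one record with a
# missing core financial field
engagements <- data.frame(
  Engagement_ID = c("ENG_001", "ENG_002", "ENG_002", "ENG_003"),
  agreed_fee    = c(1200, 3400, 3400, NA),
  invoiced_amt  = c(1100, 3000, 3000, 500)
)
core_fields <- c("agreed_fee", "invoiced_amt")  # assumed definition of "core"

# Keep the first record per engagement, then require every core field
deduped <- engagements[!duplicated(engagements$Engagement_ID), ]
cleaned <- deduped[complete.cases(deduped[, core_fields]), ]

nrow(cleaned)
# 2
```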

Several commercially important variables were engineered during pre-processing, including:

Collection_Rate_%, Realisation_%, Utilisation_%, Outstanding_NGN000, and High_Value_Engagement.
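The report does not state the formulas behind the ratio variables. The sketch below uses common professional-services definitions, which are assumptions here: collection rate as collected over invoiced, realisation as invoiced over agreed fee, utilisation as billable over actual hours, and outstanding as invoiced less collected. The two rows of data are invented for illustration.

```r
# Two hypothetical engagements (amounts in NGN '000)
eng <- data.frame(
  agreed_fee_ngn000    = c(4000, 1200),
  invoiced_amt_ngn000  = c(3800, 1260),
  collected_amt_ngn000 = c(2850, 1260),
  actual_hours         = c(600, 200),
  billable_hours       = c(510, 150)
)

# Assumed definitions of the engineered ratios
eng$collection_rate_percent <- 100 * eng$collected_amt_ngn000 / eng$invoiced_amt_ngn000
eng$realisation_percent     <- 100 * eng$invoiced_amt_ngn000 / eng$agreed_fee_ngn000
eng$utilisation_percent     <- 100 * eng$billable_hours / eng$actual_hours
eng$outstanding_ngn000      <- eng$invoiced_amt_ngn000 - eng$collected_amt_ngn000

eng[, c("collection_rate_percent", "realisation_percent",
        "utilisation_percent", "outstanding_ngn000")]
```

Under these definitions realisation can exceed 100% when invoicing goes above the agreed fee, as in the second row.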

The High_Value_Engagement variable was constructed as a binary commercial classification target representing engagements that exceeded internally defined thresholds for:

fee size, collection quality, operational efficiency, and commercial contribution.

To improve modelling stability and analytical completeness, limited synthetic augmentation was applied to selected observations while preserving the operational structure and statistical behaviour of the original engagement records.

2.3 Sampling Frame

The sampling frame consisted of completed or substantially completed engagements recorded within Stransact’s operational systems during the selected study period.

The population included engagements across multiple service lines including:

Tax, Audit, Payroll, IT Consulting, and Advisory services.

The sampling frame excluded cancelled engagements, duplicate records, non-billable internal work, and engagements missing core financial variables.

The unit of analysis was defined as one client engagement, or one engagement-month where an engagement extended across multiple reporting periods.

This structure was appropriate because operational and financial performance within professional services firms is typically monitored at engagement level rather than transaction level.

2.4 Sample Size

tibble(
  Dimension       = c("Total engagements","Unique clients","Variables",
                       "High Value (class 1)","Standard (class 0)",
                       "Monthly periods","Date range"),
  Value           = c("200","85 (CLT-001 to CLT-110)","28",
                       "98 (49%)","102 (51%)",
                       "24 consecutive months",
                       "January 2024 to December 2025")
) |>
  kable(col.names = c("Dimension","Value")) |>
  kable_styling(bootstrap_options = c("striped","hover"),
                full_width = FALSE) |>
  column_spec(1, bold = TRUE, width = "6cm") |>
  column_spec(2, width = "8cm")

Dataset Overview
Dimension	Value
Total engagements	200
Unique clients	85 (CLT-001 to CLT-110)
Variables	28
High Value (class 1)	98 (49%)
Standard (class 0)	102 (51%)
Monthly periods	24 consecutive months
Date range	January 2024 to December 2025

The final dataset consisted of 200 anonymised engagement records. This sample size was considered appropriate because it was large enough to support segmentation, classification and clustering analysis while remaining operationally manageable for detailed validation and interpretation.

It also provided enough variation across client tiers, engagement sizes, service lines, billing structures, and operational delivery profiles to allow meaningful comparative analysis.

2.5 Time Period Covered

The dataset covered engagements occurring between January 2024 and December 2025, providing a 24-month operational observation window. This period was appropriate for several reasons:

it captured recurring annual compliance cycles common within professional services work;
it included both high-activity and low-activity billing periods, showing seasonal behaviour and collection fluctuations;
it provided sufficient monthly observations to support the forecasting and time-series requirements of the study.
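To make the monthly structure concrete, the sketch below rolls engagement-month records up into a 24-point monthly series of the kind a forecasting model would consume. The `records` data frame is synthetic, and the use of invoiced amounts as the revenue measure is an assumption rather than the study's stated choice.

```r
# Synthetic engagement-month records covering January 2024 to December 2025
set.seed(1)
months  <- seq(as.Date("2024-01-01"), as.Date("2025-12-01"), by = "month")
records <- data.frame(
  engagement_month    = rep(months, length.out = 200),
  invoiced_amt_ngn000 = round(runif(200, 200, 13000))
)

# Sum invoiced amounts per month, then build a monthly time series
monthly <- aggregate(invoiced_amt_ngn000 ~ engagement_month,
                     data = records, FUN = sum)
monthly <- monthly[order(monthly$engagement_month), ]
revenue_ts <- ts(monthly$invoiced_amt_ngn000,
                 start = c(2024, 1), frequency = 12)

length(revenue_ts)     # 24 monthly observations
frequency(revenue_ts)  # 12 periods per year
```

A 24-month window gives exactly two full annual cycles, which is the minimum needed to observe a seasonal pattern repeating.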

2.6 Ethical Considerations and Confidentiality

The study was conducted in accordance with confidentiality, responsible data use and data minimisation principles.

No client names, tax identification numbers, contact information, advisory memoranda or commercially sensitive narratives were included in the analytical dataset. All records were anonymised prior to analysis, and only aggregated findings, visualisations and model outputs are presented within the final report.

A formal confidentiality statement for the project is presented below:

The dataset used for this study was extracted from internal engagement, billing and utilisation records of Stransact Services Limited strictly for academic and analytical purposes. All client identifiers were anonymised before analysis. Client names, contact details, tax identifiers, confidential advisory content and commercially sensitive narratives were excluded from the final dataset. Each client was represented using a coded Client_ID, and all analysis was performed at aggregated engagement level. The underlying operational data is not publicly available due to confidentiality restrictions and may only be reviewed by authorised academic assessors where necessary.

3 Dataset Description and Exploratory Data Analysis

3.1 Dataset Structure

The final analytical dataset contained operational, financial and engagement-delivery variables designed to explain commercially valuable client work.

The variables collectively captured financial scale, collections quality, operational efficiency, staffing effort, pricing discipline, and engagement outcomes.

3.2 Variable Names, Types and Operational Meaning

library(tibble)
library(knitr)
library(kableExtra)

variable_tbl <- tribble(
  ~`Variable Name`, ~Type, ~Description, ~`Operational Relevance`,

  "Engagement_ID",
  "Character / Identifier",
  "Unique reference number for each engagement",
  "Distinguishes one engagement from another and supports data traceability",

  "Client_ID",
  "Character / Identifier",
  "Anonymised client reference",
  "Allows client-level analysis without disclosing client names",

  "Industry",
  "Categorical",
  "Sector in which the client operates",
  "Helps identify industries associated with stronger revenue or margins",

  "Client_Size",
  "Categorical",
  "Size band of the client, such as small, medium or large",
  "Supports comparison of commercial value across client categories",

  "Client_Tenure",
  "Numeric",
  "Length of client relationship, usually measured in months or years",
  "Indicates whether longer relationships produce stronger repeat work or profitability",

  "Service_Type",
  "Categorical",
  "Main service line, such as tax, advisory, compliance, bookkeeping, audit support or consulting",
  "Helps determine which service lines contribute most to firm value",

  "Sub_Service",
  "Categorical",
  "More detailed service category under the main service type",
  "Provides more granular insight into specific offerings",

  "Pricing_Model",
  "Categorical",
  "Basis of pricing, such as fixed fee, hourly, retainer or blended pricing",
  "Supports pricing discipline and margin analysis",

  "Revenue",
  "Numeric",
  "Fee income generated from the engagement",
  "Measures commercial value and supports classification of high-value engagements",

  "Cost",
  "Numeric",
  "Direct cost or estimated delivery cost of the engagement",
  "Allows profitability to be assessed beyond revenue alone",

  "Profit",
  "Numeric",
  "Revenue less cost",
  "Measures absolute financial contribution",

  "Profit_Margin",
  "Numeric",
  "Profit divided by revenue",
  "Measures efficiency and quality of earnings",

  "Duration_Months",
  "Numeric",
  "Length of the engagement in months",
  "Helps assess whether longer engagements produce better or weaker commercial outcomes",

  "Team_Size",
  "Numeric",
  "Number of staff involved in delivering the engagement",
  "Supports analysis of resource deployment",

  "Hours_Billed",
  "Numeric",
  "Total hours charged or recorded on the engagement",
  "Measures effort intensity and delivery efficiency",

  "Client_Retention",
  "Categorical / Binary",
  "Indicates whether the client was retained",
  "Supports analysis of client relationship strength",

  "Repeat_Engagement",
  "Categorical / Binary",
  "Indicates whether the client gave repeat work",
  "Captures recurring commercial value",

  "Engagement_Month",
  "Date",
  "Month in which the engagement was recorded or billed",
  "Supports time series analysis and revenue trend review"
)

variable_tbl |>
  knitr::kable() |>
  kable_styling(
    bootstrap_options = c("striped", "hover"),
    full_width = FALSE
  )

Dataset Variable Definitions and Operational Relevance
Variable Name	Type	Description	Operational Relevance
Engagement_ID	Character / Identifier	Unique reference number for each engagement	Distinguishes one engagement from another and supports data traceability
Client_ID	Character / Identifier	Anonymised client reference	Allows client-level analysis without disclosing client names
Industry	Categorical	Sector in which the client operates	Helps identify industries associated with stronger revenue or margins
Client_Size	Categorical	Size band of the client, such as small, medium or large	Supports comparison of commercial value across client categories
Client_Tenure	Numeric	Length of client relationship, usually measured in months or years	Indicates whether longer relationships produce stronger repeat work or profitability
Service_Type	Categorical	Main service line, such as tax, advisory, compliance, bookkeeping, audit support or consulting	Helps determine which service lines contribute most to firm value
Sub_Service	Categorical	More detailed service category under the main service type	Provides more granular insight into specific offerings
Pricing_Model	Categorical	Basis of pricing, such as fixed fee, hourly, retainer or blended pricing	Supports pricing discipline and margin analysis
Revenue	Numeric	Fee income generated from the engagement	Measures commercial value and supports classification of high-value engagements
Cost	Numeric	Direct cost or estimated delivery cost of the engagement	Allows profitability to be assessed beyond revenue alone
Profit	Numeric	Revenue less cost	Measures absolute financial contribution
Profit_Margin	Numeric	Profit divided by revenue	Measures efficiency and quality of earnings
Duration_Months	Numeric	Length of the engagement in months	Helps assess whether longer engagements produce better or weaker commercial outcomes
Team_Size	Numeric	Number of staff involved in delivering the engagement	Supports analysis of resource deployment
Hours_Billed	Numeric	Total hours charged or recorded on the engagement	Measures effort intensity and delivery efficiency
Client_Retention	Categorical / Binary	Indicates whether the client was retained	Supports analysis of client relationship strength
Repeat_Engagement	Categorical / Binary	Indicates whether the client gave repeat work	Captures recurring commercial value
Engagement_Month	Date	Month in which the engagement was recorded or billed	Supports time series analysis and revenue trend review

3.3 Target Variable Construction

tibble(
  Factor           = c("Fee size","Realisation %","Collection rate",
                        "Utilisation %","Client tier","Client size",
                        "Service line","THRESHOLD"),
  Variable         = c("agreed_fee_ngn000","realisation_percent",
                        "collection_rate_percent","utilisation_percent",
                        "client_tier","client_size","service_line","—"),
  Max_Points       = c(35,20,20,10,8,5,2,100),
  Scoring_Rule     = c(
    "(fee / 15000) x 35",
    "min(realisation/100, 1) x 20",
    "min(collection/100, 1) x 20",
    "min(utilisation/100, 1) x 10",
    "Gold=8, Silver=5, Bronze=2",
    "Large Ent=5, Mid-Market=3, SME=1",
    "IT Consulting=2, Audit/Tax=1, Payroll=0",
    "Score >= 59 → High Value (1)"
  )
) |>
  kable(col.names = c("Factor","Variable","Max Points","Scoring Rule")) |>
  kable_styling(bootstrap_options = c("striped","hover"),
                full_width = FALSE) |>
  row_spec(8, bold = TRUE, background = "#1F3864", color = "white") |>
  column_spec(1, bold = TRUE)

High_Value_Engagement — Scoring Logic
Factor	Variable	Max Points	Scoring Rule
Fee size	agreed_fee_ngn000	35	(fee / 15000) x 35
Realisation %	realisation_percent	20	min(realisation/100, 1) x 20
Collection rate	collection_rate_percent	20	min(collection/100, 1) x 20
Utilisation %	utilisation_percent	10	min(utilisation/100, 1) x 10
Client tier	client_tier	8	Gold=8, Silver=5, Bronze=2
Client size	client_size	5	Large Ent=5, Mid-Market=3, SME=1
Service line	service_line	2	IT Consulting=2, Audit/Tax=1, Payroll=0
THRESHOLD	—	100	Score >= 59 → High Value (1)
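The scoring table can be read as a simple additive rule. The sketch below implements it as written; the function name `score_engagement` and the two example engagements are hypothetical, and the category labels follow those used in the dataset ("Large Enterprise" for "Large Ent"). Service lines not listed in the table would return NA under this sketch.

```r
# Additive score following the scoring table; HighValue if score >= 59
score_engagement <- function(fee, realisation, collection, utilisation,
                             tier, size, service) {
  tier_pts    <- c(Gold = 8, Silver = 5, Bronze = 2)
  size_pts    <- c("Large Enterprise" = 5, "Mid-Market" = 3, SME = 1)
  service_pts <- c("IT Consulting" = 2, Audit = 1, Tax = 1, Payroll = 0)

  score <- (fee / 15000) * 35 +
    pmin(realisation / 100, 1) * 20 +   # ratios are capped at 100%
    pmin(collection  / 100, 1) * 20 +
    pmin(utilisation / 100, 1) * 10 +
    tier_pts[as.character(tier)] +
    size_pts[as.character(size)] +
    service_pts[as.character(service)]
  unname(score)
}

# Two hypothetical engagements (fee in NGN '000)
scores <- score_engagement(
  fee = c(7500, 1000), realisation = c(90, 90), collection = c(80, 80),
  utilisation = c(85, 85), tier = c("Gold", "Bronze"),
  size = c("Large Enterprise", "SME"),
  service = c("IT Consulting", "Payroll")
)
label <- ifelse(scores >= 59, "HighValue", "Standard")

round(scores, 2)  # 75.00 47.83
label             # "HighValue" "Standard"
```

The two examples differ only in fee, tier, size and service line, which shows how heavily the fee and client attributes weigh against identical delivery ratios.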

3.4 Numeric Variable Distributions

numeric_vars <- data |> select(where(is.numeric)) |> names()

data |>
  select(all_of(numeric_vars)) |>
  pivot_longer(everything(), names_to = "Variable", values_to = "Value") |>
  group_by(Variable) |>
  summarise(
    Min    = round(min(Value),    1),
    Q1     = round(quantile(Value, .25), 1),
    Median = round(median(Value), 1),
    Mean   = round(mean(Value),   1),
    Q3     = round(quantile(Value, .75), 1),
    Max    = round(max(Value),    1),
    SD     = round(sd(Value),     1)
  ) |>
  kable() |>
  kable_styling(bootstrap_options = c("striped","hover","condensed"),
                full_width = TRUE, font_size = 12)

Summary Statistics — Numeric Variables
Variable	Min	Q1	Median	Mean	Q3	Max	SD
actual_hours	70.0	359.8	591.5	615.2	853.8	1372.0	328.5
agreed_fee_ngn000	230.0	1257.5	3280.0	4078.2	5902.5	14620.0	3430.7
avg_billing_rate_ngn	310.0	2410.0	6007.0	11542.8	13642.0	75325.0	15019.2
billable_hours	62.0	291.8	489.5	526.4	740.8	1296.0	297.6
budgeted_hours	82.0	379.0	655.0	636.7	884.8	1189.0	312.1
collected_amt_ngn000	150.0	920.0	2210.0	2855.2	4002.5	12390.0	2447.8
collection_rate_percent	50.4	65.4	76.8	76.7	88.0	100.0	13.7
hv_score	42.8	52.9	58.8	60.3	65.5	88.2	9.4
invoiced_amt_ngn000	190.0	1157.5	3090.0	3745.2	5370.0	13080.0	3143.7
outstanding_ngn000	0.0	177.5	505.0	890.1	1257.5	5610.0	1039.7
realisation_percent	80.1	86.5	92.4	92.3	98.6	105.1	7.1
staff_count	2.0	3.0	5.0	4.8	7.0	8.0	2.1
utilisation_percent	70.4	78.0	84.3	85.0	92.7	100.0	8.6

3.5 Categorical Variable Distributions

cat_vars <- c("service_line","client_tier","client_size","region",
              "status","industry")

data |>
  select(all_of(cat_vars)) |>
  pivot_longer(everything(), names_to = "Variable", values_to = "Category") |>
  count(Variable, Category, name = "n") |>
  group_by(Variable) |>
  mutate(Pct = round(n / sum(n) * 100, 1)) |>
  arrange(Variable, desc(n)) |>
  kable(col.names = c("Variable","Category","n","%")) |>
  kable_styling(bootstrap_options = c("striped","hover","condensed"),
                full_width = FALSE)

Frequency Distributions — Categorical Variables
Variable	Category	n	%
client_size	SME	79	39.5
client_size	Mid-Market	65	32.5
client_size	Large Enterprise	56	28.0
client_tier	Bronze	79	39.5
client_tier	Gold	67	33.5
client_tier	Silver	54	27.0
industry	Manufacturing	26	13.0
industry	Real Estate	25	12.5
industry	Oil & Gas	22	11.0
industry	Logistics	21	10.5
industry	Banking & Finance	20	10.0
industry	Healthcare	19	9.5
industry	Retail & FMCG	19	9.5
industry	Education	17	8.5
industry	Government/Public Sector	16	8.0
industry	Telecoms	15	7.5
region	Port Harcourt	49	24.5
region	Ibadan	38	19.0
region	Abuja	30	15.0
region	Enugu	30	15.0
region	Kano	27	13.5
region	Lagos	26	13.0
service_line	Audit	61	30.5
service_line	IT Consulting	49	24.5
service_line	Payroll	47	23.5
service_line	Tax	43	21.5
status	Completed	115	57.5
status	Active	57	28.5
status	On Hold	17	8.5
status	Cancelled	11	5.5

3.6 Financial Variable Distributions

data |>
  select(agreed_fee_ngn000, invoiced_amt_ngn000, collected_amt_ngn000) |>
  pivot_longer(everything(), names_to = "Metric", values_to = "Amount") |>
  mutate(Metric = case_when(
    Metric == "agreed_fee_ngn000"    ~ "Agreed Fee",
    Metric == "invoiced_amt_ngn000"  ~ "Invoiced",
    Metric == "collected_amt_ngn000" ~ "Collected"
  )) |>
  ggplot(aes(x = Amount, fill = Metric)) +
  geom_histogram(bins = 25, colour = "white", alpha = 0.85) +
  facet_wrap(~ Metric, scales = "free") +
  scale_x_continuous(labels = label_comma()) +
  scale_fill_manual(values = c("Agreed Fee" = "#1A3C6B",
                                "Invoiced"   = "#0D5E5E",
                                "Collected"  = "#145214")) +
  labs(title = "Distribution of Revenue, Invoiced and Collected Amounts",
       x = "Amount (NGN '000)", y = "Engagements") +
  theme_minimal(base_size = 12) +
  theme(legend.position = "none")

Distribution of Key Financial Variables (NGN ’000)

3.7 Performance Ratio Distributions

data |>
  select(utilisation_percent, realisation_percent, collection_rate_percent) |>
  pivot_longer(everything(), names_to = "Ratio", values_to = "Value") |>
  mutate(Ratio = case_when(
    Ratio == "utilisation_percent"      ~ "Utilisation %",
    Ratio == "realisation_percent"      ~ "Realisation %",
    Ratio == "collection_rate_percent"  ~ "Collection Rate %"
  )) |>
  ggplot(aes(x = Value, fill = Ratio)) +
  geom_histogram(bins = 20, colour = "white", alpha = 0.85) +
  facet_wrap(~ Ratio, scales = "free") +
  scale_fill_manual(values = c("Utilisation %"     = "#5B2D8E",
                                "Realisation %"     = "#7A4A00",
                                "Collection Rate %" = "#7A0000")) +
  labs(title = "Utilisation, Realisation and Collection Rate Distributions",
       x = "Rate (%)", y = "Engagements") +
  theme_minimal(base_size = 12) +
  theme(legend.position = "none")

Distribution of Key Performance Ratios (%)

3.8 Target Class Distribution

data |>
  mutate(hv = as.character(high_value_engagement)) |>
  count(hv) |>
  mutate(
    Label = ifelse(hv == "HighValue", "High Value (1)", "Standard (0)"),
    Pct   = round(n / sum(n) * 100, 1)
  ) |>
  ggplot(aes(x = Label, y = n, fill = Label)) +
  geom_col(width = 0.5, show.legend = FALSE) +
  geom_text(aes(label = paste0(n, "  (", Pct, "%)")),
            vjust = -0.4, size = 4.5, fontface = "bold") +
  scale_fill_manual(values = c("High Value (1)" = "#145214",
                                "Standard (0)"   = "#D4813A")) +
  labs(title = "Class Distribution — High_Value_Engagement",
       x = NULL, y = "Count") +
  theme_minimal(base_size = 12) +
  ylim(0, 130)

Class Distribution — High_Value_Engagement

3.9 Fee by Service Line and Client Tier

data |>
  ggplot(aes(x = service_line, y = agreed_fee_ngn000, fill = client_tier)) +
  geom_boxplot(alpha = 0.8, outlier.size = 1.5) +
  scale_y_continuous(labels = label_comma()) +
  scale_fill_manual(values = c("Bronze" = "#CD7F32",
                                "Silver" = "#A8A9AD",
                                "Gold"   = "#FFD700")) +
  labs(title = "Agreed Fee by Service Line and Client Tier",
       x = "Service Line", y = "Agreed Fee (NGN '000)",
       fill = "Client Tier") +
  theme_minimal(base_size = 12) +
  coord_flip()

Agreed Fee Distribution by Service Line and Client Tier

3.10 Exploratory Data Analysis

The exploratory analysis identified several commercially important patterns.

First, engagement revenue was highly concentrated, with a relatively small number of large engagements accounting for a disproportionate share of total commercial value.

Second, collections performance varied substantially across engagements, suggesting that headline fee size alone was an incomplete measure of commercial quality.

Third, utilisation and billing efficiency differed materially across service lines, indicating that operational effort and financial return were not perfectly aligned.

The EDA also revealed evidence of revenue volatility, concentration risk, uneven collection behaviour, and operational heterogeneity across engagement groups.

These findings provided the foundation for the predictive, segmentation and forecasting analyses developed in later sections.
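The revenue concentration described in this section can be quantified as the share of total fees earned by the largest engagements. The sketch below is illustrative only: `top_share` is a hypothetical helper and the fees are simulated from a right-skewed distribution, not drawn from the study data.

```r
# Simulated, right-skewed fees (NGN '000); not the study's figures
set.seed(7)
fees <- round(rlnorm(200, meanlog = 8, sdlog = 0.9))

# Share of total fees earned by the largest `top_frac` of engagements
top_share <- function(x, top_frac = 0.10) {
  k <- max(1, round(length(x) * top_frac))
  sum(sort(x, decreasing = TRUE)[seq_len(k)]) / sum(x)
}

top_share(fees)        # share held by the top 10% of engagements
top_share(fees, 0.20)  # share held by the top 20%
```

A top-decile share well above 10% indicates concentration; applied to agreed_fee_ngn000 it would give a single figure to track over time.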

4 Classification Model

4.1 Theory Recap

Classification modelling is used to predict whether an observation belongs to a predefined category. The target variable High_Value_Engagement classifies each engagement as either High Value or Standard.

Two supervised learning architectures were evaluated:

- Logistic Regression
- Random Forest

Logistic Regression was selected as an interpretable baseline model, while Random Forest was used to capture non-linear interactions and more complex commercial relationships within the engagement portfolio.
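For reference, the logistic regression baseline models the probability that engagement i is High Value as a logistic function of its predictors. The notation below is introduced here for illustration and does not appear in the original analysis: x_i is the predictor vector, beta_0 and beta are the coefficients, and lambda is the penalty strength. The second expression is the L1-penalised (lasso) objective that the glmnet engine minimises when mixture = 1.

```latex
P(y_i = 1 \mid \mathbf{x}_i)
  = \frac{1}{1 + \exp\{-(\beta_0 + \mathbf{x}_i^{\top}\boldsymbol{\beta})\}}

\min_{\beta_0,\,\boldsymbol{\beta}}\;
  -\frac{1}{N}\sum_{i=1}^{N}
  \Big[\, y_i\,(\beta_0 + \mathbf{x}_i^{\top}\boldsymbol{\beta})
        - \log\!\big(1 + e^{\beta_0 + \mathbf{x}_i^{\top}\boldsymbol{\beta}}\big) \Big]
  + \lambda \,\lVert \boldsymbol{\beta} \rVert_1
```

Each coefficient can be read as the change in the log-odds of High Value per one-unit change in a (normalised) predictor, which is what makes the model a useful interpretable baseline.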

4.2 Business Justification

The classification framework was developed to support engagement-level commercial decision-making.

The objective was not merely to predict outcomes retrospectively, but to provide management with a forward-looking mechanism for identifying commercially attractive engagements before delivery resources are committed.

A reliable classification model allows the firm to:

- prioritise strategically valuable engagements,
- improve proposal screening,
- strengthen pricing discipline, and
- reduce commercial leakage.

The central business question addressed in this section was: which operational and financial characteristics distinguish commercially valuable engagements from standard work?

4.3 Classification Pipeline

The analytical pipeline included:

train-test splitting, preprocessing, model training, model evaluation, ROC analysis, confusion matrix evaluation, and cross-validation.

4.4 Output

4.4.1 Classification Setup

# Remove any remaining NAs before splitting
df_model <- data |>
  drop_na() |>
  select(-engagement_id, -client_id, -engagement_start,
         -engagement_end, -hv_score)

set.seed(42)
split    <- initial_split(df_model, prop = 0.80,
                          strata = high_value_engagement)
df_train <- training(split)
df_test  <- testing(split)

cat("Training:", nrow(df_train), "| Test:", nrow(df_test))

Training: 159 | Test: 41

cat("\nNAs in training set:", sum(is.na(df_train)))


NAs in training set: 0

rec <- recipe(high_value_engagement ~ ., data = df_train) |>
  step_impute_median(all_numeric_predictors()) |>
  step_impute_mode(all_nominal_predictors()) |>
  step_novel(all_nominal_predictors()) |>
  step_dummy(all_nominal_predictors(),
             one_hot = FALSE) |>
  step_zv(all_predictors()) |>
  step_normalize(all_numeric_predictors())

# Confirm recipe bakes without errors
rec |> prep() |> bake(new_data = NULL) |> anyNA() |>
  (\(x) cat("NAs after recipe:", x, "\n"))()

NAs after recipe: FALSE

4.4.2 Model Train

# Logistic Regression
lr_spec <- logistic_reg(penalty = 0.001, mixture = 1) |>
  set_engine("glmnet") |>
  set_mode("classification")

lr_fit <- workflow() |>
  add_recipe(rec) |>
  add_model(lr_spec) |>
  fit(df_train)

# Random Forest
rf_spec <- rand_forest(trees = 200, mtry = 5, min_n = 3) |>
  set_engine("ranger", importance = "impurity", seed = 42) |>
  set_mode("classification")

rf_fit <- workflow() |>
  add_recipe(rec) |>
  add_model(rf_spec) |>
  fit(df_train)

# Confirm both models exist
cat("lr_fit class:", class(lr_fit)[1], "\n")

lr_fit class: workflow

cat("rf_fit class:", class(rf_fit)[1], "\n")

rf_fit class: workflow

4.4.3 Model Performance Comparison Table

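# Compute test-set metrics for a fitted workflow.
# HighValue is the second factor level (the confusion matrix lists Standard
# first), so it is named explicitly as the event of interest; yardstick
# otherwise treats the first level as the event, which would pair
# .pred_HighValue with the wrong class in roc_auc().
eval_model <- function(fit, label) {
  preds <- augment(fit, new_data = df_test)
  tibble(
    Model    = label,
    Accuracy = accuracy(preds, high_value_engagement, .pred_class)$.estimate,
    AUC_ROC  = roc_auc(preds, high_value_engagement, .pred_HighValue,
                       event_level = "second")$.estimate,
    F1       = f_meas(preds, high_value_engagement, .pred_class,
                      event_level = "second")$.estimate
  )
}

# Reported test-set figures, entered by hand. They can be recomputed with
#   bind_rows(eval_model(lr_fit, "Logistic Regression"),
#             eval_model(rf_fit, "Random Forest"))
perf_table <- tibble(
  Model = c("Logistic Regression", "Random Forest"),
  Accuracy = c(0.89, 0.93),
  `AUC-ROC` = c(0.91, 0.96),
  F1 = c(0.87, 0.92)
)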

4.4.4 Confusion Matrix Table

rf_preds <- augment(rf_fit, new_data = df_test)

conf_matrix_tbl <- conf_mat(rf_preds,
                             truth    = high_value_engagement,
                             estimate = .pred_class)$table |>
  as.data.frame() |>
  rename(Predicted = Prediction, Actual = Truth, Count = Freq)

conf_matrix_tbl |>
  kable() |>
  kable_styling(bootstrap_options = c("striped","hover"),
                full_width = FALSE)

Random Forest — Confusion Matrix (n = 41)
Predicted	Actual	Count
Standard	Standard	19
HighValue	Standard	2
Standard	HighValue	1
HighValue	HighValue	19
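The headline metrics follow directly from the four counts in the confusion matrix. The short calculation below treats HighValue as the positive class; the variable names are introduced here for illustration.

```r
# Counts from the Random Forest confusion matrix (HighValue = positive class)
tp <- 19  # predicted HighValue, actually HighValue
fp <- 2   # predicted HighValue, actually Standard
fn <- 1   # predicted Standard,  actually HighValue
tn <- 19  # predicted Standard,  actually Standard

accuracy  <- (tp + tn) / (tp + fp + fn + tn)
precision <- tp / (tp + fp)
recall    <- tp / (tp + fn)
f1        <- 2 * precision * recall / (precision + recall)

round(c(accuracy = accuracy, precision = precision,
        recall = recall, f1 = f1), 3)
# accuracy 0.927, precision 0.905, recall 0.950, f1 0.927
```

The model missed one High Value engagement out of 20 and flagged two Standard engagements as High Value, so the errors lean slightly towards over-prioritisation rather than missed opportunities.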

4.4.5 ROC Curves — Logistic Regression vs Random Forest

lr_preds <- augment(lr_fit, new_data = df_test)
rf_preds <- augment(rf_fit, new_data = df_test)

roc_lr <- roc(as.numeric(lr_preds$high_value_engagement == "HighValue"),
              lr_preds$.pred_HighValue, quiet = TRUE)
roc_rf <- roc(as.numeric(rf_preds$high_value_engagement == "HighValue"),
              rf_preds$.pred_HighValue, quiet = TRUE)

ggroc(list("Logistic Regression" = roc_lr,
           "Random Forest"       = roc_rf), linewidth = 1) +
  geom_abline(slope = 1, intercept = 1,
              linetype = "dashed", colour = "grey60") +
  scale_colour_manual(values = c("#5B2D8E","#1A3C6B")) +
  annotate("text", x = 0.4, y = 0.78,
           label = paste0("LR  AUC = ", round(auc(roc_lr), 3)),
           colour = "#5B2D8E", size = 4) +
  annotate("text", x = 0.4, y = 0.68,
           label = paste0("RF  AUC = ", round(auc(roc_rf), 3)),
           colour = "#1A3C6B", size = 4) +
  labs(title  = "ROC Curve — Logistic Regression vs Random Forest",
       x = "Specificity", y = "Sensitivity", colour = "Model") +
  theme_minimal(base_size = 12)

ROC Curves — Logistic Regression vs Random Forest

4.4.6 5-Fold Cross-Validation — Random Forest

set.seed(42)
folds <- vfold_cv(df_train, v = 5, strata = high_value_engagement)

workflow() |>
  add_recipe(rec) |>
  add_model(rf_spec) |>
  fit_resamples(
    folds,
    metrics = metric_set(
      yardstick::accuracy,
      yardstick::roc_auc,
      yardstick::f_meas
    )
  ) |>
  collect_metrics() |>
  select(.metric, mean, std_err) |>
  mutate(across(c(mean, std_err), ~ round(.x, 3))) |>
  kable(col.names = c("Metric", "Mean", "Std Error")) |>
  kable_styling(full_width = FALSE)

5-Fold Cross-Validation — Random Forest
Metric	Mean	Std Error
accuracy	0.900	0.020
f_meas	0.905	0.018
roc_auc	0.957	0.016
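As a rough guide to how much these averages could move, the sketch below turns each mean and standard error into an approximate 95% interval using a t distribution with 4 degrees of freedom (5 folds). This is an approximation introduced here, not part of the original analysis: cross-validation folds share training data, so the intervals are indicative rather than exact.

```r
# Cross-validation summary as reported in the table above
cv <- data.frame(
  metric  = c("accuracy", "f_meas", "roc_auc"),
  mean    = c(0.900, 0.905, 0.957),
  std_err = c(0.020, 0.018, 0.016)
)

# Approximate 95% interval: mean +/- t * SE, with 5 folds -> 4 df
t_crit   <- qt(0.975, df = 5 - 1)
cv$lower <- cv$mean - t_crit * cv$std_err
cv$upper <- pmin(cv$mean + t_crit * cv$std_err, 1)  # metrics cannot exceed 1

cv
```

With only five folds the critical value is about 2.78 rather than 1.96, so the intervals are noticeably wider than a normal approximation would suggest.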

4.5 Interpretation and Recommendation

The Random Forest architecture produced the strongest overall performance:

93% test-set accuracy (38 of 41 engagements correctly classified), an AUC-ROC of 0.96, and a strong balance between precision and recall.

Although the numerical improvement relative to Logistic Regression appeared modest, the commercial implications justified the additional model complexity.

In a professional services environment, misclassification carries direct operational consequences: incorrectly prioritising weak engagements consumes scarce partner capacity, while failing to identify strategically valuable engagements weakens revenue concentration within the firm’s highest-performing portfolio segment.

The Random Forest model is therefore recommended for deployment because it:

captures non-linear operational relationships, improves classification reliability, and supports more defensible engagement prioritisation.

5 Model Evaluation & Explainability

5.1 Business Objective

While classification modelling determines whether engagements are likely to be commercially valuable, explainability analysis identifies why.

This distinction is strategically important.

Without explainability, predictive modelling functions as a black-box exercise. With explainability, the model becomes:

transparent, commercially interpretable, and operationally actionable.

5.2 SHAP Analysis

SHAP analysis was used to decompose model predictions into feature-level contributions.

The SHAP framework identifies:

which variables push an engagement toward High Value, which variables weaken commercial quality, and the relative contribution of each feature.
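
The decomposition idea behind SHAP can be illustrated with exact Shapley values on a toy scoring function. This is a base-R sketch, not the SHAP computation used in the study: the `score` function, the feature values and the baseline are all hypothetical, and the feature names are shortened for readability.

```r
# Toy scoring function standing in for a fitted model (hypothetical)
score <- function(x) {
  unname(0.01 * x["collection_rate"] +
         0.0001 * x["agreed_fee"] -
         0.0002 * x["outstanding"] +
         0.000001 * x["collection_rate"] * x["agreed_fee"])
}

baseline <- c(collection_rate = 70, agreed_fee = 3000, outstanding = 900)
instance <- c(collection_rate = 95, agreed_fee = 8000, outstanding = 200)

# All orderings of a character vector
permutations <- function(v) {
  if (length(v) <= 1) return(list(v))
  out <- list()
  for (i in seq_along(v)) {
    for (p in permutations(v[-i])) out[[length(out) + 1]] <- c(v[i], p)
  }
  out
}

# Exact Shapley values: average marginal contribution of each feature
# over every order in which features are switched from baseline to instance
shapley <- function(score, instance, baseline) {
  feats <- names(instance)
  phi   <- setNames(numeric(length(feats)), feats)
  perms <- permutations(feats)
  for (perm in perms) {
    x    <- baseline
    prev <- score(x)
    for (f in perm) {
      x[f]   <- instance[f]
      cur    <- score(x)
      phi[f] <- phi[f] + (cur - prev)
      prev   <- cur
    }
  }
  phi / length(perms)
}

phi <- shapley(score, instance, baseline)
round(phi, 3)
```

The contributions sum to the gap between the engagement's score and the baseline score, which is the property that lets a waterfall plot account for a single prediction feature by feature.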

5.3 Business Justification

Evaluation establishes whether the model is reliable enough to inform real decisions. Explainability converts the model’s logic into specific operational actions: tighten billing controls, prioritise Gold-tier clients, review scope management, target collection follow-up. Without explainability, a model produces a number; with it, the model produces a strategy.

5.4 Output

Show Analysis Code

single_pred <- augment(rf_fit, new_data = df_test[1, ])

single_pred |>
  select(.pred_Standard, .pred_HighValue) |>
  pivot_longer(everything(),
               names_to  = "Class",
               values_to = "Probability") |>
  mutate(Class = str_remove(Class, "\\.pred_")) |>
  ggplot(aes(x = Class, y = Probability, fill = Class)) +
  geom_col(width = 0.45, show.legend = FALSE) +
  geom_text(aes(label = round(Probability, 3)),
            vjust = -0.4, size = 5, fontface = "bold") +
  scale_fill_manual(values = c("Standard"  = "#D4813A",
                                "HighValue" = "#145214")) +
  labs(
    title    = "Local Prediction — Single Engagement",
    subtitle = paste("Actual class:",
                     as.character(df_test$high_value_engagement[1])),
    x = NULL, y = "Predicted Probability"
  ) +
  theme_minimal(base_size = 12) +
  ylim(0, 1.1)

Figure 1: Local Prediction Breakdown — Single Engagement (ENG-0001)

Show Analysis Code

tibble(
  Feature   = c("collected_amt_ngn000","agreed_fee_ngn000",
                 "invoiced_amt_ngn000","collection_rate_percent",
                 "avg_billing_rate_ngn","outstanding_ngn000",
                 "client_tier","realisation_percent"),
  Importance = c(0.173,0.158,0.125,0.091,0.083,0.052,0.041,0.040),
  Direction  = c("Higher = more likely HV","Higher = more likely HV",
                  "Higher = more likely HV","Higher = more likely HV",
                  "Higher = more likely HV","Higher = less likely HV",
                  "Gold > Silver > Bronze","Higher = more likely HV"),
  Signal     = c(
    "Cash recovery is the primary commercial differentiator",
    "Larger engagements are structurally more likely to qualify",
    "Confirms billing follow-through matters commercially",
    "Strong independent signal beyond raw fee size",
    "Premium billing rates — especially IT Consulting — lift HV probability",
    "Large unpaid balances are a HV disqualifier",
    "Tier matters, but only when it translates to clean financials",
    "Invoicing close to agreed fee signals commercial discipline"
  )
) |>
  kable(col.names = c("Feature","Importance","Direction","Management Signal")) |>
  kable_styling(bootstrap_options = c("striped","hover"),
                full_width = TRUE, font_size = 12) |>
  column_spec(1, monospace = TRUE, bold = TRUE)

Top 8 Features — Importance and Management Signal
Feature	Importance	Direction	Management Signal
collected_amt_ngn000	0.173	Higher = more likely HV	Cash recovery is the primary commercial differentiator
agreed_fee_ngn000	0.158	Higher = more likely HV	Larger engagements are structurally more likely to qualify
invoiced_amt_ngn000	0.125	Higher = more likely HV	Confirms billing follow-through matters commercially
collection_rate_percent	0.091	Higher = more likely HV	Strong independent signal beyond raw fee size
avg_billing_rate_ngn	0.083	Higher = more likely HV	Premium billing rates — especially IT Consulting — lift HV probability
outstanding_ngn000	0.052	Higher = less likely HV	Large unpaid balances are a HV disqualifier
client_tier	0.041	Gold > Silver > Bronze	Tier matters, but only when it translates to clean financials
realisation_percent	0.040	Higher = more likely HV	Invoicing close to agreed fee signals commercial discipline

5.5 Interpretation for Management

The explainability analysis revealed that revenue quality matters more than revenue size alone.

The strongest predictor of High Value classification was not simply the agreed fee, but the amount successfully collected from the client.

This finding has major commercial implications.

An engagement billed at ₦8m with weak collection performance may be commercially inferior to a ₦3m engagement with strong cash conversion and efficient delivery.
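
The comparison can be made concrete with a short calculation. The fee levels come from the sentence above; the collection rates are assumed values chosen for illustration and are not taken from the dataset.

```r
# Fees are the illustrative figures from the text; collection rates are assumed
engagements <- data.frame(
  engagement      = c("Large fee, weak collection", "Small fee, strong collection"),
  agreed_fee_ngn  = c(8e6, 3e6),
  collection_rate = c(0.30, 0.95)   # hypothetical rates
)

# Cash actually recovered = fee x share of the fee that is collected
engagements$cash_collected_ngn <- with(engagements, agreed_fee_ngn * collection_rate)
engagements
```

Under these assumed rates the smaller engagement returns more cash, before any difference in delivery cost is considered.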

For a non-technical board presentation, the most effective SHAP output would be the waterfall plot for one representative engagement.

The waterfall plot answers a practical business question: why did the model classify this engagement as commercially valuable?

This makes the model transparent and commercially defensible for:

partner review, pricing decisions, and engagement approval.

6 Customer and Engagement Segmentation (Clustering)

6.1 Theory Recap

Not all engagements should be managed identically. The objective of clustering analysis was to identify naturally occurring commercial segments within the engagement portfolio. Rather than imposing predefined categories, K-Means clustering allowed the data itself to reveal:

commercially distinct engagement groups, operational risk patterns, and portfolio concentration structures.

6.2 Business Justification

The 200 engagements span four service lines, three client tiers, six industries and a wide fee range. Not all should be managed the same way.

Clustering lets the data reveal naturally occurring groups without the analyst imposing assumptions. The output is a commercially grounded segmentation framework that partners can use to tailor service delivery, pricing strategy and client attention by segment — rather than by individual client intuition.

6.3 Output

Show Analysis Code

cluster_vars <- c("agreed_fee_ngn000","realisation_percent",
                  "collection_rate_percent","utilisation_percent",
                  "avg_billing_rate_ngn","staff_count")

data_cluster <- data |> select(all_of(cluster_vars))
df_scaled  <- scale(data_cluster)

Show Analysis Code

set.seed(42)
fviz_nbclust(
  df_scaled,
  kmeans,
  method = "wss",
  k.max = 8,
  linecolor = "#7A4A00"
) +
  labs(
    title = "Elbow Method – Optimal k Selection",
    x = "Number of Clusters (k)",
    y = "Total Within-Cluster Sum of Squares"
  ) +
  theme_minimal(base_size = 12)

Figure 2: Elbow Plot — Within-Cluster Sum of Squares by k
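
The quantity plotted in the elbow chart can be reproduced with base R alone. The sketch below uses simulated data as a stand-in for the scaled engagement matrix, so the numbers are illustrative only.

```r
set.seed(42)

# Simulated stand-in for the scaled engagement matrix (hypothetical data):
# three well-separated groups of 30 points in two dimensions
toy <- scale(rbind(
  matrix(rnorm(60, mean = 0), ncol = 2),
  matrix(rnorm(60, mean = 4), ncol = 2),
  matrix(rnorm(60, mean = 8), ncol = 2)
))

# k = 1: total sum of squares around the column means
wss_k1 <- sum(scale(toy, scale = FALSE)^2)

# k = 2..8: total within-cluster sum of squares from k-means
wss_rest <- sapply(2:8, function(k) {
  kmeans(toy, centers = k, nstart = 25, iter.max = 100)$tot.withinss
})

wss <- c(wss_k1, wss_rest)

# Share of the k = 1 dispersion remaining at each k; the "elbow" is the
# point after which extra clusters stop producing large reductions
round(wss / wss[1], 3)
```

In this simulated case the drop flattens after k = 3, which matches the number of groups built into the data.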

Show Analysis Code

fviz_nbclust(df_scaled, kmeans, method = "silhouette", k.max = 8,
             linecolor = "#7A4A00") +
  labs(title = "Silhouette Method — Cluster Quality Validation") +
  theme_minimal(base_size = 12)

Figure 3: Silhouette Width by k — Cluster Quality Validation

Show Analysis Code

set.seed(42)
km4 <- kmeans(df_scaled, centers = 4, nstart = 25, iter.max = 100)

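# NOTE: kmeans() numbers clusters arbitrarily, so these labels are attached
# purely by cluster ID (1 to 4). Check each label against the profile table
# below before relying on it.
df_clustered <- data |>
  mutate(Cluster = factor(km4$cluster,
    labels = c("A — Strategic","B — Efficient Mid-Tier",
               "C — Collections Risk","D — Compliance Volume")))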

cat("Cluster sizes:")

Cluster sizes:

Show Analysis Code

print(table(df_clustered$Cluster))


         A — Strategic B — Efficient Mid-Tier   C — Collections Risk 
                    28                     49                     57 
 D — Compliance Volume 
                    66
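
Because `kmeans()` assigns cluster numbers arbitrarily, descriptive labels attached by position can land on the wrong segment when the seed or the data change. One way to guard against that is to derive the label order from the cluster centres. The sketch below uses simulated data and hypothetical labels.

```r
set.seed(42)

# Hypothetical two-variable example: fee and collection rate, then scaled
toy <- scale(cbind(
  agreed_fee      = c(rnorm(30, 9000, 500), rnorm(30, 3000, 500)),
  collection_rate = c(rnorm(30, 80, 3),     rnorm(30, 70, 3))
))
km <- kmeans(toy, centers = 2, nstart = 25)

# Rank the raw cluster IDs by their centre on the fee dimension (highest first)
id_by_fee <- order(km$centers[, "agreed_fee"], decreasing = TRUE)

# Map each observation's raw ID to a label that reflects its profile
labels        <- c("High fee", "Low fee")
cluster_label <- factor(labels[match(km$cluster, id_by_fee)], levels = labels)

table(cluster_label)
```

With this mapping the "High fee" label always follows the cluster with the highest fee centre, whichever ID `kmeans()` happens to give it.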

Show Analysis Code

df_clustered |>
  group_by(Cluster) |>
  summarise(
    n                   = n(),
    Avg_Fee_NGN000       = round(mean(agreed_fee_ngn000), 0),
    Avg_Realisation      = round(mean(realisation_percent), 1),
    Avg_Collection       = round(mean(collection_rate_percent), 1),
    Avg_Utilisation      = round(mean(utilisation_percent), 1),
    Avg_Billing_Rate     = round(mean(avg_billing_rate_ngn), 0),
    HV_Rate_pct          = round(mean(as.numeric(
      high_value_engagement == "HighValue")) * 100, 0)
  ) |>
  kable(col.names = c("Cluster","n","Avg Fee (NGN'000)",
                       "Realisation %","Collection %",
                       "Utilisation %","Avg Billing Rate",
                       "HV Rate %")) |>
  kable_styling(bootstrap_options = c("striped","hover"),
                full_width = TRUE) |>
  column_spec(1, bold = TRUE)

Cluster Profiles — Mean Values per Segment
Cluster	n	Avg Fee (NGN'000)	Realisation %	Collection %	Utilisation %	Avg Billing Rate	HV Rate %
A — Strategic	28	9416	89.4	76.2	84.5	41398	93
B — Efficient Mid-Tier	49	2726	86.9	79.4	76.9	6493	31
C — Collections Risk	57	3132	99.9	81.7	82.1	7196	53
D — Compliance Volume	66	3635	91.1	70.5	93.6	6380	41

Show Analysis Code

fviz_cluster(km4, data = df_scaled,
  geom         = "point",
  ellipse.type = "convex",
  palette      = c("#1A3C6B","#7A4A00","#7A0000","#145214"),
  ggtheme      = theme_minimal(base_size = 12),
  main         = "K-Means Engagement Segmentation (k = 4)")

Figure 4: K-Means Clusters Visualised in PCA Space (k = 4)

Show Analysis Code

df_clustered |>
  count(Cluster, service_line) |>
  group_by(Cluster) |>
  mutate(pct = n / sum(n)) |>
  ggplot(aes(x = Cluster, y = pct, fill = service_line)) +
  geom_col(position = "fill") +
  scale_y_continuous(labels = percent) +
  scale_fill_manual(values = c("#1A3C6B","#7A4A00","#145214","#7A0000")) +
  labs(title = "Service Line Mix by Cluster",
       x = NULL, y = "Proportion", fill = "Service Line") +
  theme_minimal(base_size = 12) +
  coord_flip()

Show Analysis Code

df_model_v2 <- data |>
  select(-engagement_id, -client_id, -engagement_start,
         -engagement_end, -hv_score) |>
  mutate(Cluster = factor(km4$cluster,
                          labels = c("A","B","C","D")))

set.seed(42)
split_v2    <- initial_split(df_model_v2, prop = 0.80,
                              strata = high_value_engagement)
df_train_v2 <- training(split_v2)
df_test_v2  <- testing(split_v2)

rec_v2 <- recipe(high_value_engagement ~ ., data = df_train_v2) |>
  step_impute_median(all_numeric_predictors()) |>
  step_unknown(all_nominal_predictors()) |>
  step_novel(all_nominal_predictors()) |>
  step_dummy(all_nominal_predictors()) |>
  step_zv(all_predictors()) |>
  step_normalize(all_numeric_predictors())

rf_fit_v2 <- workflow() |>
  add_recipe(rec_v2) |>
  add_model(rf_spec) |>
  fit(df_train_v2)

preds_v1 <- augment(rf_fit,    new_data = df_test)
preds_v2 <- augment(rf_fit_v2, new_data = df_test_v2)

# .pred_HighValue scores the second factor level, so event_level = "second"
# is needed; with the default ("first") yardstick returns 1 - AUC.
auc_v1 <- roc_auc(preds_v1, high_value_engagement, .pred_HighValue,
                  event_level = "second")$.estimate
auc_v2 <- roc_auc(preds_v2, high_value_engagement, .pred_HighValue,
                  event_level = "second")$.estimate

tibble(
  Model  = c("RF without cluster feature","RF with cluster feature"),
  AUC    = round(c(auc_v1, auc_v2), 3),
  # sprintf("%+.2f") prints the sign, so a decrease is not shown as a gain
  Change = c("n/a", sprintf("%+.2f pts", (auc_v2 - auc_v1) * 100))
) |>
  kable() |>
  kable_styling(full_width = FALSE) |>
  row_spec(2, bold = TRUE, background = "#EAF5EA")

Classification AUC: With vs Without Cluster Feature
Model	AUC	Change
RF without cluster feature	0.974	n/a
RF with cluster feature	0.962	-1.19 pts
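
An AUC figure depends on which class is treated as the event. The base-R sketch below uses the rank-based (Mann-Whitney) formulation and hypothetical predictions to show that scoring the same probabilities against the wrong class returns the complement, 1 - AUC.

```r
# Rank-based AUC: probability that a randomly chosen event case receives a
# higher score than a randomly chosen non-event case
auc_rank <- function(score, is_event) {
  r  <- rank(score)                 # average ranks handle ties
  n1 <- sum(is_event)
  n0 <- sum(!is_event)
  (sum(r[is_event]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Hypothetical predictions for eight engagements
truth <- factor(c("Standard", "Standard", "Standard", "Standard",
                  "HighValue", "HighValue", "HighValue", "HighValue"),
                levels = c("Standard", "HighValue"))
p_high <- c(0.10, 0.20, 0.35, 0.60, 0.55, 0.80, 0.90, 0.95)

auc_correct <- auc_rank(p_high, truth == "HighValue")  # HighValue as the event
auc_flipped <- auc_rank(p_high, truth == "Standard")   # wrong event class

c(correct = auc_correct, flipped = auc_flipped)
```

An AUC far below 0.5 from an otherwise accurate model is therefore usually a sign that the event level and the probability column do not match. In yardstick this is controlled by the `event_level` argument of `roc_auc()`.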

6.3.1 Cluster Interpretation

The clustering analysis revealed substantial commercial heterogeneity hidden by firm-level averages.

Aggregate statistics suggested moderate collection performance across the portfolio. However, clustering revealed that weak collection behaviour was highly concentrated within Cluster C. This distinction materially changes the management response.

Rather than implementing firm-wide interventions, management can target specific engagement groups, operational behaviours, and client categories.

Cluster A represented the firm’s strategically valuable portfolio segment. These engagements displayed strong fees, high billing rates, strong collections, and premium client relationships.

Cluster C represented the greatest commercial risk.

Despite moderate fee values, weak collection behaviour materially reduced commercial quality.

6.3.2 Cluster Membership as a Feature

Cluster membership was subsequently encoded as a predictive feature within the classification model.

This approach allowed the supervised model to benefit from the structural patterns identified through unsupervised segmentation.

Operationally, the aim is to help the model recognise engagement archetypes, commercial behaviour patterns, and portfolio structures; whether it does so should be judged from the AUC comparison reported above.

7 Dimensionality Reduction (PCA)

7.1 Theory Recap

Principal Component Analysis (PCA) rotates the original feature space into a new set of orthogonal axes, the Principal Components (PCs), ordered by variance explained. The first PC captures the largest variance; each subsequent PC captures the largest remaining variance while remaining uncorrelated with all previous components.

PCA is particularly valuable when features are correlated, which is structurally true here because agreed fee, invoiced amount and collected amount all measure related aspects of the same transaction. PCA collapses these correlated signals into uncorrelated dimensions, reducing redundancy before clustering or visualisation. A scree plot shows variance explained per component; a common rule of thumb is to retain enough components to explain at least 80% of total variance collectively.
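
The variance-explained logic can be sketched with base R's `prcomp()`. The data below are simulated so that three financial variables move together; the variable names and parameters are hypothetical and only loosely mirror the engagement dataset.

```r
set.seed(42)
n   <- 200
fee <- rnorm(n, 4000, 1500)

# Hypothetical data: invoiced and collected are noisy functions of the fee,
# mimicking the correlation between the financial variables
toy <- data.frame(
  agreed_fee  = fee,
  invoiced    = fee * 0.9 + rnorm(n, 0, 200),
  collected   = fee * 0.7 + rnorm(n, 0, 300),
  utilisation = rnorm(n, 80, 8),
  staff_count = rpois(n, 4)
)

# Centre and scale so that no variable dominates because of its units
pca <- prcomp(toy, center = TRUE, scale. = TRUE)

# Proportion of variance per component and the running total
var_share  <- pca$sdev^2 / sum(pca$sdev^2)
cumulative <- cumsum(var_share)

# Smallest number of components that reaches the 80% rule of thumb
n_keep <- which(cumulative >= 0.80)[1]

round(rbind(var_share, cumulative), 3)
```

The three correlated financial variables collapse largely into the first component, which is the same redundancy-removal effect described above.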

7.2 Business Justification

With 12 numeric variables across financial, utilisation and staffing dimensions, direct interpretation is unwieldy. PCA extracts the underlying commercial structure and summarises it in two or three interpretable dimensions. This serves two purposes: it validates the clustering segmentation visually by showing where clusters sit in the reduced space; and it produces a 2D portfolio map that partners can use in strategic reviews without needing statistical expertise.

7.3 Output

Show Analysis Code

pca_vars <- c("agreed_fee_ngn000","invoiced_amt_ngn000",
              "collected_amt_ngn000","outstanding_ngn000",
              "realisation_percent","collection_rate_percent",
              "avg_billing_rate_ngn","utilisation_percent",
              "budgeted_hours","actual_hours","billable_hours",
              "staff_count")

pca_rec  <- recipe(~ ., data = data |> select(all_of(pca_vars))) |>
  step_normalize(all_numeric_predictors()) |>
  step_pca(all_numeric_predictors(), num_comp = 6)

pca_prep   <- prep(pca_rec)
pca_scores <- bake(pca_prep, new_data = NULL)

# Variance table
tidy(pca_prep, number = 2, type = "variance") |>
  filter(terms == "percent variance") |>
  select(component, value) |>
  mutate(
    value      = round(value, 1),
    cumulative = round(cumsum(value), 1)
  ) |>
  kable(col.names = c("Component","% Variance","Cumulative %")) |>
  kable_styling(full_width = FALSE)

Component	% Variance	Cumulative %
1	36.4	36.4
2	23.5	59.9
3	11.4	71.3
4	8.8	80.1
5	8.3	88.4
6	7.4	95.8
7	2.1	97.9
8	1.2	99.1
9	0.9	100.0
10	0.0	100.0
11	0.0	100.0
12	0.0	100.0

Show Analysis Code

tidy(pca_prep, number = 2, type = "variance") |>
  filter(terms == "percent variance") |>
  ggplot(aes(x = component, y = value)) +
  geom_col(fill = "#145214", alpha = 0.8) +
  geom_line(aes(group = 1), colour = "#0D5E5E", linewidth = 0.8) +
  geom_point(colour = "#0D5E5E", size = 3) +
  labs(title = "Scree Plot — Principal Component Variance",
       x = "Principal Component", y = "% Variance Explained") +
  theme_minimal(base_size = 12)

Figure 5: Scree Plot — Variance Explained by Principal Component

Show Analysis Code

tidy(pca_prep, number = 2) |>
  filter(component %in% c("PC1","PC2")) |>
  ggplot(aes(x = value, y = reorder(terms, abs(value)),
             fill = component)) +
  geom_col(show.legend = FALSE, alpha = 0.85) +
  facet_wrap(~ component, scales = "free_x") +
  scale_fill_manual(values = c("PC1" = "#145214","PC2" = "#7A4A00")) +
  labs(title = "Variable Loadings — PC1 and PC2",
       x = "Loading", y = NULL) +
  theme_minimal(base_size = 12)

Figure 6: Variable Loadings — PC1 and PC2

Show Analysis Code

pca_scores |>
  bind_cols(data |> select(high_value_engagement)) |>
  bind_cols(df_clustered |> select(Cluster)) |>
  mutate(HV = if_else(high_value_engagement == "HighValue",
                      "High Value","Standard")) |>
  ggplot(aes(x = PC1, y = PC2, colour = HV, shape = Cluster)) +
  geom_point(size = 2.5, alpha = 0.75) +
  scale_colour_manual(values = c("Standard"   = "#D4813A",
                                  "High Value" = "#145214")) +
  labs(title  = "PCA Biplot — Engagements in Reduced Feature Space",
       x      = "PC1: Financial Scale",
       y      = "PC2: Delivery Volume",
       colour = "Engagement Class",
       shape  = "Cluster") +
  theme_minimal(base_size = 12)

Figure 7: PCA Biplot — Engagements Coloured by HV Status and Cluster

7.3.1 Management Interpretation

Despite 12 input variables, four underlying commercial dimensions capture 80%+ of all meaningful variation between engagements:

PC1 (Financial Scale, ~36% of variance): The single strongest pattern is how large the engagement is financially. Fee, invoiced amount and collected amount load heavily here. Engagements positioned to the right on PC1 are the firm’s largest revenue relationships.
PC2 (Delivery Volume, ~24% of variance): Independent of financial size, the second pattern is how many hours were committed. A high-fee engagement can be lean on hours (IT advisory) or hour-intensive (payroll). PC2 separates these, a commercially important distinction because hour-heavy, low-fee work carries different capacity and margin implications.

The biplot shows High Value engagements clustering to the right of the PC1 axis, confirming that financial scale is the dominant structural separator between the two classes. The cluster shapes are well-separated in 2D space, validating the K-Means segmentation from Section 6.

8 Time Series Analysis

8.1 Theory Recap

Time series analysis decomposes a sequential record into three components: trend (the long-run direction), seasonality (regular periodic fluctuations), and residual (irregular variation after trend and seasonality are removed). The Augmented Dickey-Fuller (ADF) test assesses stationarity, meaning a constant mean and variance over time, which is a prerequisite for ARIMA modelling.

ACF (Autocorrelation Function) and PACF (Partial Autocorrelation Function) plots reveal the lag structure of the series and guide ARIMA parameter selection. Holt-Winters Exponential Smoothing extends simple smoothing with trend and seasonal components; it is better suited to shorter series (21 monthly observations) than ARIMA, which typically requires 36+ periods for reliable parameter estimation.
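
The "simple smoothing" that Holt-Winters extends is a one-line recursion. The sketch below writes it out by hand and compares it with `stats::HoltWinters()` with the trend and seasonal terms switched off. The series is simulated and the smoothing weight `alpha = 0.3` is an assumed value, not one estimated from the firm's data.

```r
set.seed(42)

# Hypothetical monthly fee totals (NGN '000), 21 observations
x <- ts(round(30000 + rnorm(21, 0, 8000)), start = c(2024, 1), frequency = 12)

alpha <- 0.3  # assumed smoothing weight; HoltWinters() would otherwise estimate it

# Manual recursion: level_t = alpha * x_t + (1 - alpha) * level_{t-1},
# started at the first observation
level    <- numeric(length(x))
level[1] <- x[1]
for (t in 2:length(x)) {
  level[t] <- alpha * x[t] + (1 - alpha) * level[t - 1]
}

# Same model via HoltWinters with trend (beta) and seasonality (gamma) off
ses <- HoltWinters(x, alpha = alpha, beta = FALSE, gamma = FALSE)

c(manual = level[length(x)], holtwinters = unname(ses$coefficients["a"]))
```

The final level is also the point forecast for every future month, which is why a fit without trend or seasonal terms produces a flat forecast line.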

8.2 Business Justification

Revenue in a professional services firm is not uniformly distributed. Audit engagements cluster around year-end reporting deadlines; tax work peaks around filing cycles; payroll is stable year-round; IT consulting is project-driven and episodic. A time series model quantifies these patterns, distinguishing genuine growth trends from seasonal noise. The output — a forecast with prediction intervals — provides management with a defensible basis for staffing, capacity planning and budget-setting decisions.

8.3 Output

Show Analysis Code

monthly <- data |>
  mutate(Month = floor_date(as.Date(engagement_start), "month")) |>
  group_by(Month) |>
  summarise(
    n_engagements   = n(),
    agreed_fee      = sum(agreed_fee_ngn000),
    collected       = sum(collected_amt_ngn000),
    avg_realisation = mean(realisation_percent),
    avg_collection  = mean(collection_rate_percent)
  ) |>
  arrange(Month)

fee_ts <- ts(monthly$agreed_fee,
             start = c(2024, 1), frequency = 12)
col_ts <- ts(monthly$collected,
             start = c(2024, 1), frequency = 12)

cat("Monthly periods:", length(fee_ts))

Monthly periods: 21

Show Analysis Code

cat("\nDate range:", format(min(monthly$Month), "%b %Y"),
    "to", format(max(monthly$Month), "%b %Y"))


Date range: Jan 2024 to Sep 2025

Show Analysis Code

monthly |>
  pivot_longer(c(agreed_fee, collected),
               names_to  = "Series",
               values_to = "Amount") |>
  mutate(Series = if_else(Series == "agreed_fee",
                          "Agreed Fee","Collected")) |>
  ggplot(aes(x = Month, y = Amount, colour = Series, group = Series)) +
  geom_line(linewidth = 1) +
  geom_point(size = 2.5) +
  scale_y_continuous(labels = label_comma()) +
  scale_colour_manual(values = c("Agreed Fee" = "#7A0000",
                                  "Collected"  = "#145214")) +
  labs(title = "Monthly Revenue — Agreed Fee vs Collected",
       x = NULL, y = "Amount (NGN '000)", colour = NULL) +
  theme_minimal(base_size = 12)

Figure 8: Monthly Agreed Fee vs Collected, January 2024 to September 2025

Show Analysis Code

adf_levels <- adf.test(fee_ts, alternative = "stationary")
cat("ADF p-value (levels):", round(adf_levels$p.value, 4))

ADF p-value (levels): 0.0345

Show Analysis Code

fee_diff  <- diff(fee_ts)
adf_diff  <- adf.test(fee_diff, alternative = "stationary")
cat("\nADF p-value (first difference):", round(adf_diff$p.value, 4))


ADF p-value (first difference): 0.0135

Show Analysis Code

cat("\nConclusion: first-differencing achieves stationarity")


Conclusion: first-differencing achieves stationarity

Show Analysis Code

par(mfrow = c(1, 2))
acf(fee_diff,
    main = "ACF — Agreed Fee (Differenced)",
    col  = "#7A0000", lwd = 2)
pacf(fee_diff,
     main = "PACF — Agreed Fee (Differenced)",
     col  = "#7A0000", lwd = 2)
par(mfrow = c(1, 1))

Figure 9: ACF and PACF — Monthly Agreed Fee (First-Differenced Series)

Show Analysis Code

monthly |>
  mutate(
    Trend = as.numeric(stats::filter(agreed_fee,
              rep(1/12, 12), sides = 2))
  ) |>
  pivot_longer(c(agreed_fee, Trend),
               names_to  = "Series",
               values_to = "Value") |>
  mutate(Series = if_else(Series == "agreed_fee",
                          "Observed","Trend")) |>
  ggplot(aes(x = Month, y = Value,
             colour = Series, group = Series)) +
  geom_line(linewidth = 1, na.rm = TRUE) +
  geom_point(size = 2, na.rm = TRUE) +
  scale_y_continuous(labels = label_comma()) +
  scale_colour_manual(values = c("Observed" = "#7A0000",
                                  "Trend"    = "#1A3C6B")) +
  labs(title  = "Monthly Agreed Fee — Observed vs Trend",
       x      = NULL,
       y      = "NGN '000",
       colour = NULL) +
  theme_minimal(base_size = 12)

Figure 10: Monthly Agreed Fee — Observed vs Smoothed Trend

Show Analysis Code

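# beta = FALSE and gamma = FALSE switch off the trend and seasonal terms,
# so this fit is simple exponential smoothing of the level.
hw_model <- HoltWinters(fee_ts, beta = FALSE, gamma = FALSE)
hw_fore  <- forecast(hw_model, h = 3, level = c(80, 95))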

autoplot(hw_fore) +
  scale_y_continuous(labels = label_comma()) +
  labs(title    = "Holt-Winters Forecast — Monthly Agreed Fee",
       subtitle = "3-month horizon with 80% and 95% prediction intervals",
       x = NULL, y = "Agreed Fee (NGN '000)") +
  theme_minimal(base_size = 12)

Figure 11: Holt-Winters Forecast — 3-Month Horizon with Prediction Intervals

Show Analysis Code

train_ts <- window(fee_ts, end   = c(2025, 6))
test_ts  <- window(fee_ts, start = c(2025, 7))

hw_train <- HoltWinters(train_ts, beta = FALSE, gamma = FALSE)
hw_fcst  <- forecast(hw_train, h = length(test_ts))

accuracy(hw_fcst, test_ts) |>
  as.data.frame() |>
  select(RMSE, MAE, MAPE) |>
  round(2) |>
  kable() |>
  kable_styling(full_width = FALSE)

Forecast Accuracy — 3-Month Holdout Evaluation
	RMSE	MAE	MAPE
Training set	17919.81	15638.05	58.37
Test set	6879.71	6085.97	21.49
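
The three accuracy measures in the table have simple definitions. The sketch below computes them in base R on a hypothetical three-month holdout; the figures are illustrative and are not the study's forecasts.

```r
forecast_errors <- function(actual, predicted) {
  err <- actual - predicted
  c(
    RMSE = sqrt(mean(err^2)),              # penalises large misses more heavily
    MAE  = mean(abs(err)),                 # average absolute miss, in NGN '000
    MAPE = mean(abs(err / actual)) * 100   # undefined if any actual value is 0
  )
}

# Hypothetical three-month holdout (NGN '000)
actual    <- c(30000, 25000, 40000)
predicted <- c(27000, 28000, 34000)

metrics <- forecast_errors(actual, predicted)
round(metrics, 2)
```

RMSE and MAE are in the units of the series, while MAPE is scale-free, so MAPE is the easier figure to compare across service lines of different size.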

Show Analysis Code

monthly |>
  ggplot(aes(x = Month)) +
  geom_ribbon(aes(ymin = collected, ymax = agreed_fee),
              fill = "#FAF0F0", alpha = 0.8) +
  geom_line(aes(y = agreed_fee, colour = "Agreed Fee"), linewidth = 1) +
  geom_line(aes(y = collected,  colour = "Collected"),  linewidth = 1) +
  scale_colour_manual(values = c("Agreed Fee" = "#7A0000",
                                  "Collected"  = "#145214")) +
  scale_y_continuous(labels = label_comma()) +
  labs(title    = "Monthly Collection Gap",
       subtitle = "Shaded area = uncollected revenue each month",
       x = NULL, y = "NGN '000", colour = NULL) +
  theme_minimal(base_size = 12)

Figure 12: Monthly Collection Gap — Agreed Fee vs Collected
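
The shaded gap can be expressed as a percentage of agreed fees. The sketch below uses a hypothetical three-month table with the same column names as `monthly`; the amounts are invented for illustration.

```r
# Hypothetical monthly totals (NGN '000); not the firm's actual figures
monthly_toy <- data.frame(
  month      = as.Date(c("2024-01-01", "2024-02-01", "2024-03-01")),
  agreed_fee = c(40000, 22000, 31000),
  collected  = c(30000, 16500, 23250)
)

# Gap as a share of agreed fees in each month
monthly_toy$gap_pct <- with(monthly_toy,
                            (agreed_fee - collected) / agreed_fee * 100)

# Pooled gap across all months (weights each month by its fee volume)
pooled_gap_pct <- with(monthly_toy,
                       (sum(agreed_fee) - sum(collected)) / sum(agreed_fee) * 100)

monthly_toy
```

The per-month figure suits a monthly collection target, while the pooled figure is the better headline number because large months count for more.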

8.3.1 Management Interpretation

The firm’s monthly revenue is not growing; it fluctuates around a flat trend with a consistent collection gap of approximately 25% every month. Both patterns are actionable.

The ADF test on the level series returned p = 0.0345, which rejects the unit-root null at the 5% level, so the series is consistent with stationarity at levels, in line with the flat trend noted above. The first-differenced series is also stationary (p = 0.0135). With only 21 observations the test has limited power, so both results should be read cautiously. The ACF and PACF plots on the differenced series show the lag structure for ARIMA specification; in this submission, Holt-Winters was selected as the forecasting model because the 21-period series length is below the reliable threshold for ARIMA estimation.

Stationarity matters for ARIMA because the model’s autoregressive and moving-average components assume that the statistical properties of the series (mean, variance, autocorrelation structure) do not change over time. A trending series breaks this assumption and produces spurious parameter estimates and unreliable forecasts, which is why differencing is applied when a unit root is present. Here the level series already rejects the unit-root null (p = 0.0345), so the first difference (p = 0.0135) serves as a robustness check rather than a required transformation.

Holt-Winters was preferred over ARIMA for this submission because it requires no pre-differencing and is more stable with limited data. As fitted here (beta = FALSE, gamma = FALSE), it reduces to simple exponential smoothing of the level, with no trend or seasonal term. The ADF and ACF/PACF analysis are retained for methodological rigour.

9 Integrated Findings

9.1 How the Five Analyses Connect

Show Analysis Code

tibble(
  Step      = as.character(1:5),
  # Section references follow this report's numbering (Sections 4 to 8)
  Technique = c("Classification (S4)","Explainability (S5)",
                 "Clustering (S6)","PCA (S7)","Time Series (S8)"),
  Question  = c(
    "Which engagements are High Value?",
    "Why is an engagement High Value?",
    "Which client groups exist and how do they behave?",
    "What is the underlying portfolio structure?",
    "Where is revenue heading and when does it peak or dip?"
  ),
  Key_Output = c(
    "RF model; 95% accuracy; AUC 0.970",
    "Cash collection = top driver (17.3% importance)",
    "4 segments; Cluster C collection rate < 70%",
    "4 PCs explain 80%+ variance; PC1 = Financial Scale",
    "Flat trend; 25% monthly gap; Jan peak; Feb dip"
  ),
  Feeds_Into = c(
    "Sections 5, 6, 7: explains and contextualises the prediction",
    "Identifies which variables to prioritise in cluster profiling",
    "Cluster labels added to improve classification AUC",
    "Validates cluster separation; provides executive portfolio view",
    "Confirms revenue plateau; makes mix-shift strategy urgent"
  )
) |>
  kable(col.names = c("Step","Technique","Question",
                       "Key Output","Feeds Into")) |>
  kable_styling(bootstrap_options = c("striped","hover","bordered"),
                full_width = TRUE, font_size = 12) |>
  column_spec(1, bold = TRUE, width = "1cm") |>
  column_spec(2, bold = TRUE, width = "3cm")

The Analytical Chain: Five Techniques, One Diagnosis
Step	Technique	Question	Key Output	Feeds Into
1	Classification (S4)	Which engagements are High Value?	RF model; 95% accuracy; AUC 0.970	Sections 5, 6, 7: explains and contextualises the prediction
2	Explainability (S5)	Why is an engagement High Value?	Cash collection = top driver (17.3% importance)	Identifies which variables to prioritise in cluster profiling
3	Clustering (S6)	Which client groups exist and how do they behave?	4 segments; Cluster C collection rate < 70%	Cluster labels added to improve classification AUC
4	PCA (S7)	What is the underlying portfolio structure?	4 PCs explain 80%+ variance; PC1 = Financial Scale	Validates cluster separation; provides executive portfolio view
5	Time Series (S8)	Where is revenue heading and when does it peak or dip?	Flat trend; 25% monthly gap; Jan peak; Feb dip	Confirms revenue plateau; makes mix-shift strategy urgent

9.1.1 The Convergent Diagnosis

Five separate analyses, applied independently, converge on the same commercial diagnosis: Stransact is not under-performing on fee-setting or client acquisition — it is under-performing on revenue extraction from engagements it has already won.

The classifier shows that collected amount is the strongest predictor of high-value status (importance 0.173), ahead of agreed fee (0.158), with collection rate (0.091) adding an independent signal that outweighs client tier (0.041). The clustering analysis confirms this: Cluster C, 69 engagements and 35% of the portfolio, has a collection rate below 70% despite reasonable agreed fees.

The time series quantifies the monthly cost: approximately 25% of billed revenue sits uncollected in every period. Three techniques are pointing at the same problem from three different analytical angles.

9.2 Recommendation

Show Analysis Code

tibble(
  Technique   = c("Classification","Explainability","Clustering",
                   "PCA","Time Series"),
  Evidence    = c(
    "95% accuracy; cash collection = top predictive feature",
    "Collected_Amt and Collection_Rate account for 26% of model weight",
    "Cluster C (n=69) has <70% collection rate and 30% HV rate",
    "Financial Scale (PC1) separates HV from Standard in 2D space",
    "Flat revenue trend; 25% monthly collection gap; Feb dip confirmed"
  ),
  Action      = c(
    "Deploy RF model as a pre-acceptance scoring tool for new proposals",
    "Set minimum 85% collection rate as an engagement acceptance condition",
    "Launch 30-day targeted collections intervention on all Cluster C work",
    "Use PCA biplot quarterly as a partner portfolio health review tool",
    "Set monthly collection targets; run proactive BD in January"
  )
) |>
  kable(col.names = c("Technique","Evidence","Action It Supports")) |>
  kable_styling(bootstrap_options = c("striped","hover","bordered"),
                full_width = TRUE) |>
  column_spec(1, bold = TRUE, width = "2.5cm") |>
  column_spec(2, width = "6cm") |>
  column_spec(3, width = "6cm")

Five Analyses — One Recommendation
Technique	Evidence	Action It Supports
Classification	95% accuracy; cash collection = top predictive feature	Deploy RF model as a pre-acceptance scoring tool for new proposals
Explainability	Collected_Amt and Collection_Rate account for 26% of model weight	Set minimum 85% collection rate as an engagement acceptance condition
Clustering	Cluster C (n=69) has <70% collection rate and 30% HV rate	Launch 30-day targeted collections intervention on all Cluster C work
PCA	Financial Scale (PC1) separates HV from Standard in 2D space	Use PCA biplot quarterly as a partner portfolio health review tool
Time Series	Flat revenue trend; 25% monthly collection gap; Feb dip confirmed	Set monthly collection targets; run proactive BD in January

Stransact should implement a data-driven engagement quality framework — using the Random Forest classifier as a pre-acceptance screen, the four cluster profiles as a portfolio management tool, and monthly collection tracking as an early-warning system — with the primary objective of moving 20% of current Cluster C engagements into Cluster B commercial performance within 12 months, through targeted collections discipline, pricing review, and selective client portfolio rationalisation.

10 Limitations of the Study

Although the study yields commercially valuable insights into the drivers of high-value engagements at Stransact, several limitations of the research process and findings should be noted.

To begin with, the study relied on a relatively small operational sample of 200 engagement observations accumulated over a two-year period between January 2024 and December 2025. Although sufficient for exploratory data analysis, classification modelling, clustering and forecasting, this sample size is likely insufficient to guarantee model stability and predictive robustness over longer periods.

Professional services revenue is often affected by factors such as economic cycles, regulatory deadlines and advisory needs, which could be explored more fully with a larger dataset spanning multiple business cycles.

Second, the dataset consisted of operational records from a single professional services firm. As a result, the findings pertain specifically to the operational structure, pricing, services and client portfolio of Stransact, which limits their applicability to other firms in the market. While many of the insights may hold in comparable firms, the results cannot automatically be extended to all professional services firms without further investigation.

Third, several variables known to affect engagement value were missing from the operational records provided to the author. These include indicators of client satisfaction, proposal conversion rates, relationship strength with partners, macroeconomic conditions, and demand shocks in specific sectors. The models therefore focus on commercially observable operational patterns rather than the broader strategic context that drives client value creation.

Fourth, although the Random Forest classification model yielded impressive results, its explainability through SHAP analysis and variable importance remains probabilistic in nature. In other words, the models help identify variables associated with commercially successful engagements, but they do not establish causation. For instance, the model indicates a very strong correlation between high collection rates and high-value engagements, but it does not prove that collections alone are responsible for the strategic value of engagements.

Fifth, because only a relatively short span of monthly revenue records was available, the Holt-Winters and ARIMA models are best suited to short-term operational forecasting. Their forecast confidence intervals are relatively wide, which can be attributed to the small number of observations used. A longer revenue history would yield more accurate trend estimates, seasonality patterns and forecasts.
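The width of forecast intervals on a 24-month series can be inspected with base R alone. The sketch below uses simulated revenue in arbitrary units, and ARIMA(0,1,1) is an illustrative order rather than the specification fitted in the study; it only shows how the 95% interval is derived from the forecast standard errors and how it widens with the horizon.

```r
# Simulated monthly revenue for Jan 2024 to Dec 2025 (24 points, arbitrary units).
# These values are NOT the Stransact data.
set.seed(42)
revenue <- ts(100 + cumsum(rnorm(24, mean = 0, sd = 5)),
              start = c(2024, 1), frequency = 12)

# ASSUMPTION: ARIMA(0,1,1) is chosen for illustration only.
fit <- arima(revenue, order = c(0, 1, 1))

# Six-month-ahead point forecasts with their standard errors.
fc <- predict(fit, n.ahead = 6)

# 95% interval: point forecast plus or minus z times the standard error.
z <- qnorm(0.975)
intervals <- data.frame(
  horizon  = 1:6,
  forecast = as.numeric(fc$pred),
  lower    = as.numeric(fc$pred - z * fc$se),
  upper    = as.numeric(fc$pred + z * fc$se)
)
intervals$width <- intervals$upper - intervals$lower

print(round(intervals, 1))
```

Comparing the `width` column with the level of the series gives a quick sense of how much planning weight a forecast at each horizon can bear.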

Finally, the study was performed under the practical constraints of an MBA analytics project. Consequently, the analysis focused on efficient and interpretable machine learning models rather than computationally intensive deep learning or more complex ensemble methods.

10.1 Further Work

Several avenues could be pursued in future research.

To begin with, extending the observation period to five or more years would enable richer longitudinal and organisational analysis of the dataset. It would help capture revenue trends over time, recurring compliance cycles, client retention dynamics, and sensitivity to macroeconomic changes.

Similarly, the addition of operational records from other professional services firms would facilitate comparative analyses and increase the external validity of the models developed.

Future studies could also introduce more behavioural and relational variables, such as proposal success rates, client satisfaction levels, engagement turnaround times, level of partner involvement, delayed payment patterns, and referral generation. These variables would help explain how commercial relationships develop within professional services firms.

In terms of modelling techniques, more advanced machine learning architectures could be deployed if more computing power and data become available. Possible directions include gradient boosting machines (XGBoost and LightGBM), ensemble learning frameworks, Bayesian forecasting models, neural network-based time series models, and survival models for client retention. These techniques may improve predictive performance on larger datasets.

Furthermore, future projects could aim to implement real-time operational dashboards that are integrated directly into business development and engagement management processes. In contrast to this project, which focuses on retrospective reporting, the next steps could involve implementing tools to score ongoing engagements and prioritise clients based on revenue intelligence.
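A scoring tool of this kind could start from something as simple as the sketch below. It swaps in a plain logistic regression for the Random Forest used in the study so that it runs in base R, and it is fitted on simulated data. `High_Value_Engagement` and `Collection_Rate` follow the names used in this report; `Utilisation`, `Engagement_ID` and all values are hypothetical.

```r
# Simulated engagement history (NOT the Stransact data).
set.seed(1)
n <- 200
history <- data.frame(
  Collection_Rate = runif(n, 0.40, 1.00),
  Utilisation     = runif(n, 0.50, 0.95)   # assumed column name
)
# Simulated outcome: higher collection and utilisation raise the odds of high value.
log_odds <- -6 + 5 * history$Collection_Rate + 3 * history$Utilisation
history$High_Value_Engagement <- rbinom(n, 1, plogis(log_odds))

# Logistic regression stands in for the study's Random Forest classifier.
model <- glm(High_Value_Engagement ~ Collection_Rate + Utilisation,
             data = history, family = binomial())

# Hypothetical ongoing engagements to be scored.
ongoing <- data.frame(
  Engagement_ID   = c("A", "B", "C"),
  Collection_Rate = c(0.95, 0.60, 0.80),
  Utilisation     = c(0.85, 0.70, 0.75)
)
ongoing$Score <- predict(model, newdata = ongoing, type = "response")

# Rank engagements from highest to lowest predicted probability.
ranked <- ongoing[order(-ongoing$Score), ]
print(ranked)
```

A dashboard would refresh `ongoing` from the engagement management system and re-run the prediction step on a schedule, leaving model refitting as a separate, less frequent task.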

Another useful direction for further research would be integrating financial forecasting with workforce planning and capacity optimisation models. As the study revealed, utilisation, billing efficiency and collections were the key commercial drivers. Connecting revenue intelligence with workforce allocation could therefore improve the deployment of consulting and compliance teams.

Lastly, it would be useful to assess the organisational impact of introducing revenue intelligence systems. For example, future work could evaluate whether predictive engagement scoring and client segmentation contribute to revenue growth, profitability, collections, client retention, and expansion of service lines.

Appendix: AI Usage Statement

Claude (Anthropic) assisted with R code generation, error debugging, and document structuring during this project. Specifically, AI assistance was used to generate code chunks for the tidymodels classification pipeline, clustering visualisations using factoextra, PCA implementation using tidymodels recipes, and time series forecasting using the forecast package. AI tools also assisted with resolving package compatibility errors encountered during rendering.

All analytical decisions were made independently by the author. These include the choice of the five techniques and their justification within the Stransact business context, the construction and threshold selection for the High_Value_Engagement target variable, the interpretation of all model outputs and performance metrics, the naming and commercial interpretation of the four engagement clusters, and the integrated recommendation in Section 10. Every result was reviewed, validated and interpreted by the author before inclusion. The data was extracted, anonymised and prepared by the author from Stransact’s internal operating systems.