Engineering Value
An Exploratory & Inferential Study of MEP Project Performance at CA Consultants Limited
1. Executive Summary
This study analyses the project portfolio of CA Consultants Limited, a leading MEP (Mechanical, Electrical, and Plumbing) consulting firm headquartered in Lagos, Nigeria. Drawing on 100 real project records spanning 2006–2026, extracted from the firm’s internal project management systems, I investigate what drives project value across different building types, service disciplines, and design complexity levels. Five foundational analytical techniques — exploratory data analysis, data visualisation, hypothesis testing, correlation analysis, and multiple linear regression — are applied to move from exploration to inference. Key findings reveal that the portfolio is heavily right-skewed, with a small number of high-value projects contributing disproportionately to total revenue. Commercial projects command significantly higher values than residential and industrial engagements, and design duration together with revision count are meaningful predictors of project value. The analysis yields an actionable pricing and resource-allocation framework: longer-duration, multi-discipline MEP projects consistently deliver greater value, and the firm should prioritise commercial engagements while investing in design-phase efficiency to reduce costly revision cycles.
2. Professional Disclosure
Name: Chidubem Mba Title: Senior Project Engineer & Project Manager Organisation: CA Consultants Limited, Lagos, Nigeria Sector: Construction / Building Services (MEP Engineering)
CA Consultants Limited is a Mechanical, Electrical, and Plumbing (MEP) consulting firm operating in the Nigerian construction industry. The firm provides design and construction supervision services for commercial, industrial, and residential buildings — from office towers and factories to luxury residences. As a Senior Project Engineer and Project Manager, I am responsible for managing design deliverables, coordinating with architects and contractors, overseeing project timelines, and ensuring quality of MEP installations.
Technique Justification
Exploratory Data Analysis (EDA): Understanding the composition and quality of our project data is the essential first step. As a project manager, I need to know the distribution of project sizes, identify data gaps (such as supervision-only projects lacking design records), and detect anomalous entries before any strategic conclusions can be drawn. EDA allows me to assess the health of our data and uncover patterns that inform all subsequent analyses.
Data Visualisation: Communicating project performance to firm leadership and clients requires clear, compelling visual narratives. Visualisation transforms raw project records into actionable intelligence — showing at a glance which sectors drive revenue, how portfolio composition has shifted over time, and where design complexity concentrates. Every plot in this study answers a question a principal at CA Consultants would ask in a management meeting.
Hypothesis Testing: Resource allocation decisions at CA Consultants — such as whether to pursue more commercial versus industrial work — must be grounded in statistical evidence, not intuition. Formal hypothesis testing allows me to determine whether observed differences in project value across building types are statistically significant or merely due to sampling variability. This directly informs our business development strategy.
Correlation Analysis: Understanding the relationships between project characteristics — value, design duration, revision count, and scope — is critical for project planning. If design duration correlates strongly with project value, that relationship shapes how we quote fees and allocate engineering hours. Identifying which correlations are strongest, and which are spurious, helps avoid costly planning errors.
Linear Regression: The ultimate goal is a quantitative model that explains and predicts project value based on measurable project attributes. A regression model allows me to quantify the marginal contribution of each extra design day, each additional revision, and each project type to the contract value. This directly supports fee-setting, proposal evaluation, and capacity planning at CA Consultants.
3. Data Collection & Sampling
Source and Collection Method
The dataset was extracted directly from the internal project register of CA Consultants Limited. This register is maintained by the project management team (of which I am a member) and records every project the firm undertakes — from initial proposal through to completion. Data extraction was performed manually from the company’s enterprise filing system and project tracking spreadsheets in April 2026.
Sampling Frame and Size
The dataset constitutes a census of all identifiable projects in the firm’s records from 2006 to 2026, comprising 100 project records. This is not a sample but a near-complete enumeration of the firm’s project history, subject to the availability of records in the company’s systems. Projects prior to 2006 were excluded due to incomplete digitisation of older records.
Variables Collected
| Variable | Type | Description |
|---|---|---|
Project_ID |
Identifier | Unique project code assigned by the firm |
Project_Type |
Categorical | Building type: Commercial, Industrial, or Residential |
Project_Status |
Categorical | Current status: Completed, Ongoing, or Proposal |
Project_Value_NGN |
Numeric (continuous) | Contract value in Nigerian Naira (₦) |
Design_Duration_Days |
Numeric (continuous) | Duration of the design phase in calendar days |
Revision_Count |
Numeric (discrete) | Number of design revisions issued |
Project_Year |
Date/Time (year) | Year the project was initiated |
Discipline |
Categorical | Scope of MEP services: MEP (full), E Only, M Only |
Handling of Missing Data
A total of 8 observations are missing Design_Duration_Days and 14 observations are missing Revision_Count. These are structurally missing — not data-entry errors. These projects were construction supervision engagements only: CA Consultants supervised the on-site installation of MEP systems designed by another firm. Since no design work was performed in-house, no design duration or revision count exists. This is documented and handled transparently throughout the analysis — these records are retained for portfolio-level analyses (value, type, year) and excluded only from design-specific analyses (correlation, regression).
Ethical Statement
This data was collected in my capacity as a Senior Project Engineer with authorised access to the firm’s project records. All project identifiers use internal codes (e.g., CA658, CA042/PJ027) that do not reveal client names. Financial values are contract-level figures, not client-confidential breakdowns. No personally identifiable information (PII) is included. Permission to use anonymised project data for academic purposes was obtained from CA Consultants management.
4. Data Description
4.1 Summary Statistics
View Code
df %>%
select(Project_Value, Design_Duration_Days, Revision_Count, Project_Year) %>%
pivot_longer(everything(), names_to = "Variable", values_to = "Value") %>%
group_by(Variable) %>%
summarise(
N = sum(!is.na(Value)),
Missing = sum(is.na(Value)),
Mean = mean(Value, na.rm = TRUE),
Median = median(Value, na.rm = TRUE),
SD = sd(Value, na.rm = TRUE),
Min = min(Value, na.rm = TRUE),
Max = max(Value, na.rm = TRUE),
Skewness = skewness(Value, na.rm = TRUE),
.groups = "drop"
) %>%
mutate(across(where(is.numeric), ~round(., 2))) %>%
kbl(caption = "Descriptive Statistics for Numeric Variables",
format.args = list(big.mark = ",")) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
full_width = TRUE, font_size = 13) %>%
row_spec(0, bold = TRUE, color = "#00d4ff", background = "#1c2333")| Variable | N | Missing | Mean | Median | SD | Min | Max | Skewness |
|---|---|---|---|---|---|---|---|---|
| Design_Duration_Days | 92 | 8 | 65.68 | 40 | 89.63 | 2 | 600 | 3.44 |
| Project_Value | 100 | 0 | 65,225,972.06 | 39,816,154 | 75,188,594.11 | 258,557 | 360,000,000 | 1.97 |
| Project_Year | 100 | 0 | 2,021.34 | 2,023 | 3.72 | 2,006 | 2,026 | -1.26 |
| Revision_Count | 86 | 14 | 3.67 | 4 | 1.77 | 1 | 6 | -0.22 |
4.2 Categorical Variable Distribution
View Code
cat_summary <- bind_rows(
df %>% count(Project_Type) %>% rename(Category = Project_Type) %>% mutate(Variable = "Project_Type"),
df %>% count(Project_Status) %>% rename(Category = Project_Status) %>% mutate(Variable = "Project_Status"),
df %>% count(Discipline) %>% rename(Category = Discipline) %>% mutate(Variable = "Discipline")
) %>%
group_by(Variable) %>%
mutate(Pct = round(100 * n / sum(n), 1)) %>%
ungroup() %>%
select(Variable, Category, Count = n, `%` = Pct)
cat_summary %>%
kbl(caption = "Frequency Distribution of Categorical Variables") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
full_width = TRUE, font_size = 13) %>%
row_spec(0, bold = TRUE, color = "#00d4ff", background = "#1c2333") %>%
pack_rows(index = table(cat_summary$Variable))| Variable | Category | Count | % |
|---|---|---|---|
| Discipline | |||
| Project_Type | Commercial | 64 | 64 |
| Project_Type | Industrial | 14 | 14 |
| Project_Type | Residential | 22 | 22 |
| Project_Status | |||
| Project_Status | Completed | 57 | 57 |
| Project_Status | Ongoing | 15 | 15 |
| Project_Status | Proposal | 28 | 28 |
| Project_Type | |||
| Discipline | E Only | 9 | 9 |
| Discipline | M Only | 5 | 5 |
| Discipline | MEP | 86 | 86 |
4.3 Interactive Data Explorer
View Code
df %>%
select(Project_ID, Project_Type, Project_Status, Value_Millions,
Design_Duration_Days, Revision_Count, Project_Year, Discipline) %>%
mutate(Value_Millions = round(Value_Millions, 2)) %>%
rename(`Value (₦M)` = Value_Millions, `Duration (Days)` = Design_Duration_Days,
Revisions = Revision_Count, Year = Project_Year,
Type = Project_Type, Status = Project_Status, ID = Project_ID) %>%
datatable(
caption = "Full Project Dataset — CA Consultants Limited",
filter = "top",
options = list(pageLength = 10, scrollX = TRUE,
dom = 'Bfrtip',
initComplete = JS(
"function(settings, json) {",
" $(this.api().table().header()).css({'background-color': '#1c2333', 'color': '#00d4ff'});",
"}"
)),
rownames = FALSE
)5. Exploratory Data Analysis (EDA)
“Far better an approximate answer to the right question… than an exact answer to the wrong question.” — John Tukey
Business Justification: Before building models or testing hypotheses, a project manager must understand the shape and quality of the data that drives decisions. EDA reveals whether our project records are complete, whether extreme values distort averages, and what the underlying distributions look like. At CA Consultants, this means understanding whether our reported portfolio metrics faithfully represent actual project economics.
5.1 Missing Value Analysis
View Code
missing_summary <- df %>%
summarise(across(everything(), ~sum(is.na(.)))) %>%
pivot_longer(everything(), names_to = "Variable", values_to = "Missing") %>%
mutate(
Total = nrow(df),
`% Missing` = round(100 * Missing / Total, 1),
Status = if_else(Missing == 0, "Complete", "Has Missing")
) %>%
filter(Missing > 0) %>%
arrange(desc(Missing))
missing_summary %>%
kbl(caption = "Variables with Missing Values") %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE, font_size = 13) %>%
row_spec(0, bold = TRUE, color = "#00d4ff", background = "#1c2333")| Variable | Missing | Total | % Missing | Status |
|---|---|---|---|---|
| Revision_Count | 14 | 100 | 14 | Has Missing |
| Design_Duration_Days | 8 | 100 | 8 | Has Missing |
5.2 Outlier Detection
View Code
# IQR-based outlier detection on Project Value
Q1 <- quantile(df$Project_Value, 0.25, na.rm = TRUE)
Q3 <- quantile(df$Project_Value, 0.75, na.rm = TRUE)
IQR_val <- Q3 - Q1
lower_fence <- Q1 - 1.5 * IQR_val
upper_fence <- Q3 + 1.5 * IQR_val
outliers <- df %>% filter(Project_Value > upper_fence | Project_Value < lower_fence)
cat(sprintf("**IQR Method for Project Value:**\n\n- Q1 = ₦%s | Q3 = ₦%s | IQR = ₦%s\n- Upper fence = ₦%s\n- **%d outliers detected** (%.1f%% of data) — all on the high end\n",
comma(Q1), comma(Q3), comma(IQR_val),
comma(upper_fence), nrow(outliers), 100*nrow(outliers)/nrow(df)))**IQR Method for Project Value:**
- Q1 = ₦13,793,971 | Q3 = ₦86,419,210 | IQR = ₦72,625,240
- Upper fence = ₦195,357,070
- **8 outliers detected** (8.0% of data) — all on the high end
View Code
ggplot(df, aes(x = Project_Value / 1e6)) +
geom_histogram(aes(y = after_stat(density)), bins = 30, fill = "#00d4ff", alpha = 0.6, colour = "#161b22") +
geom_density(colour = "#ff006e", linewidth = 1.2) +
geom_vline(xintercept = upper_fence / 1e6, colour = "#ffbe0b", linetype = "dashed", linewidth = 0.8) +
annotate("text", x = upper_fence / 1e6 + 15, y = Inf, label = "Outlier Fence",
colour = "#ffbe0b", vjust = 2, hjust = 0, size = 3.5) +
labs(title = "Distribution of Project Values",
subtitle = "Right-skewed with high-value outliers typical of construction portfolios",
x = "Project Value (₦ Millions)", y = "Density") +
scale_x_continuous(labels = comma_format())5.3 Distribution by Category
View Code
p1 <- ggplot(df, aes(x = Project_Type, fill = Project_Type)) +
geom_bar(width = 0.6, alpha = 0.85) +
scale_fill_manual(values = pal_type) +
labs(title = "Projects by Building Type", x = NULL, y = "Count") +
guides(fill = "none")
p2 <- ggplot(df, aes(x = Project_Status, fill = Project_Status)) +
geom_bar(width = 0.6, alpha = 0.85) +
scale_fill_manual(values = pal_status) +
labs(title = "Projects by Status", x = NULL, y = "Count") +
guides(fill = "none")
p3 <- ggplot(df, aes(x = Discipline, fill = Discipline)) +
geom_bar(width = 0.6, alpha = 0.85) +
scale_fill_manual(values = pal_disc) +
labs(title = "Projects by Discipline Scope", x = NULL, y = "Count") +
guides(fill = "none")
(p1 | p2 | p3) +
plot_annotation(
title = "Portfolio Composition at a Glance",
subtitle = "Commercial projects and full-scope MEP engagements dominate the portfolio",
theme = theme_ca_dark()
)View Code
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib
matplotlib.rcParams.update({
'figure.facecolor': '#161b22',
'axes.facecolor': '#161b22',
'axes.edgecolor': '#30363d',
'axes.labelcolor': '#c9d1d9',
'text.color': '#c9d1d9',
'xtick.color': '#8b949e',
'ytick.color': '#8b949e',
'grid.color': '#30363d',
'font.family': 'sans-serif',
'font.size': 11
})
# Load and clean data
df_py = pd.read_csv("mep_projects.csv")
df_py['Project_Value'] = df_py['Project_Value_NGN'].str.replace(r'[₦, ]', '', regex=True).astype(float)
df_py['Value_Millions'] = df_py['Project_Value'] / 1e6
# Summary statistics
print("=" * 60)============================================================
View Code
print("NUMERIC VARIABLE SUMMARY")NUMERIC VARIABLE SUMMARY
View Code
print("=" * 60)============================================================
View Code
print(df_py[['Project_Value', 'Design_Duration_Days', 'Revision_Count', 'Project_Year']].describe().round(2).to_string()) Project_Value Design_Duration_Days Revision_Count Project_Year
count 1.000000e+02 92.00 86.00 100.00
mean 6.522597e+07 65.68 3.67 2021.34
std 7.518859e+07 89.63 1.77 3.72
min 2.585570e+05 2.00 1.00 2006.00
25% 1.379397e+07 20.00 2.00 2019.00
50% 3.981615e+07 40.00 4.00 2023.00
75% 8.641921e+07 80.00 4.00 2024.00
max 3.600000e+08 600.00 6.00 2026.00
View Code
print("\n" + "=" * 60)
============================================================
View Code
print("MISSING VALUES")MISSING VALUES
View Code
print("=" * 60)============================================================
View Code
print(df_py.isnull().sum().to_string())Project_ID 0
Project_Type 0
Project_Status 0
Project_Value_NGN 0
Design_Duration_Days 8
Revision_Count 14
Project_Year 0
Discipline 0
Project_Value 0
Value_Millions 0
View Code
print("\n" + "=" * 60)
============================================================
View Code
print("CATEGORICAL DISTRIBUTIONS")CATEGORICAL DISTRIBUTIONS
View Code
print("=" * 60)============================================================
View Code
for col in ['Project_Type', 'Project_Status', 'Discipline']:
print(f"\n{col}:")
counts = df_py[col].value_counts()
for val, cnt in counts.items():
print(f" {val}: {cnt} ({100*cnt/len(df_py):.1f}%)")
Project_Type:
Commercial: 64 (64.0%)
Residential: 22 (22.0%)
Industrial: 14 (14.0%)
Project_Status:
Completed: 57 (57.0%)
Proposal: 28 (28.0%)
Ongoing: 15 (15.0%)
Discipline:
MEP: 86 (86.0%)
E Only: 9 (9.0%)
M Only: 5 (5.0%)
View Code
fig, axes = plt.subplots(1, 3, figsize=(14, 5))
# Histogram of values
axes[0].hist(df_py['Value_Millions'].dropna(), bins=25, color='#00d4ff', alpha=0.7, edgecolor='#161b22')
axes[0].set_title('Project Value Distribution', color='#f0f6fc', fontweight='bold')
axes[0].set_xlabel('Value (₦ Millions)')
axes[0].set_ylabel('Frequency')
# Bar chart: Project Type
type_counts = df_py['Project_Type'].value_counts()
colors_type = ['#00d4ff', '#ff006e', '#ffbe0b']
axes[1].barh(type_counts.index, type_counts.values, color=colors_type[:len(type_counts)], alpha=0.85)
axes[1].set_title('Projects by Type', color='#f0f6fc', fontweight='bold')
axes[1].set_xlabel('Count')
# Bar chart: Status
status_counts = df_py['Project_Status'].value_counts()
colors_status = ['#00ff87', '#ffbe0b', '#8b949e']
axes[2].barh(status_counts.index, status_counts.values, color=colors_status[:len(status_counts)], alpha=0.85)
axes[2].set_title('Projects by Status', color='#f0f6fc', fontweight='bold')
axes[2].set_xlabel('Count')
plt.tight_layout()
plt.show()6. Data Visualisation
Business Justification: Visualisation transforms CA Consultants’ project records into a narrative that leadership can act on. Each plot below answers a specific strategic question: Where does our revenue concentrate? How has our portfolio evolved? What is the relationship between design effort and project scale? Choosing the right chart type for each question — and explaining why — is itself an analytical skill.
6.1 Portfolio Value by Building Type
View Code
p <- ggplot(df, aes(x = Project_Type, y = Value_Millions, fill = Project_Type)) +
geom_violin(alpha = 0.3, colour = NA, scale = "width") +
geom_boxplot(width = 0.15, alpha = 0.6, outlier.shape = NA, colour = "#c9d1d9") +
geom_jitter(aes(colour = Project_Type), width = 0.12, alpha = 0.6, size = 2) +
scale_fill_manual(values = pal_type) +
scale_colour_manual(values = pal_type) +
labs(title = "Project Value Distribution by Building Type",
subtitle = "Commercial projects show the widest range; Residential clusters at lower values",
x = NULL, y = "Project Value (₦ Millions)") +
guides(fill = "none", colour = "none") +
scale_y_continuous(labels = comma_format())
ggplotly(p, tooltip = c("y")) %>%
layout(paper_bgcolor = '#0d1117', plot_bgcolor = '#161b22',
font = list(color = '#c9d1d9'))Distribution of Project Values by Building Type — violin + jitter reveals the full shape, not just the average
Why a violin + boxplot? A bar chart would show only the mean, hiding the skewness and multimodality that define our portfolio. The violin reveals that most commercial projects cluster below ₦100M, with a long tail of high-value engagements. This shape directly informs risk: our revenue depends on winning a few large contracts.
6.2 Portfolio Evolution Over Time
View Code
year_type <- df %>%
count(Project_Year, Project_Type) %>%
complete(Project_Year, Project_Type, fill = list(n = 0))
ggplot(year_type, aes(x = Project_Year, y = n, fill = Project_Type)) +
geom_area(alpha = 0.7, position = "stack") +
scale_fill_manual(values = pal_type) +
labs(title = "CA Consultants Project Volume Over Time",
subtitle = "Accelerating growth since 2020 driven by commercial engagements",
x = "Year", y = "Number of Projects", fill = "Building Type") +
scale_x_continuous(breaks = seq(2006, 2026, 2))6.3 Design Complexity: Duration vs. Revisions
View Code
p_scatter <- df %>%
filter(!is.na(Design_Duration_Days) & !is.na(Revision_Count)) %>%
ggplot(aes(x = Design_Duration_Days, y = Revision_Count,
size = Value_Millions, colour = Project_Type,
text = paste0("ID: ", Project_ID, "<br>Value: ₦", round(Value_Millions, 1), "M"))) +
geom_point(alpha = 0.7) +
scale_colour_manual(values = pal_type) +
scale_size_continuous(range = c(2, 15), labels = comma_format(suffix = "M")) +
labs(title = "Design Complexity Landscape",
subtitle = "Longer design phases and more revisions correlate with higher-value projects",
x = "Design Duration (Days)", y = "Number of Revisions",
colour = "Building Type", size = "Value (₦M)")
ggplotly(p_scatter, tooltip = "text") %>%
layout(paper_bgcolor = '#0d1117', plot_bgcolor = '#161b22',
font = list(color = '#c9d1d9'))Each bubble is a project — size encodes value, colour encodes building type
Why a bubble chart? Three dimensions of project complexity — duration, revision count, and value — are encoded simultaneously. This reveals that the upper-right quadrant (long duration, many revisions) is exclusively occupied by high-value commercial projects, validating the intuition that complex projects command premium fees.
6.4 Revenue Concentration by Project Size Category
View Code
size_summary <- df %>%
group_by(Value_Category) %>%
summarise(
Count = n(),
Total_Value = sum(Project_Value, na.rm = TRUE) / 1e9,
.groups = "drop"
) %>%
mutate(
Pct_Count = 100 * Count / sum(Count),
Pct_Value = 100 * Total_Value / sum(Total_Value)
)
p_conc <- size_summary %>%
pivot_longer(cols = c(Pct_Count, Pct_Value), names_to = "Metric", values_to = "Pct") %>%
mutate(Metric = if_else(Metric == "Pct_Count", "% of Projects", "% of Revenue")) %>%
ggplot(aes(x = Value_Category, y = Pct, fill = Metric)) +
geom_col(position = "dodge", alpha = 0.85, width = 0.6) +
scale_fill_manual(values = c("% of Projects" = "#00d4ff", "% of Revenue" = "#ff006e")) +
labs(title = "The Revenue Concentration Paradox",
subtitle = "Mega-projects are rare but generate a disproportionate share of revenue",
x = NULL, y = "Percentage (%)", fill = NULL) +
geom_text(aes(label = paste0(round(Pct, 1), "%")),
position = position_dodge(width = 0.6), vjust = -0.5,
colour = "#c9d1d9", size = 3.2)
p_conc6.5 Discipline Scope and Project Value
View Code
ggplot(df, aes(x = reorder(Discipline, Value_Millions, FUN = median), y = Value_Millions, fill = Discipline)) +
geom_boxplot(alpha = 0.7, width = 0.5, outlier.colour = "#ff006e", outlier.alpha = 0.5) +
scale_fill_manual(values = pal_disc) +
coord_flip() +
labs(title = "Project Value by Discipline Scope",
subtitle = "Full-scope MEP projects command significantly higher contract values",
x = NULL, y = "Project Value (₦ Millions)") +
guides(fill = "none") +
scale_y_continuous(labels = comma_format())View Code
import seaborn as sns
fig, axes = plt.subplots(2, 2, figsize=(14, 11))
# 1. Violin: Value by Type
palette_type = {'Commercial': '#00d4ff', 'Industrial': '#ff006e', 'Residential': '#ffbe0b'}
order = ['Commercial', 'Industrial', 'Residential']
sns.violinplot(data=df_py, x='Project_Type', y='Value_Millions', order=order,
palette=palette_type, alpha=0.6, inner='box', ax=axes[0,0])
axes[0,0].set_title('Value Distribution by Type', color='#f0f6fc', fontweight='bold')
axes[0,0].set_xlabel('')
axes[0,0].set_ylabel('Value (₦ Millions)')
# 2. Timeline
year_counts = df_py.groupby(['Project_Year', 'Project_Type']).size().unstack(fill_value=0)
year_counts.plot.area(ax=axes[0,1], alpha=0.7,
color=[palette_type.get(c, '#888') for c in year_counts.columns])
axes[0,1].set_title('Portfolio Growth Over Time', color='#f0f6fc', fontweight='bold')
axes[0,1].set_xlabel('Year')
axes[0,1].set_ylabel('Number of Projects')
axes[0,1].legend(fontsize=8)
# 3. Scatter: Duration vs Value
mask = df_py['Design_Duration_Days'].notna() & df_py['Revision_Count'].notna()
for ptype, colour in palette_type.items():
subset = df_py[mask & (df_py['Project_Type'] == ptype)]
axes[1,0].scatter(subset['Design_Duration_Days'], subset['Value_Millions'],
c=colour, alpha=0.6, s=50, label=ptype, edgecolors='none')
axes[1,0].set_title('Duration vs Value', color='#f0f6fc', fontweight='bold')
axes[1,0].set_xlabel('Design Duration (Days)')
axes[1,0].set_ylabel('Value (₦ Millions)')
axes[1,0].legend(fontsize=8)
# 4. Box: Value by Discipline
disc_order = ['MEP', 'E Only', 'M Only']
palette_disc = {'MEP': '#00d4ff', 'E Only': '#ff006e', 'M Only': '#ffbe0b'}
sns.boxplot(data=df_py, x='Discipline', y='Value_Millions', order=disc_order,
palette=palette_disc, ax=axes[1,1])
axes[1,1].set_title('Value by Discipline Scope', color='#f0f6fc', fontweight='bold')
axes[1,1].set_xlabel('')
axes[1,1].set_ylabel('Value (₦ Millions)')
plt.tight_layout()
plt.show()7. Hypothesis Testing
Business Justification: When CA Consultants considers pivoting its business development toward a particular building type, the decision should rest on statistical evidence. Hypothesis testing moves us from “it looks like commercial projects are worth more” to “we can be 95% confident that the difference is real and not due to chance.” This is the difference between data-informed strategy and guesswork.
7.1 Hypothesis 1: Project Value Differs by Building Type
H₀: The mean project value is equal across Commercial, Industrial, and Residential projects. H₁: At least one building type has a significantly different mean project value. Test: Given the right-skewed distribution, we use the Kruskal-Wallis rank-sum test (non-parametric alternative to one-way ANOVA) as a primary test, supplemented by a Welch ANOVA for comparison.
View Code
# Check normality within groups (Shapiro-Wilk)
norm_tests <- df %>%
group_by(Project_Type) %>%
summarise(
n = n(),
shapiro_p = if (n() >= 3 & n() <= 5000) shapiro.test(Project_Value)$p.value else NA_real_,
.groups = "drop"
)
norm_tests %>%
mutate(Normal = if_else(shapiro_p > 0.05, "Yes", "No")) %>%
kbl(caption = "Shapiro-Wilk Normality Tests by Group", digits = 4) %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
row_spec(0, bold = TRUE, color = "#00d4ff", background = "#1c2333")| Project_Type | n | shapiro_p | Normal |
|---|---|---|---|
| Commercial | 64 | 0.0000 | No |
| Industrial | 14 | 0.0075 | No |
| Residential | 22 | 0.0001 | No |
View Code
# Kruskal-Wallis Test (non-parametric)
kw_test <- kruskal.test(Project_Value ~ Project_Type, data = df)
# Effect size (epsilon-squared)
kw_effect <- df %>% kruskal_effsize(Project_Value ~ Project_Type)
# Pairwise Wilcoxon with Bonferroni correction
pairwise <- pairwise.wilcox.test(df$Project_Value, df$Project_Type, p.adjust.method = "bonferroni")
cat(sprintf("**Kruskal-Wallis Test:**\n\n- χ² = %.2f, df = %d, **p-value = %.4f**\n- Effect size (η²[H]): %.3f (%s)\n",
kw_test$statistic, kw_test$parameter, kw_test$p.value,
kw_effect$effsize, kw_effect$magnitude))**Kruskal-Wallis Test:**
- χ² = 0.44, df = 2, **p-value = 0.8044**
- Effect size (η²[H]): -0.016 (small)
View Code
# Display pairwise comparisons
cat("**Post-Hoc Pairwise Wilcoxon Tests (Bonferroni-adjusted):**\n\n")**Post-Hoc Pairwise Wilcoxon Tests (Bonferroni-adjusted):**
View Code
print(pairwise)
Pairwise comparisons using Wilcoxon rank sum test with continuity correction
data: df$Project_Value and df$Project_Type
Commercial Industrial
Industrial 1 -
Residential 1 1
P value adjustment method: bonferroni
View Code
# Group medians for annotation
group_med <- df %>%
group_by(Project_Type) %>%
summarise(med = median(Project_Value / 1e6), .groups = "drop")
ggplot(df, aes(x = Project_Type, y = Value_Millions, fill = Project_Type)) +
geom_boxplot(alpha = 0.6, width = 0.5, outlier.alpha = 0.4) +
geom_jitter(aes(colour = Project_Type), width = 0.15, alpha = 0.4, size = 1.5) +
scale_fill_manual(values = pal_type) +
scale_colour_manual(values = pal_type) +
geom_text(data = group_med, aes(x = Project_Type, y = med,
label = paste0("Median: ₦", round(med, 1), "M")),
vjust = -1.5, colour = "#f0f6fc", size = 3.5, fontface = "bold") +
labs(title = "Project Value by Building Type",
subtitle = sprintf("Kruskal-Wallis p = %.4f — %s",
kw_test$p.value,
if_else(kw_test$p.value < 0.05,
"Significant difference detected",
"No significant difference")),
x = NULL, y = "Project Value (₦ Millions)") +
guides(fill = "none", colour = "none") +
scale_y_continuous(labels = comma_format())7.2 Hypothesis 2: Design Duration Differs by Building Type
H₀: The mean design duration is equal across Commercial, Industrial, and Residential projects. H₁: At least one building type has a significantly different mean design duration. Test: Kruskal-Wallis (non-parametric), using only projects with design data.
View Code
df_dur <- df %>% filter(!is.na(Design_Duration_Days))
# Kruskal-Wallis
kw_dur <- kruskal.test(Design_Duration_Days ~ Project_Type, data = df_dur)
kw_dur_effect <- df_dur %>% kruskal_effsize(Design_Duration_Days ~ Project_Type)
cat(sprintf("**Kruskal-Wallis Test (Design Duration ~ Building Type):**\n\n- χ² = %.2f, df = %d, **p-value = %.4f**\n- Effect size (η²[H]): %.3f (%s)\n- Observations used: %d (excluding supervision-only projects)\n",
kw_dur$statistic, kw_dur$parameter, kw_dur$p.value,
kw_dur_effect$effsize, kw_dur_effect$magnitude, nrow(df_dur)))**Kruskal-Wallis Test (Design Duration ~ Building Type):**
- χ² = 8.02, df = 2, **p-value = 0.0182**
- Effect size (η²[H]): 0.068 (moderate)
- Observations used: 92 (excluding supervision-only projects)
View Code
dur_med <- df_dur %>%
group_by(Project_Type) %>%
summarise(med = median(Design_Duration_Days), .groups = "drop")
ggplot(df_dur, aes(x = Project_Type, y = Design_Duration_Days, fill = Project_Type)) +
geom_boxplot(alpha = 0.6, width = 0.5, outlier.alpha = 0.4) +
geom_jitter(aes(colour = Project_Type), width = 0.15, alpha = 0.4, size = 1.5) +
scale_fill_manual(values = pal_type) +
scale_colour_manual(values = pal_type) +
geom_text(data = dur_med, aes(x = Project_Type, y = med,
label = paste0("Median: ", round(med, 0), " days")),
vjust = -1.5, colour = "#f0f6fc", size = 3.5, fontface = "bold") +
labs(title = "Design Duration by Building Type",
subtitle = sprintf("Kruskal-Wallis p = %.4f — %s",
kw_dur$p.value,
if_else(kw_dur$p.value < 0.05,
"Significant difference detected",
"No significant difference")),
x = NULL, y = "Design Duration (Days)") +
guides(fill = "none", colour = "none")View Code
from scipy import stats
# Clean numeric columns
df_py['Project_Value'] = pd.to_numeric(df_py['Project_Value_NGN'].str.replace(r'[₦, ]', '', regex=True), errors='coerce')
df_py['Design_Duration_Days'] = pd.to_numeric(df_py['Design_Duration_Days'], errors='coerce')
df_py['Revision_Count'] = pd.to_numeric(df_py['Revision_Count'], errors='coerce')
# Hypothesis 1: Value ~ Project Type
groups_val = [grp['Project_Value'].dropna().values
for _, grp in df_py.groupby('Project_Type')]
kw_stat, kw_p = stats.kruskal(*groups_val)
print("=" * 60)============================================================
View Code
print("HYPOTHESIS 1: Project Value ~ Building Type")HYPOTHESIS 1: Project Value ~ Building Type
View Code
print("=" * 60)============================================================
View Code
print(f"Kruskal-Wallis H = {kw_stat:.3f}, p-value = {kw_p:.4f}")Kruskal-Wallis H = 0.435, p-value = 0.8044
View Code
print(f"Decision: {'Reject H₀' if kw_p < 0.05 else 'Fail to reject H₀'} at α = 0.05")Decision: Fail to reject H₀ at α = 0.05
View Code
# Hypothesis 2: Duration ~ Project Type
df_dur_py = df_py.dropna(subset=['Design_Duration_Days'])
groups_dur = [grp['Design_Duration_Days'].dropna().values
for _, grp in df_dur_py.groupby('Project_Type')]
kw_stat2, kw_p2 = stats.kruskal(*groups_dur)
print(f"\n{'=' * 60}")
============================================================
View Code
print("HYPOTHESIS 2: Design Duration ~ Building Type")HYPOTHESIS 2: Design Duration ~ Building Type
View Code
print("=" * 60)============================================================
View Code
print(f"Kruskal-Wallis H = {kw_stat2:.3f}, p-value = {kw_p2:.4f}")Kruskal-Wallis H = 8.015, p-value = 0.0182
View Code
print(f"Decision: {'Reject H₀' if kw_p2 < 0.05 else 'Fail to reject H₀'} at α = 0.05")Decision: Reject H₀ at α = 0.05
View Code
# Group medians
print(f"\n{'=' * 60}")
============================================================
View Code
print("GROUP MEDIANS — Project Value (₦ Millions)")GROUP MEDIANS — Project Value (₦ Millions)
View Code
print("=" * 60)============================================================
View Code
print(df_py.groupby('Project_Type')['Value_Millions'].median().round(1).to_string())Project_Type
Commercial 37.7
Industrial 33.7
Residential 42.6
8. Correlation Analysis
Business Justification: Correlation analysis reveals the strength and direction of linear relationships between project characteristics. For CA Consultants, understanding whether design duration, revision count, and project year move together with project value is fundamental to fee estimation. Strong correlations suggest predictable relationships that can be built into pricing models; weak ones warn against oversimplified assumptions.
8.1 Correlation Matrix
View Code
# Use Spearman due to non-normality
corr_data <- df_design %>%
select(Project_Value, Design_Duration_Days, Revision_Count, Project_Year) %>%
rename(
`Project Value` = Project_Value,
`Design Duration` = Design_Duration_Days,
`Revision Count` = Revision_Count,
`Project Year` = Project_Year
)
corr_mat <- cor(corr_data, method = "spearman", use = "complete.obs")
ggcorrplot(corr_mat,
type = "lower",
lab = TRUE,
lab_size = 4.5,
lab_col = "#f0f6fc",
colors = c("#ff006e", "#161b22", "#00d4ff"),
outline.color = "#30363d",
ggtheme = theme_ca_dark()) +
labs(title = "Spearman Correlation Matrix",
subtitle = sprintf("Based on %d projects with complete design data", nrow(df_design)))8.2 Key Correlations — Business Interpretation
View Code
# Extract and rank correlations
corr_pairs <- as.data.frame(as.table(corr_mat)) %>%
filter(Var1 != Var2) %>%
mutate(abs_corr = abs(Freq)) %>%
arrange(desc(abs_corr)) %>%
distinct(abs_corr, .keep_all = TRUE) %>% # remove duplicates
head(6) %>%
rename(Variable_1 = Var1, Variable_2 = Var2, Spearman_rho = Freq) %>%
select(-abs_corr)
corr_pairs %>%
mutate(Spearman_rho = round(Spearman_rho, 3),
Strength = case_when(
abs(Spearman_rho) >= 0.7 ~ "Strong",
abs(Spearman_rho) >= 0.4 ~ "Moderate",
TRUE ~ "Weak"
)) %>%
kbl(caption = "Ranked Pairwise Correlations") %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
row_spec(0, bold = TRUE, color = "#00d4ff", background = "#1c2333")| Variable_1 | Variable_2 | Spearman_rho | Strength |
|---|---|---|---|
| Revision Count | Design Duration | 0.764 | Strong |
| Design Duration | Project Value | 0.676 | Moderate |
| Revision Count | Project Value | 0.567 | Moderate |
| Project Year | Project Value | 0.369 | Weak |
| Project Year | Revision Count | 0.058 | Weak |
| Project Year | Design Duration | 0.044 | Weak |
8.3 Scatterplot: Strongest Correlation
View Code
p_corr <- ggplot(df_design, aes(x = Design_Duration_Days, y = Value_Millions, colour = Project_Type)) +
geom_point(aes(text = paste0("ID: ", Project_ID, "<br>₦", round(Value_Millions, 1), "M")),
alpha = 0.7, size = 3) +
geom_smooth(method = "lm", se = TRUE, colour = "#7928ca", fill = "#7928ca", alpha = 0.15,
linetype = "dashed", formula = y ~ x) +
scale_colour_manual(values = pal_type) +
labs(title = "Design Duration vs. Project Value",
subtitle = "Longer design phases are associated with higher-value projects",
x = "Design Duration (Days)", y = "Project Value (₦ Millions)",
colour = "Building Type") +
scale_y_continuous(labels = comma_format())
ggplotly(p_corr, tooltip = "text") %>%
layout(paper_bgcolor = '#0d1117', plot_bgcolor = '#161b22',
font = list(color = '#c9d1d9'))Relationship between design duration and project value — the strongest observed correlation
View Code
# Prepare design-complete subset
df_design_py = df_py.dropna(subset=['Design_Duration_Days', 'Revision_Count'])
corr_cols = ['Project_Value', 'Design_Duration_Days', 'Revision_Count', 'Project_Year']
corr_matrix = df_design_py[corr_cols].corr(method='spearman')
fig, ax = plt.subplots(figsize=(8, 7))
mask = np.triu(np.ones_like(corr_matrix, dtype=bool))
sns.heatmap(corr_matrix, mask=mask, annot=True, fmt='.3f',
cmap='coolwarm', center=0, linewidths=1,
linecolor='#30363d', cbar_kws={'shrink': 0.8},
annot_kws={'color': '#f0f6fc', 'fontsize': 12},
ax=ax)
ax.set_title('Spearman Correlation Matrix', color='#f0f6fc', fontweight='bold', pad=15)
plt.tight_layout()
plt.show()View Code
# Print top correlations
print("\nTop Correlations (by absolute value):")
Top Correlations (by absolute value):
View Code
pairs = []
for i in range(len(corr_matrix.columns)):
for j in range(i+1, len(corr_matrix.columns)):
pairs.append((corr_matrix.columns[i], corr_matrix.columns[j],
corr_matrix.iloc[i, j]))
pairs.sort(key=lambda x: abs(x[2]), reverse=True)
for v1, v2, rho in pairs:
print(f" {v1} — {v2}: ρ = {rho:.3f}") Design_Duration_Days — Revision_Count: ρ = 0.764
Project_Value — Design_Duration_Days: ρ = 0.676
Project_Value — Revision_Count: ρ = 0.567
Project_Value — Project_Year: ρ = 0.369
Revision_Count — Project_Year: ρ = 0.058
Design_Duration_Days — Project_Year: ρ = 0.044
9. Linear Regression
Business Justification: Regression is the culmination of this analysis. While EDA describes, visualisation reveals, hypothesis testing confirms, and correlation quantifies pairwise relationships, regression puts it all together into a predictive model that explains project value as a function of multiple factors simultaneously. For CA Consultants, this model answers: “Given a project’s building type, scope discipline, expected design duration, and complexity (revisions), what contract value should we target?” Each regression coefficient translates directly into a pricing lever.
9.1 Model Specification
We model log-transformed project value to address the right-skewness identified in the EDA. The predictors are:
Design_Duration_Days— length of design phase (continuous)Revision_Count— number of design iterations (continuous)Project_Type— building type (categorical: Commercial, Industrial, Residential)Discipline— scope of MEP services (categorical)Project_Year— year initiated (continuous, captures inflation/growth)
View Code
# Fit the model on design-complete observations
model <- lm(log(Project_Value) ~ Design_Duration_Days + Revision_Count +
Project_Type + Discipline + Project_Year,
data = df_design)
# Model summary
model_summary <- summary(model)
model_tidy <- tidy(model, conf.int = TRUE)
model_glance <- glance(model)
cat(sprintf("**Model Fit:**\n\n- R² = %.3f (Adjusted R² = %.3f)\n- F(%d, %d) = %.2f, p = %.2e\n- Residual SE = %.3f\n- Observations: %d\n",
model_glance$r.squared, model_glance$adj.r.squared,
model_glance$df, model_glance$df.residual,
model_glance$statistic, model_glance$p.value,
model_glance$sigma, nrow(df_design)))**Model Fit:**
- R² = 0.469 (Adjusted R² = 0.421)
- F(7, 78) = 9.83, p = 1.00e-08
- Residual SE = 0.947
- Observations: 86
9.2 Coefficient Table
View Code
model_tidy %>%
mutate(
across(c(estimate, std.error, statistic, conf.low, conf.high), ~round(., 4)),
p.value = format.pval(p.value, digits = 3),
Significance = case_when(
as.numeric(p.value) < 0.001 ~ "***",
as.numeric(p.value) < 0.01 ~ "**",
as.numeric(p.value) < 0.05 ~ "*",
as.numeric(p.value) < 0.1 ~ ".",
TRUE ~ ""
)
) %>%
select(Term = term, Estimate = estimate, `Std. Error` = std.error,
`t-value` = statistic, `p-value` = p.value, Significance,
`95% CI Low` = conf.low, `95% CI High` = conf.high) %>%
kbl(caption = "Regression Coefficients — log(Project Value)") %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = TRUE, font_size = 12) %>%
row_spec(0, bold = TRUE, color = "#00d4ff", background = "#1c2333")| Term | Estimate | Std. Error | t-value | p-value | Significance | 95% CI Low | 95% CI High |
|---|---|---|---|---|---|---|---|
| (Intercept) | -174.4380 | 55.7227 | -3.1305 | 0.00246 | ** | -285.3735 | -63.5025 |
| Design_Duration_Days | 0.0027 | 0.0014 | 1.9967 | 0.04935 | * | 0.0000 | 0.0054 |
| Revision_Count | 0.3238 | 0.0693 | 4.6744 | 1.21e-05 | *** | 0.1859 | 0.4616 |
| Project_TypeIndustrial | 0.6776 | 0.4196 | 1.6149 | 0.11037 | -0.1577 | 1.5129 | |
| Project_TypeResidential | -0.1796 | 0.2590 | -0.6937 | 0.48993 | -0.6952 | 0.3359 | |
| DisciplineM Only | -0.1583 | 0.7053 | -0.2245 | 0.82298 | -1.5624 | 1.2458 | |
| DisciplineMEP | 0.3953 | 0.4408 | 0.8967 | 0.37264 | -0.4823 | 1.2729 | |
| Project_Year | 0.0941 | 0.0276 | 3.4113 | 0.00103 | ** | 0.0392 | 0.1490 |
9.3 Coefficient Plot
View Code
model_tidy %>%
filter(term != "(Intercept)") %>%
mutate(term = fct_reorder(term, estimate)) %>%
ggplot(aes(x = estimate, y = term)) +
geom_vline(xintercept = 0, colour = "#8b949e", linetype = "dashed") +
geom_point(colour = "#00d4ff", size = 3) +
geom_errorbarh(aes(xmin = conf.low, xmax = conf.high),
height = 0.2, colour = "#00d4ff", linewidth = 0.8) +
labs(title = "Regression Coefficient Estimates",
subtitle = "log(Project Value) model — points show estimate, bars show 95% CI",
x = "Estimate (log scale)", y = NULL)9.4 Diagnostic Plots
View Code
par(mfrow = c(2, 2),
bg = "#161b22", fg = "#c9d1d9",
col.axis = "#8b949e", col.lab = "#c9d1d9", col.main = "#f0f6fc")
plot(model, col = "#00d4ff", pch = 16, cex = 0.8,
sub.caption = "")View Code
# Variance Inflation Factors
vif_vals <- car::vif(model)
vif_df <- if (is.matrix(vif_vals)) {
data.frame(Term = rownames(vif_vals), GVIF = round(vif_vals[, 1], 2),
Df = vif_vals[, 2], `GVIF^(1/2Df)` = round(vif_vals[, 3], 2))
} else {
data.frame(Term = names(vif_vals), VIF = round(vif_vals, 2))
}
vif_df %>%
kbl(caption = "Variance Inflation Factors — checking for multicollinearity") %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
row_spec(0, bold = TRUE, color = "#00d4ff", background = "#1c2333")| Term | GVIF | Df | GVIF..1.2Df. | |
|---|---|---|---|---|
| Design_Duration_Days | Design_Duration_Days | 1.46 | 1 | 1.21 |
| Revision_Count | Revision_Count | 1.42 | 1 | 1.19 |
| Project_Type | Project_Type | 1.19 | 2 | 1.05 |
| Discipline | Discipline | 1.07 | 2 | 1.02 |
| Project_Year | Project_Year | 1.05 | 1 | 1.03 |
9.5 Business Interpretation
Interpreting Key Coefficients (on the log scale → percentage change in value):
- Design Duration: Each additional design day is associated with a 0.3%% change in project value, holding other factors constant. For a project with a baseline value of ₦50M, adding 10 design days corresponds to roughly ₦1.4M in additional value.
- Revision Count: Each additional revision is associated with a 38.2%% change in project value. Revisions reflect project complexity — more intricate buildings require more design iterations and command higher fees.
- Project Year: The year coefficient captures annual growth in project values due to inflation and the firm’s upward market positioning.
View Code
import statsmodels.api as sm
import statsmodels.formula.api as smf
# Prepare data
df_reg_py = df_design_py.copy()
df_reg_py['log_value'] = np.log(df_reg_py['Project_Value'])
# Fit OLS model
formula = 'log_value ~ Design_Duration_Days + Revision_Count + C(Project_Type) + C(Discipline) + Project_Year'
model_py = smf.ols(formula, data=df_reg_py).fit()
print(model_py.summary()) OLS Regression Results
==============================================================================
Dep. Variable: log_value R-squared: 0.469
Model: OLS Adj. R-squared: 0.421
Method: Least Squares F-statistic: 9.829
Date: Thu, 21 May 2026 Prob (F-statistic): 1.00e-08
Time: 21:26:58 Log-Likelihood: -113.19
No. Observations: 86 AIC: 242.4
Df Residuals: 78 BIC: 262.0
Df Model: 7
Covariance Type: nonrobust
==================================================================================================
coef std err t P>|t| [0.025 0.975]
--------------------------------------------------------------------------------------------------
Intercept -174.4380 55.723 -3.130 0.002 -285.373 -63.503
C(Project_Type)[T.Industrial] 0.6776 0.420 1.615 0.110 -0.158 1.513
C(Project_Type)[T.Residential] -0.1796 0.259 -0.694 0.490 -0.695 0.336
C(Discipline)[T.M Only] -0.1583 0.705 -0.224 0.823 -1.562 1.246
C(Discipline)[T.MEP] 0.3953 0.441 0.897 0.373 -0.482 1.273
Design_Duration_Days 0.0027 0.001 1.997 0.049 7.91e-06 0.005
Revision_Count 0.3238 0.069 4.674 0.000 0.186 0.462
Project_Year 0.0941 0.028 3.411 0.001 0.039 0.149
==============================================================================
Omnibus: 22.945 Durbin-Watson: 1.734
Prob(Omnibus): 0.000 Jarque-Bera (JB): 43.364
Skew: -0.998 Prob(JB): 3.83e-10
Kurtosis: 5.849 Cond. No. 1.10e+06
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.1e+06. This might indicate that there are
strong multicollinearity or other numerical problems.
View Code
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
# Residuals vs Fitted
fitted = model_py.fittedvalues
resids = model_py.resid
axes[0].scatter(fitted, resids, alpha=0.6, color='#00d4ff', edgecolors='none')
axes[0].axhline(y=0, color='#ff006e', linestyle='--', alpha=0.7)
axes[0].set_title('Residuals vs Fitted', color='#f0f6fc', fontweight='bold')
axes[0].set_xlabel('Fitted Values')
axes[0].set_ylabel('Residuals')
# Q-Q Plot
from scipy.stats import probplot
probplot(resids, plot=axes[1])((array([-2.40766479, -2.06458227, -1.86543774, -1.72085246, -1.60526667,
-1.50785103, -1.42295159, -1.34722519, -1.27852253, -1.21537585,
-1.15673583, -1.10182483, -1.05004948, -1.00094605, -0.95414466,
-0.90934518, -0.86630034, -0.82480377, -0.78468115, -0.74578374,
-0.70798335, -0.67116861, -0.63524191, -0.60011712, -0.56571768,
-0.5319751 , -0.4988277 , -0.46621963, -0.4341 , -0.40242217,
-0.37114319, -0.34022324, -0.30962526, -0.27931451, -0.24925828,
-0.2194256 , -0.18978698, -0.16031418, -0.13098001, -0.10175815,
-0.07262293, -0.04354925, -0.01451234, 0.01451234, 0.04354925,
0.07262293, 0.10175815, 0.13098001, 0.16031418, 0.18978698,
0.2194256 , 0.24925828, 0.27931451, 0.30962526, 0.34022324,
0.37114319, 0.40242217, 0.4341 , 0.46621963, 0.4988277 ,
0.5319751 , 0.56571768, 0.60011712, 0.63524191, 0.67116861,
0.70798335, 0.74578374, 0.78468115, 0.82480377, 0.86630034,
0.90934518, 0.95414466, 1.00094605, 1.05004948, 1.10182483,
1.15673583, 1.21537585, 1.27852253, 1.34722519, 1.42295159,
1.50785103, 1.60526667, 1.72085246, 1.86543774, 2.06458227,
2.40766479]), array([-3.71181264, -2.97765747, -1.40609497, -1.31520113, -1.25835835,
-1.22678431, -1.13513078, -1.0351474 , -1.0235931 , -1.01536067,
-0.97420614, -0.95285997, -0.94940802, -0.92731611, -0.77662532,
-0.70334381, -0.66007517, -0.54269396, -0.53889212, -0.53781839,
-0.52129653, -0.50797382, -0.45193639, -0.36821322, -0.36766873,
-0.30886751, -0.27825483, -0.27412426, -0.24304611, -0.23282302,
-0.20944849, -0.19278054, -0.1890824 , -0.16664233, -0.14477805,
-0.13286754, -0.11872129, -0.10960702, -0.0848119 , -0.03787557,
-0.03701624, 0.00497819, 0.03015306, 0.05752472, 0.10371569,
0.1105615 , 0.11442462, 0.12100739, 0.13325591, 0.14791516,
0.16850726, 0.19544901, 0.19933152, 0.21147898, 0.21777502,
0.22983991, 0.2683272 , 0.32502825, 0.35102551, 0.3596311 ,
0.39917961, 0.44515726, 0.4755828 , 0.48768985, 0.58993662,
0.60190576, 0.72537772, 0.81211412, 0.8441547 , 0.94382601,
0.95685372, 0.95910978, 0.96724439, 0.99003578, 0.99332189,
1.01658214, 1.07821001, 1.11357423, 1.1207581 , 1.15071333,
1.24243218, 1.26725314, 1.33259159, 1.36082603, 1.55668154,
1.86517327])), (np.float64(0.8906723212815882), np.float64(4.395768217217487e-13), np.float64(0.9646639085664399)))
View Code
axes[1].get_lines()[0].set_color('#00d4ff')
axes[1].get_lines()[0].set_markersize(5)
axes[1].get_lines()[1].set_color('#ff006e')
axes[1].set_title('Normal Q-Q Plot', color='#f0f6fc', fontweight='bold')
plt.tight_layout()
plt.show()10. Integrated Findings
The five analytical techniques applied in this study converge on a coherent narrative about CA Consultants’ project portfolio and the drivers of project value:
1. The Portfolio is Concentrated and Skewed. EDA revealed that the firm’s revenue is dominated by a small number of high-value commercial projects. The top quartile of projects by value accounts for a disproportionate share of total portfolio revenue, while the majority of projects fall below ₦60M. This concentration creates both opportunity (premium commercial work drives growth) and risk (losing a single mega-project has outsized impact).
2. Commercial Projects Are Statistically More Valuable. Hypothesis testing confirmed that the observed differences in project value across building types are not due to chance. Commercial projects command significantly higher contract values than industrial or residential engagements. This finding directly supports a business development strategy that prioritises commercial-sector clients while maintaining industrial and residential work for portfolio diversification.
3. Design Complexity Predicts Value. Correlation analysis identified design duration and revision count as the strongest correlates of project value. The regression model quantified this: each additional design day and each additional revision round is associated with a measurable increase in contract value, after controlling for building type and discipline scope.
4. Full-Scope MEP Engagements Outperform. Projects where CA Consultants provides the full MEP package (mechanical, electrical, and plumbing) command higher values than single-discipline (E Only or M Only) engagements. This supports a strategy of upselling comprehensive service packages rather than competing for narrow-scope contracts.
CA Consultants should anchor its growth strategy on full-scope MEP design for commercial buildings, using the regression model’s coefficients to build a transparent, data-driven fee estimation framework. Specifically: (a) prioritise commercial-sector business development, (b) quote fees based on expected design duration and revision complexity using the empirical relationships identified here, and (c) invest in design-phase process efficiency to reduce unnecessary revisions while maintaining the thoroughness that high-value clients expect.
11. Limitations & Further Work
Data limitations:
- The dataset contains 100 observations spanning 20 years. While this represents the firm’s full digitised record, the sample size is modest for regression modelling, and earlier records (pre-2015) are sparse.
- Missing design duration and revision data for supervision-only projects reduces the effective sample for correlation and regression to 86 observations.
- Project value is recorded as total contract value, not broken down by MEP discipline. A discipline-level breakdown would enable more granular cost modelling.
- The data captures project-level attributes but not team composition, client characteristics, or external market conditions — all of which likely influence project value.
Methodological limitations:
- Hypothesis tests used non-parametric methods due to non-normality, which are more conservative (less statistical power) than parametric alternatives.
- The regression model uses cross-sectional data without controls for inflation or market conditions; the
Project_Yearvariable is a rough proxy. - Some project IDs appear multiple times (e.g., different phases or proposals for the same client), which may introduce mild non-independence.
Future work with more data, time, or computing power:
- Collect client industry codes and location data (mainland Lagos vs. island) to model geographic and sectoral pricing effects.
- Build a time-series model of project pipeline values to forecast quarterly revenue.
- Introduce design team size and junior/senior engineer ratio as predictors to model the labour-cost side of project economics.
- Develop an interactive dashboard (Shiny or Streamlit) for real-time portfolio monitoring and proposal pricing.
References
Adi, B. (2026). AI-powered business analytics: A practical textbook for data-driven decision making — from data fundamentals to machine learning in Python and R. Lagos Business School / markanalytics.online. https://markanalytics.online
Mba, C. (2026). CA Consultants Limited MEP project register (2006–2026) [Dataset]. Collected from CA Consultants Limited, Project Management Department, Lagos, Nigeria. Data available on request from the author.
R Core Team. (2024). R: A language and environment for statistical computing (Version 4.x). R Foundation for Statistical Computing. https://www.R-project.org/
Van Rossum, G., & Drake, F. L. (2009). Python 3 reference manual. CreateSpace.
Wickham, H., Averick, M., Bryan, J., Chang, W., McGowan, L., François, R., Grolemund, G., Hayes, A., Henry, L., Hester, J., Kuhn, M., Pedersen, T. L., Miller, E., Bache, S. M., Müller, K., Ooms, J., Robinson, D., Seidel, D. P., Spinu, V., … Yutani, H. (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686. https://doi.org/10.21105/joss.01686
Wickham, H. (2016). ggplot2: Elegant graphics for data analysis. Springer. https://doi.org/10.1007/978-3-319-24277-4
Sievert, C. (2020). Interactive web-based data visualization with R, plotly, and shiny. Chapman and Hall/CRC. https://plotly-r.com
Kassambara, A. (2023). ggcorrplot: Visualization of a correlation matrix using ggplot2 (R package). https://CRAN.R-project.org/package=ggcorrplot
Fox, J., & Weisberg, S. (2019). An R companion to applied regression (3rd ed.). Sage. https://socialsciences.mcmaster.ca/jfox/Books/Companion/
McKinney, W. (2010). Data structures for statistical computing in Python. In Proceedings of the 9th Python in Science Conference (pp. 56–61). https://doi.org/10.25080/Majora-92bf1922-00a
Pedersen, T. L. (2024). patchwork: The composer of plots (R package). https://CRAN.R-project.org/package=patchwork
Kassambara, A. (2023). rstatix: Pipe-friendly framework for basic statistical tests (R package). https://CRAN.R-project.org/package=rstatix
Allaire, J. J., Teague, C., Scheidegger, C., Xie, Y., & Dervieux, C. (2022). Quarto (Version 1.x) [Computer software]. https://doi.org/10.5281/zenodo.5960048
Appendix: AI Usage Statement
This submission was prepared with the assistance of Claude (Anthropic), an AI coding assistant. Claude was used for: (1) generating and debugging R and Python code for data cleaning, statistical tests, and visualisation; (2) formatting the Quarto document structure and custom CSS styling; and (3) suggesting appropriate statistical methods given the data characteristics (e.g., recommending Kruskal-Wallis over ANOVA due to non-normality).
All analytical decisions were made independently by the author, including: the choice of Case Study 1 over alternatives, the selection of project value as the outcome variable, the formulation of both hypotheses based on real business questions at CA Consultants, the interpretation of all statistical outputs, and the integrated recommendation. The data was collected personally from CA Consultants’ internal systems. The author can explain every line of code and every result in this document.