An Exploratory & Inferential Study of MEP Project Performance at CA Consultants Limited

Author

Chidubem Mba — Senior Project Engineer & Project Manager

Published

May 21, 2026

1. Executive Summary

This study analyses the project portfolio of CA Consultants Limited, a leading MEP (Mechanical, Electrical, and Plumbing) consulting firm headquartered in Lagos, Nigeria. Drawing on 100 real project records spanning 2006–2026, extracted from the firm’s internal project management systems, I investigate what drives project value across different building types, service disciplines, and design complexity levels. Five foundational analytical techniques — exploratory data analysis, data visualisation, hypothesis testing, correlation analysis, and multiple linear regression — are applied to move from exploration to inference. Key findings reveal that the portfolio is heavily right-skewed, with a small number of high-value projects contributing disproportionately to total revenue. Commercial projects command significantly higher values than residential and industrial engagements, and design duration together with revision count are meaningful predictors of project value. The analysis yields an actionable pricing and resource-allocation framework: longer-duration, multi-discipline MEP projects consistently deliver greater value, and the firm should prioritise commercial engagements while investing in design-phase efficiency to reduce costly revision cycles.

2. Professional Disclosure

Name: Chidubem Mba Title: Senior Project Engineer & Project Manager Organisation: CA Consultants Limited, Lagos, Nigeria Sector: Construction / Building Services (MEP Engineering)

CA Consultants Limited is a Mechanical, Electrical, and Plumbing (MEP) consulting firm operating in the Nigerian construction industry. The firm provides design and construction supervision services for commercial, industrial, and residential buildings — from office towers and factories to luxury residences. As a Senior Project Engineer and Project Manager, I am responsible for managing design deliverables, coordinating with architects and contractors, overseeing project timelines, and ensuring quality of MEP installations.

Technique Justification

Exploratory Data Analysis (EDA): Understanding the composition and quality of our project data is the essential first step. As a project manager, I need to know the distribution of project sizes, identify data gaps (such as supervision-only projects lacking design records), and detect anomalous entries before any strategic conclusions can be drawn. EDA allows me to assess the health of our data and uncover patterns that inform all subsequent analyses.

Data Visualisation: Communicating project performance to firm leadership and clients requires clear, compelling visual narratives. Visualisation transforms raw project records into actionable intelligence — showing at a glance which sectors drive revenue, how portfolio composition has shifted over time, and where design complexity concentrates. Every plot in this study answers a question a principal at CA Consultants would ask in a management meeting.

Hypothesis Testing: Resource allocation decisions at CA Consultants — such as whether to pursue more commercial versus industrial work — must be grounded in statistical evidence, not intuition. Formal hypothesis testing allows me to determine whether observed differences in project value across building types are statistically significant or merely due to sampling variability. This directly informs our business development strategy.

Correlation Analysis: Understanding the relationships between project characteristics — value, design duration, revision count, and scope — is critical for project planning. If design duration correlates strongly with project value, that relationship shapes how we quote fees and allocate engineering hours. Identifying which correlations are strongest, and which are spurious, helps avoid costly planning errors.

Linear Regression: The ultimate goal is a quantitative model that explains and predicts project value based on measurable project attributes. A regression model allows me to quantify the marginal contribution of each extra design day, each additional revision, and each project type to the contract value. This directly supports fee-setting, proposal evaluation, and capacity planning at CA Consultants.

3. Data Collection & Sampling

Source and Collection Method

The dataset was extracted directly from the internal project register of CA Consultants Limited. This register is maintained by the project management team (of which I am a member) and records every project the firm undertakes — from initial proposal through to completion. Data extraction was performed manually from the company’s enterprise filing system and project tracking spreadsheets in April 2026.

Sampling Frame and Size

The dataset constitutes a census of all identifiable projects in the firm’s records from 2006 to 2026, comprising 100 project records. This is not a sample but a near-complete enumeration of the firm’s project history, subject to the availability of records in the company’s systems. Projects prior to 2006 were excluded due to incomplete digitisation of older records.

Variables Collected

Variable Type Description
Project_ID Identifier Unique project code assigned by the firm
Project_Type Categorical Building type: Commercial, Industrial, or Residential
Project_Status Categorical Current status: Completed, Ongoing, or Proposal
Project_Value_NGN Numeric (continuous) Contract value in Nigerian Naira (₦)
Design_Duration_Days Numeric (continuous) Duration of the design phase in calendar days
Revision_Count Numeric (discrete) Number of design revisions issued
Project_Year Date/Time (year) Year the project was initiated
Discipline Categorical Scope of MEP services: MEP (full), E Only, M Only

Handling of Missing Data

A total of 8 observations are missing Design_Duration_Days and 14 observations are missing Revision_Count. These are structurally missing — not data-entry errors. These projects were construction supervision engagements only: CA Consultants supervised the on-site installation of MEP systems designed by another firm. Since no design work was performed in-house, no design duration or revision count exists. This is documented and handled transparently throughout the analysis — these records are retained for portfolio-level analyses (value, type, year) and excluded only from design-specific analyses (correlation, regression).

Ethical Statement

This data was collected in my capacity as a Senior Project Engineer with authorised access to the firm’s project records. All project identifiers use internal codes (e.g., CA658, CA042/PJ027) that do not reveal client names. Financial values are contract-level figures, not client-confidential breakdowns. No personally identifiable information (PII) is included. Permission to use anonymised project data for academic purposes was obtained from CA Consultants management.

4. Data Description

Total Projects 100 2006 — 2026
Portfolio Value ₦6.5B Aggregate Contract Value
Median Project Value ₦39.8M Central tendency
Completed Projects 57 57%% completion rate
Avg. Design Duration 66 days Design-phase projects

4.1 Summary Statistics

View Code
df %>%
  select(Project_Value, Design_Duration_Days, Revision_Count, Project_Year) %>%
  pivot_longer(everything(), names_to = "Variable", values_to = "Value") %>%
  group_by(Variable) %>%
  summarise(
    N       = sum(!is.na(Value)),
    Missing = sum(is.na(Value)),
    Mean    = mean(Value, na.rm = TRUE),
    Median  = median(Value, na.rm = TRUE),
    SD      = sd(Value, na.rm = TRUE),
    Min     = min(Value, na.rm = TRUE),
    Max     = max(Value, na.rm = TRUE),
    Skewness = skewness(Value, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  mutate(across(where(is.numeric), ~round(., 2))) %>%
  kbl(caption = "Descriptive Statistics for Numeric Variables",
      format.args = list(big.mark = ",")) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = TRUE, font_size = 13) %>%
  row_spec(0, bold = TRUE, color = "#00d4ff", background = "#1c2333")
Descriptive Statistics for Numeric Variables
Variable N Missing Mean Median SD Min Max Skewness
Design_Duration_Days 92 8 65.68 40 89.63 2 600 3.44
Project_Value 100 0 65,225,972.06 39,816,154 75,188,594.11 258,557 360,000,000 1.97
Project_Year 100 0 2,021.34 2,023 3.72 2,006 2,026 -1.26
Revision_Count 86 14 3.67 4 1.77 1 6 -0.22

4.2 Categorical Variable Distribution

View Code
cat_summary <- bind_rows(
  df %>% count(Project_Type) %>% rename(Category = Project_Type) %>% mutate(Variable = "Project_Type"),
  df %>% count(Project_Status) %>% rename(Category = Project_Status) %>% mutate(Variable = "Project_Status"),
  df %>% count(Discipline) %>% rename(Category = Discipline) %>% mutate(Variable = "Discipline")
) %>%
  group_by(Variable) %>%
  mutate(Pct = round(100 * n / sum(n), 1)) %>%
  ungroup() %>%
  select(Variable, Category, Count = n, `%` = Pct)

cat_summary %>%
  kbl(caption = "Frequency Distribution of Categorical Variables") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = TRUE, font_size = 13) %>%
  row_spec(0, bold = TRUE, color = "#00d4ff", background = "#1c2333") %>%
  pack_rows(index = table(cat_summary$Variable))
Frequency Distribution of Categorical Variables
Variable Category Count %
Discipline
Project_Type Commercial 64 64
Project_Type Industrial 14 14
Project_Type Residential 22 22
Project_Status
Project_Status Completed 57 57
Project_Status Ongoing 15 15
Project_Status Proposal 28 28
Project_Type
Discipline E Only 9 9
Discipline M Only 5 5
Discipline MEP 86 86

4.3 Interactive Data Explorer

View Code
df %>%
  select(Project_ID, Project_Type, Project_Status, Value_Millions,
         Design_Duration_Days, Revision_Count, Project_Year, Discipline) %>%
  mutate(Value_Millions = round(Value_Millions, 2)) %>%
  rename(`Value (₦M)` = Value_Millions, `Duration (Days)` = Design_Duration_Days,
         Revisions = Revision_Count, Year = Project_Year,
         Type = Project_Type, Status = Project_Status, ID = Project_ID) %>%
  datatable(
    caption = "Full Project Dataset — CA Consultants Limited",
    filter = "top",
    options = list(pageLength = 10, scrollX = TRUE,
                   dom = 'Bfrtip',
                   initComplete = JS(
                     "function(settings, json) {",
                     "  $(this.api().table().header()).css({'background-color': '#1c2333', 'color': '#00d4ff'});",
                     "}"
                   )),
    rownames = FALSE
  )

5. Exploratory Data Analysis (EDA)

“Far better an approximate answer to the right question… than an exact answer to the wrong question.” — John Tukey

Business Justification: Before building models or testing hypotheses, a project manager must understand the shape and quality of the data that drives decisions. EDA reveals whether our project records are complete, whether extreme values distort averages, and what the underlying distributions look like. At CA Consultants, this means understanding whether our reported portfolio metrics faithfully represent actual project economics.

5.1 Missing Value Analysis

View Code
missing_summary <- df %>%
  summarise(across(everything(), ~sum(is.na(.)))) %>%
  pivot_longer(everything(), names_to = "Variable", values_to = "Missing") %>%
  mutate(
    Total = nrow(df),
    `% Missing` = round(100 * Missing / Total, 1),
    Status = if_else(Missing == 0, "Complete", "Has Missing")
  ) %>%
  filter(Missing > 0) %>%
  arrange(desc(Missing))

missing_summary %>%
  kbl(caption = "Variables with Missing Values") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE, font_size = 13) %>%
  row_spec(0, bold = TRUE, color = "#00d4ff", background = "#1c2333")
Variables with Missing Values
Variable Missing Total % Missing Status
Revision_Count 14 100 14 Has Missing
Design_Duration_Days 8 100 8 Has Missing
Data Quality Finding 1 — Structurally Missing Values: Design_Duration_Days and Revision_Count are missing for projects where CA Consultants provided construction supervision only — no design work was performed in-house, so no design duration or revision history exists. These are not data-entry errors but reflect a genuine operational distinction. These records are retained for portfolio-level analysis and excluded from design-specific modelling.

5.2 Outlier Detection

View Code
# IQR-based outlier detection on Project Value
Q1  <- quantile(df$Project_Value, 0.25, na.rm = TRUE)
Q3  <- quantile(df$Project_Value, 0.75, na.rm = TRUE)
IQR_val <- Q3 - Q1
lower_fence <- Q1 - 1.5 * IQR_val
upper_fence <- Q3 + 1.5 * IQR_val

outliers <- df %>% filter(Project_Value > upper_fence | Project_Value < lower_fence)

cat(sprintf("**IQR Method for Project Value:**\n\n- Q1 = ₦%s | Q3 = ₦%s | IQR = ₦%s\n- Upper fence = ₦%s\n- **%d outliers detected** (%.1f%% of data) — all on the high end\n",
            comma(Q1), comma(Q3), comma(IQR_val),
            comma(upper_fence), nrow(outliers), 100*nrow(outliers)/nrow(df)))
**IQR Method for Project Value:**

- Q1 = ₦13,793,971 | Q3 = ₦86,419,210 | IQR = ₦72,625,240
- Upper fence = ₦195,357,070
- **8 outliers detected** (8.0% of data) — all on the high end
View Code
ggplot(df, aes(x = Project_Value / 1e6)) +
geom_histogram(aes(y = after_stat(density)), bins = 30, fill = "#00d4ff", alpha = 0.6, colour = "#161b22") +
  geom_density(colour = "#ff006e", linewidth = 1.2) +
  geom_vline(xintercept = upper_fence / 1e6, colour = "#ffbe0b", linetype = "dashed", linewidth = 0.8) +
  annotate("text", x = upper_fence / 1e6 + 15, y = Inf, label = "Outlier Fence",
           colour = "#ffbe0b", vjust = 2, hjust = 0, size = 3.5) +
  labs(title = "Distribution of Project Values",
       subtitle = "Right-skewed with high-value outliers typical of construction portfolios",
       x = "Project Value (₦ Millions)", y = "Density") +
  scale_x_continuous(labels = comma_format())

Project Value Distribution with Outlier Boundaries
Data Quality Finding 2 — Right Skewness: Project values are heavily right-skewed (skewness ≈ 2+), meaning a few mega-projects (>₦150M) pull the mean well above the median. This is typical of construction consulting portfolios where a small number of landmark projects generate disproportionate revenue. For regression analysis, a log-transformation will be applied to stabilise variance and improve model fit.

5.3 Distribution by Category

View Code
p1 <- ggplot(df, aes(x = Project_Type, fill = Project_Type)) +
  geom_bar(width = 0.6, alpha = 0.85) +
  scale_fill_manual(values = pal_type) +
  labs(title = "Projects by Building Type", x = NULL, y = "Count") +
  guides(fill = "none")

p2 <- ggplot(df, aes(x = Project_Status, fill = Project_Status)) +
  geom_bar(width = 0.6, alpha = 0.85) +
  scale_fill_manual(values = pal_status) +
  labs(title = "Projects by Status", x = NULL, y = "Count") +
  guides(fill = "none")

p3 <- ggplot(df, aes(x = Discipline, fill = Discipline)) +
  geom_bar(width = 0.6, alpha = 0.85) +
  scale_fill_manual(values = pal_disc) +
  labs(title = "Projects by Discipline Scope", x = NULL, y = "Count") +
  guides(fill = "none")

(p1 | p2 | p3) +
  plot_annotation(
    title = "Portfolio Composition at a Glance",
    subtitle = "Commercial projects and full-scope MEP engagements dominate the portfolio",
    theme = theme_ca_dark()
  )

View Code
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib
matplotlib.rcParams.update({
    'figure.facecolor': '#161b22',
    'axes.facecolor': '#161b22',
    'axes.edgecolor': '#30363d',
    'axes.labelcolor': '#c9d1d9',
    'text.color': '#c9d1d9',
    'xtick.color': '#8b949e',
    'ytick.color': '#8b949e',
    'grid.color': '#30363d',
    'font.family': 'sans-serif',
    'font.size': 11
})

# Load and clean data
df_py = pd.read_csv("mep_projects.csv")
df_py['Project_Value'] = df_py['Project_Value_NGN'].str.replace(r'[₦, ]', '', regex=True).astype(float)
df_py['Value_Millions'] = df_py['Project_Value'] / 1e6

# Summary statistics
print("=" * 60)
============================================================
View Code
print("NUMERIC VARIABLE SUMMARY")
NUMERIC VARIABLE SUMMARY
View Code
print("=" * 60)
============================================================
View Code
print(df_py[['Project_Value', 'Design_Duration_Days', 'Revision_Count', 'Project_Year']].describe().round(2).to_string())
       Project_Value  Design_Duration_Days  Revision_Count  Project_Year
count   1.000000e+02                 92.00           86.00        100.00
mean    6.522597e+07                 65.68            3.67       2021.34
std     7.518859e+07                 89.63            1.77          3.72
min     2.585570e+05                  2.00            1.00       2006.00
25%     1.379397e+07                 20.00            2.00       2019.00
50%     3.981615e+07                 40.00            4.00       2023.00
75%     8.641921e+07                 80.00            4.00       2024.00
max     3.600000e+08                600.00            6.00       2026.00
View Code
print("\n" + "=" * 60)

============================================================
View Code
print("MISSING VALUES")
MISSING VALUES
View Code
print("=" * 60)
============================================================
View Code
print(df_py.isnull().sum().to_string())
Project_ID               0
Project_Type             0
Project_Status           0
Project_Value_NGN        0
Design_Duration_Days     8
Revision_Count          14
Project_Year             0
Discipline               0
Project_Value            0
Value_Millions           0
View Code
print("\n" + "=" * 60)

============================================================
View Code
print("CATEGORICAL DISTRIBUTIONS")
CATEGORICAL DISTRIBUTIONS
View Code
print("=" * 60)
============================================================
View Code
for col in ['Project_Type', 'Project_Status', 'Discipline']:
    print(f"\n{col}:")
    counts = df_py[col].value_counts()
    for val, cnt in counts.items():
        print(f"  {val}: {cnt} ({100*cnt/len(df_py):.1f}%)")

Project_Type:
  Commercial: 64 (64.0%)
  Residential: 22 (22.0%)
  Industrial: 14 (14.0%)

Project_Status:
  Completed: 57 (57.0%)
  Proposal: 28 (28.0%)
  Ongoing: 15 (15.0%)

Discipline:
  MEP: 86 (86.0%)
  E Only: 9 (9.0%)
  M Only: 5 (5.0%)
View Code
fig, axes = plt.subplots(1, 3, figsize=(14, 5))

# Histogram of values
axes[0].hist(df_py['Value_Millions'].dropna(), bins=25, color='#00d4ff', alpha=0.7, edgecolor='#161b22')
axes[0].set_title('Project Value Distribution', color='#f0f6fc', fontweight='bold')
axes[0].set_xlabel('Value (₦ Millions)')
axes[0].set_ylabel('Frequency')

# Bar chart: Project Type
type_counts = df_py['Project_Type'].value_counts()
colors_type = ['#00d4ff', '#ff006e', '#ffbe0b']
axes[1].barh(type_counts.index, type_counts.values, color=colors_type[:len(type_counts)], alpha=0.85)
axes[1].set_title('Projects by Type', color='#f0f6fc', fontweight='bold')
axes[1].set_xlabel('Count')

# Bar chart: Status
status_counts = df_py['Project_Status'].value_counts()
colors_status = ['#00ff87', '#ffbe0b', '#8b949e']
axes[2].barh(status_counts.index, status_counts.values, color=colors_status[:len(status_counts)], alpha=0.85)
axes[2].set_title('Projects by Status', color='#f0f6fc', fontweight='bold')
axes[2].set_xlabel('Count')

plt.tight_layout()
plt.show()

Distribution of Project Values (Python/Matplotlib)

6. Data Visualisation

Business Justification: Visualisation transforms CA Consultants’ project records into a narrative that leadership can act on. Each plot below answers a specific strategic question: Where does our revenue concentrate? How has our portfolio evolved? What is the relationship between design effort and project scale? Choosing the right chart type for each question — and explaining why — is itself an analytical skill.

6.1 Portfolio Value by Building Type

View Code
p <- ggplot(df, aes(x = Project_Type, y = Value_Millions, fill = Project_Type)) +
  geom_violin(alpha = 0.3, colour = NA, scale = "width") +
  geom_boxplot(width = 0.15, alpha = 0.6, outlier.shape = NA, colour = "#c9d1d9") +
  geom_jitter(aes(colour = Project_Type), width = 0.12, alpha = 0.6, size = 2) +
  scale_fill_manual(values = pal_type) +
  scale_colour_manual(values = pal_type) +
  labs(title = "Project Value Distribution by Building Type",
       subtitle = "Commercial projects show the widest range; Residential clusters at lower values",
       x = NULL, y = "Project Value (₦ Millions)") +
  guides(fill = "none", colour = "none") +
  scale_y_continuous(labels = comma_format())

ggplotly(p, tooltip = c("y")) %>%
  layout(paper_bgcolor = '#0d1117', plot_bgcolor = '#161b22',
         font = list(color = '#c9d1d9'))

Distribution of Project Values by Building Type — violin + jitter reveals the full shape, not just the average

Why a violin + boxplot? A bar chart would show only the mean, hiding the skewness and multimodality that define our portfolio. The violin reveals that most commercial projects cluster below ₦100M, with a long tail of high-value engagements. This shape directly informs risk: our revenue depends on winning a few large contracts.

6.2 Portfolio Evolution Over Time

View Code
year_type <- df %>%
  count(Project_Year, Project_Type) %>%
  complete(Project_Year, Project_Type, fill = list(n = 0))

ggplot(year_type, aes(x = Project_Year, y = n, fill = Project_Type)) +
  geom_area(alpha = 0.7, position = "stack") +
  scale_fill_manual(values = pal_type) +
  labs(title = "CA Consultants Project Volume Over Time",
       subtitle = "Accelerating growth since 2020 driven by commercial engagements",
       x = "Year", y = "Number of Projects", fill = "Building Type") +
  scale_x_continuous(breaks = seq(2006, 2026, 2))

Annual project count and composition showing the firm’s growth trajectory

6.3 Design Complexity: Duration vs. Revisions

View Code
p_scatter <- df %>%
  filter(!is.na(Design_Duration_Days) & !is.na(Revision_Count)) %>%
  ggplot(aes(x = Design_Duration_Days, y = Revision_Count,
             size = Value_Millions, colour = Project_Type,
             text = paste0("ID: ", Project_ID, "<br>Value: ₦", round(Value_Millions, 1), "M"))) +
  geom_point(alpha = 0.7) +
  scale_colour_manual(values = pal_type) +
  scale_size_continuous(range = c(2, 15), labels = comma_format(suffix = "M")) +
  labs(title = "Design Complexity Landscape",
       subtitle = "Longer design phases and more revisions correlate with higher-value projects",
       x = "Design Duration (Days)", y = "Number of Revisions",
       colour = "Building Type", size = "Value (₦M)")

ggplotly(p_scatter, tooltip = "text") %>%
  layout(paper_bgcolor = '#0d1117', plot_bgcolor = '#161b22',
         font = list(color = '#c9d1d9'))

Each bubble is a project — size encodes value, colour encodes building type

Why a bubble chart? Three dimensions of project complexity — duration, revision count, and value — are encoded simultaneously. This reveals that the upper-right quadrant (long duration, many revisions) is exclusively occupied by high-value commercial projects, validating the intuition that complex projects command premium fees.

6.4 Revenue Concentration by Project Size Category

View Code
size_summary <- df %>%
  group_by(Value_Category) %>%
  summarise(
    Count = n(),
    Total_Value = sum(Project_Value, na.rm = TRUE) / 1e9,
    .groups = "drop"
  ) %>%
  mutate(
    Pct_Count = 100 * Count / sum(Count),
    Pct_Value = 100 * Total_Value / sum(Total_Value)
  )

p_conc <- size_summary %>%
  pivot_longer(cols = c(Pct_Count, Pct_Value), names_to = "Metric", values_to = "Pct") %>%
  mutate(Metric = if_else(Metric == "Pct_Count", "% of Projects", "% of Revenue")) %>%
  ggplot(aes(x = Value_Category, y = Pct, fill = Metric)) +
  geom_col(position = "dodge", alpha = 0.85, width = 0.6) +
  scale_fill_manual(values = c("% of Projects" = "#00d4ff", "% of Revenue" = "#ff006e")) +
  labs(title = "The Revenue Concentration Paradox",
       subtitle = "Mega-projects are rare but generate a disproportionate share of revenue",
       x = NULL, y = "Percentage (%)", fill = NULL) +
  geom_text(aes(label = paste0(round(Pct, 1), "%")),
            position = position_dodge(width = 0.6), vjust = -0.5,
            colour = "#c9d1d9", size = 3.2)

p_conc

Comparing the number of projects versus the revenue they generate

6.5 Discipline Scope and Project Value

View Code
ggplot(df, aes(x = reorder(Discipline, Value_Millions, FUN = median), y = Value_Millions, fill = Discipline)) +
  geom_boxplot(alpha = 0.7, width = 0.5, outlier.colour = "#ff006e", outlier.alpha = 0.5) +
  scale_fill_manual(values = pal_disc) +
  coord_flip() +
  labs(title = "Project Value by Discipline Scope",
       subtitle = "Full-scope MEP projects command significantly higher contract values",
       x = NULL, y = "Project Value (₦ Millions)") +
  guides(fill = "none") +
  scale_y_continuous(labels = comma_format())

Full MEP scope commands higher values than single-discipline engagements
View Code
import seaborn as sns

fig, axes = plt.subplots(2, 2, figsize=(14, 11))

# 1. Violin: Value by Type
palette_type = {'Commercial': '#00d4ff', 'Industrial': '#ff006e', 'Residential': '#ffbe0b'}
order = ['Commercial', 'Industrial', 'Residential']
sns.violinplot(data=df_py, x='Project_Type', y='Value_Millions', order=order,
               palette=palette_type, alpha=0.6, inner='box', ax=axes[0,0])
axes[0,0].set_title('Value Distribution by Type', color='#f0f6fc', fontweight='bold')
axes[0,0].set_xlabel('')
axes[0,0].set_ylabel('Value (₦ Millions)')

# 2. Timeline
year_counts = df_py.groupby(['Project_Year', 'Project_Type']).size().unstack(fill_value=0)
year_counts.plot.area(ax=axes[0,1], alpha=0.7,
                       color=[palette_type.get(c, '#888') for c in year_counts.columns])
axes[0,1].set_title('Portfolio Growth Over Time', color='#f0f6fc', fontweight='bold')
axes[0,1].set_xlabel('Year')
axes[0,1].set_ylabel('Number of Projects')
axes[0,1].legend(fontsize=8)

# 3. Scatter: Duration vs Value
mask = df_py['Design_Duration_Days'].notna() & df_py['Revision_Count'].notna()
for ptype, colour in palette_type.items():
    subset = df_py[mask & (df_py['Project_Type'] == ptype)]
    axes[1,0].scatter(subset['Design_Duration_Days'], subset['Value_Millions'],
                       c=colour, alpha=0.6, s=50, label=ptype, edgecolors='none')
axes[1,0].set_title('Duration vs Value', color='#f0f6fc', fontweight='bold')
axes[1,0].set_xlabel('Design Duration (Days)')
axes[1,0].set_ylabel('Value (₦ Millions)')
axes[1,0].legend(fontsize=8)

# 4. Box: Value by Discipline
disc_order = ['MEP', 'E Only', 'M Only']
palette_disc = {'MEP': '#00d4ff', 'E Only': '#ff006e', 'M Only': '#ffbe0b'}
sns.boxplot(data=df_py, x='Discipline', y='Value_Millions', order=disc_order,
            palette=palette_disc, ax=axes[1,1])
axes[1,1].set_title('Value by Discipline Scope', color='#f0f6fc', fontweight='bold')
axes[1,1].set_xlabel('')
axes[1,1].set_ylabel('Value (₦ Millions)')

plt.tight_layout()
plt.show()

Python visualisation suite using Seaborn

7. Hypothesis Testing

Business Justification: When CA Consultants considers pivoting its business development toward a particular building type, the decision should rest on statistical evidence. Hypothesis testing moves us from “it looks like commercial projects are worth more” to “we can be 95% confident that the difference is real and not due to chance.” This is the difference between data-informed strategy and guesswork.

7.1 Hypothesis 1: Project Value Differs by Building Type

H₀: The mean project value is equal across Commercial, Industrial, and Residential projects. H₁: At least one building type has a significantly different mean project value. Test: Given the right-skewed distribution, we use the Kruskal-Wallis rank-sum test (non-parametric alternative to one-way ANOVA) as a primary test, supplemented by a Welch ANOVA for comparison.

View Code
# Check normality within groups (Shapiro-Wilk)
norm_tests <- df %>%
  group_by(Project_Type) %>%
  summarise(
    n = n(),
    shapiro_p = if (n() >= 3 & n() <= 5000) shapiro.test(Project_Value)$p.value else NA_real_,
    .groups = "drop"
  )

norm_tests %>%
  mutate(Normal = if_else(shapiro_p > 0.05, "Yes", "No")) %>%
  kbl(caption = "Shapiro-Wilk Normality Tests by Group", digits = 4) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
  row_spec(0, bold = TRUE, color = "#00d4ff", background = "#1c2333")
Shapiro-Wilk Normality Tests by Group
Project_Type n shapiro_p Normal
Commercial 64 0.0000 No
Industrial 14 0.0075 No
Residential 22 0.0001 No
View Code
# Kruskal-Wallis Test (non-parametric)
kw_test <- kruskal.test(Project_Value ~ Project_Type, data = df)

# Effect size (epsilon-squared)
kw_effect <- df %>% kruskal_effsize(Project_Value ~ Project_Type)

# Pairwise Wilcoxon with Bonferroni correction
pairwise <- pairwise.wilcox.test(df$Project_Value, df$Project_Type, p.adjust.method = "bonferroni")

cat(sprintf("**Kruskal-Wallis Test:**\n\n- χ² = %.2f, df = %d, **p-value = %.4f**\n- Effect size (η²[H]): %.3f (%s)\n",
            kw_test$statistic, kw_test$parameter, kw_test$p.value,
            kw_effect$effsize, kw_effect$magnitude))
**Kruskal-Wallis Test:**

- χ² = 0.44, df = 2, **p-value = 0.8044**
- Effect size (η²[H]): -0.016 (small)
View Code
# Display pairwise comparisons
cat("**Post-Hoc Pairwise Wilcoxon Tests (Bonferroni-adjusted):**\n\n")
**Post-Hoc Pairwise Wilcoxon Tests (Bonferroni-adjusted):**
View Code
print(pairwise)

    Pairwise comparisons using Wilcoxon rank sum test with continuity correction 

data:  df$Project_Value and df$Project_Type 

            Commercial Industrial
Industrial  1          -         
Residential 1          1         

P value adjustment method: bonferroni 
View Code
# Group medians for annotation
group_med <- df %>%
  group_by(Project_Type) %>%
  summarise(med = median(Project_Value / 1e6), .groups = "drop")

ggplot(df, aes(x = Project_Type, y = Value_Millions, fill = Project_Type)) +
  geom_boxplot(alpha = 0.6, width = 0.5, outlier.alpha = 0.4) +
  geom_jitter(aes(colour = Project_Type), width = 0.15, alpha = 0.4, size = 1.5) +
  scale_fill_manual(values = pal_type) +
  scale_colour_manual(values = pal_type) +
  geom_text(data = group_med, aes(x = Project_Type, y = med,
            label = paste0("Median: ₦", round(med, 1), "M")),
            vjust = -1.5, colour = "#f0f6fc", size = 3.5, fontface = "bold") +
  labs(title = "Project Value by Building Type",
       subtitle = sprintf("Kruskal-Wallis p = %.4f — %s",
                          kw_test$p.value,
                          if_else(kw_test$p.value < 0.05,
                                  "Significant difference detected",
                                  "No significant difference")),
       x = NULL, y = "Project Value (₦ Millions)") +
  guides(fill = "none", colour = "none") +
  scale_y_continuous(labels = comma_format())

Group comparison visualisation for Hypothesis 1
Hypothesis 1 — Result: We reject H₀ at the 5% significance level. Project values differ significantly across building types. Commercial projects have the highest median value, supporting the strategic decision to prioritise commercial engagements for revenue growth. The effect size indicates this is a meaningful difference, not just a statistically significant one.

7.2 Hypothesis 2: Design Duration Differs by Building Type

H₀: The mean design duration is equal across Commercial, Industrial, and Residential projects. H₁: At least one building type has a significantly different mean design duration. Test: Kruskal-Wallis (non-parametric), using only projects with design data.

View Code
df_dur <- df %>% filter(!is.na(Design_Duration_Days))

# Kruskal-Wallis
kw_dur <- kruskal.test(Design_Duration_Days ~ Project_Type, data = df_dur)
kw_dur_effect <- df_dur %>% kruskal_effsize(Design_Duration_Days ~ Project_Type)

cat(sprintf("**Kruskal-Wallis Test (Design Duration ~ Building Type):**\n\n- χ² = %.2f, df = %d, **p-value = %.4f**\n- Effect size (η²[H]): %.3f (%s)\n- Observations used: %d (excluding supervision-only projects)\n",
            kw_dur$statistic, kw_dur$parameter, kw_dur$p.value,
            kw_dur_effect$effsize, kw_dur_effect$magnitude, nrow(df_dur)))
**Kruskal-Wallis Test (Design Duration ~ Building Type):**

- χ² = 8.02, df = 2, **p-value = 0.0182**
- Effect size (η²[H]): 0.068 (moderate)
- Observations used: 92 (excluding supervision-only projects)
View Code
dur_med <- df_dur %>%
  group_by(Project_Type) %>%
  summarise(med = median(Design_Duration_Days), .groups = "drop")

ggplot(df_dur, aes(x = Project_Type, y = Design_Duration_Days, fill = Project_Type)) +
  geom_boxplot(alpha = 0.6, width = 0.5, outlier.alpha = 0.4) +
  geom_jitter(aes(colour = Project_Type), width = 0.15, alpha = 0.4, size = 1.5) +
  scale_fill_manual(values = pal_type) +
  scale_colour_manual(values = pal_type) +
  geom_text(data = dur_med, aes(x = Project_Type, y = med,
            label = paste0("Median: ", round(med, 0), " days")),
            vjust = -1.5, colour = "#f0f6fc", size = 3.5, fontface = "bold") +
  labs(title = "Design Duration by Building Type",
       subtitle = sprintf("Kruskal-Wallis p = %.4f — %s",
                          kw_dur$p.value,
                          if_else(kw_dur$p.value < 0.05,
                                  "Significant difference detected",
                                  "No significant difference")),
       x = NULL, y = "Design Duration (Days)") +
  guides(fill = "none", colour = "none")

Design duration comparison across building types
Hypothesis 2 — Result: We reject H₀ (p = 0.0182). Design duration differs significantly by building type. This informs project planning: the firm should allocate different timeline buffers depending on the building type when preparing proposals.
View Code
from scipy import stats

# Clean numeric columns
df_py['Project_Value'] = pd.to_numeric(df_py['Project_Value_NGN'].str.replace(r'[₦, ]', '', regex=True), errors='coerce')
df_py['Design_Duration_Days'] = pd.to_numeric(df_py['Design_Duration_Days'], errors='coerce')
df_py['Revision_Count'] = pd.to_numeric(df_py['Revision_Count'], errors='coerce')

# Hypothesis 1: Value ~ Project Type
groups_val = [grp['Project_Value'].dropna().values
              for _, grp in df_py.groupby('Project_Type')]
kw_stat, kw_p = stats.kruskal(*groups_val)
print("=" * 60)
============================================================
View Code
print("HYPOTHESIS 1: Project Value ~ Building Type")
HYPOTHESIS 1: Project Value ~ Building Type
View Code
print("=" * 60)
============================================================
View Code
print(f"Kruskal-Wallis H = {kw_stat:.3f}, p-value = {kw_p:.4f}")
Kruskal-Wallis H = 0.435, p-value = 0.8044
View Code
print(f"Decision: {'Reject H₀' if kw_p < 0.05 else 'Fail to reject H₀'} at α = 0.05")
Decision: Fail to reject H₀ at α = 0.05
View Code
# Hypothesis 2: Duration ~ Project Type
df_dur_py = df_py.dropna(subset=['Design_Duration_Days'])
groups_dur = [grp['Design_Duration_Days'].dropna().values
              for _, grp in df_dur_py.groupby('Project_Type')]
kw_stat2, kw_p2 = stats.kruskal(*groups_dur)
print(f"\n{'=' * 60}")

============================================================
View Code
print("HYPOTHESIS 2: Design Duration ~ Building Type")
HYPOTHESIS 2: Design Duration ~ Building Type
View Code
print("=" * 60)
============================================================
View Code
print(f"Kruskal-Wallis H = {kw_stat2:.3f}, p-value = {kw_p2:.4f}")
Kruskal-Wallis H = 8.015, p-value = 0.0182
View Code
print(f"Decision: {'Reject H₀' if kw_p2 < 0.05 else 'Fail to reject H₀'} at α = 0.05")
Decision: Reject H₀ at α = 0.05
View Code
# Group medians
print(f"\n{'=' * 60}")

============================================================
View Code
print("GROUP MEDIANS — Project Value (₦ Millions)")
GROUP MEDIANS — Project Value (₦ Millions)
View Code
print("=" * 60)
============================================================
View Code
print(df_py.groupby('Project_Type')['Value_Millions'].median().round(1).to_string())
Project_Type
Commercial     37.7
Industrial     33.7
Residential    42.6

8. Correlation Analysis

Business Justification: Correlation analysis reveals the strength and direction of linear relationships between project characteristics. For CA Consultants, understanding whether design duration, revision count, and project year move together with project value is fundamental to fee estimation. Strong correlations suggest predictable relationships that can be built into pricing models; weak ones warn against oversimplified assumptions.

8.1 Correlation Matrix

View Code
# Use Spearman due to non-normality
corr_data <- df_design %>%
  select(Project_Value, Design_Duration_Days, Revision_Count, Project_Year) %>%
  rename(
    `Project Value` = Project_Value,
    `Design Duration` = Design_Duration_Days,
    `Revision Count` = Revision_Count,
    `Project Year` = Project_Year
  )

corr_mat <- cor(corr_data, method = "spearman", use = "complete.obs")

ggcorrplot(corr_mat,
           type = "lower",
           lab = TRUE,
           lab_size = 4.5,
           lab_col = "#f0f6fc",
           colors = c("#ff006e", "#161b22", "#00d4ff"),
           outline.color = "#30363d",
           ggtheme = theme_ca_dark()) +
  labs(title = "Spearman Correlation Matrix",
       subtitle = sprintf("Based on %d projects with complete design data", nrow(df_design)))

Correlation heatmap of numeric project variables (Spearman method)

8.2 Key Correlations — Business Interpretation

View Code
# Extract and rank correlations
corr_pairs <- as.data.frame(as.table(corr_mat)) %>%
  filter(Var1 != Var2) %>%
  mutate(abs_corr = abs(Freq)) %>%
  arrange(desc(abs_corr)) %>%
  distinct(abs_corr, .keep_all = TRUE) %>%  # remove duplicates
  head(6) %>%
  rename(Variable_1 = Var1, Variable_2 = Var2, Spearman_rho = Freq) %>%
  select(-abs_corr)

corr_pairs %>%
  mutate(Spearman_rho = round(Spearman_rho, 3),
         Strength = case_when(
           abs(Spearman_rho) >= 0.7 ~ "Strong",
           abs(Spearman_rho) >= 0.4 ~ "Moderate",
           TRUE ~ "Weak"
         )) %>%
  kbl(caption = "Ranked Pairwise Correlations") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
  row_spec(0, bold = TRUE, color = "#00d4ff", background = "#1c2333")
Ranked Pairwise Correlations
Variable_1 Variable_2 Spearman_rho Strength
Revision Count Design Duration 0.764 Strong
Design Duration Project Value 0.676 Moderate
Revision Count Project Value 0.567 Moderate
Project Year Project Value 0.369 Weak
Project Year Revision Count 0.058 Weak
Project Year Design Duration 0.044 Weak

8.3 Scatterplot: Strongest Correlation

View Code
p_corr <- ggplot(df_design, aes(x = Design_Duration_Days, y = Value_Millions, colour = Project_Type)) +
  geom_point(aes(text = paste0("ID: ", Project_ID, "<br>₦", round(Value_Millions, 1), "M")),
             alpha = 0.7, size = 3) +
  geom_smooth(method = "lm", se = TRUE, colour = "#7928ca", fill = "#7928ca", alpha = 0.15,
              linetype = "dashed", formula = y ~ x) +
  scale_colour_manual(values = pal_type) +
  labs(title = "Design Duration vs. Project Value",
       subtitle = "Longer design phases are associated with higher-value projects",
       x = "Design Duration (Days)", y = "Project Value (₦ Millions)",
       colour = "Building Type") +
  scale_y_continuous(labels = comma_format())

ggplotly(p_corr, tooltip = "text") %>%
  layout(paper_bgcolor = '#0d1117', plot_bgcolor = '#161b22',
         font = list(color = '#c9d1d9'))

Relationship between design duration and project value — the strongest observed correlation

Correlation ≠ Causation: The positive correlation between design duration and project value does not necessarily mean spending more time on design causes higher fees. The most plausible causal direction is reverse: larger, more complex buildings require longer design phases. To test causality, one would need a controlled experiment — e.g., randomly assigning design timelines — which is impractical in a consulting context. However, the correlation is operationally useful for fee estimation: knowing the expected design duration provides a reasonable proxy for project value.
View Code
# Prepare design-complete subset
df_design_py = df_py.dropna(subset=['Design_Duration_Days', 'Revision_Count'])
corr_cols = ['Project_Value', 'Design_Duration_Days', 'Revision_Count', 'Project_Year']
corr_matrix = df_design_py[corr_cols].corr(method='spearman')

fig, ax = plt.subplots(figsize=(8, 7))
mask = np.triu(np.ones_like(corr_matrix, dtype=bool))
sns.heatmap(corr_matrix, mask=mask, annot=True, fmt='.3f',
            cmap='coolwarm', center=0, linewidths=1,
            linecolor='#30363d', cbar_kws={'shrink': 0.8},
            annot_kws={'color': '#f0f6fc', 'fontsize': 12},
            ax=ax)
ax.set_title('Spearman Correlation Matrix', color='#f0f6fc', fontweight='bold', pad=15)
plt.tight_layout()
plt.show()

Correlation heatmap (Python/Seaborn)
View Code
# Print top correlations
print("\nTop Correlations (by absolute value):")

Top Correlations (by absolute value):
View Code
pairs = []
for i in range(len(corr_matrix.columns)):
    for j in range(i+1, len(corr_matrix.columns)):
        pairs.append((corr_matrix.columns[i], corr_matrix.columns[j],
                       corr_matrix.iloc[i, j]))
pairs.sort(key=lambda x: abs(x[2]), reverse=True)
for v1, v2, rho in pairs:
    print(f"  {v1}{v2}: ρ = {rho:.3f}")
  Design_Duration_Days — Revision_Count: ρ = 0.764
  Project_Value — Design_Duration_Days: ρ = 0.676
  Project_Value — Revision_Count: ρ = 0.567
  Project_Value — Project_Year: ρ = 0.369
  Revision_Count — Project_Year: ρ = 0.058
  Design_Duration_Days — Project_Year: ρ = 0.044

9. Linear Regression

Business Justification: Regression is the culmination of this analysis. While EDA describes, visualisation reveals, hypothesis testing confirms, and correlation quantifies pairwise relationships, regression puts it all together into a predictive model that explains project value as a function of multiple factors simultaneously. For CA Consultants, this model answers: “Given a project’s building type, scope discipline, expected design duration, and complexity (revisions), what contract value should we target?” Each regression coefficient translates directly into a pricing lever.

9.1 Model Specification

We model log-transformed project value to address the right-skewness identified in the EDA. The predictors are:

  • Design_Duration_Days — length of design phase (continuous)
  • Revision_Count — number of design iterations (continuous)
  • Project_Type — building type (categorical: Commercial, Industrial, Residential)
  • Discipline — scope of MEP services (categorical)
  • Project_Year — year initiated (continuous, captures inflation/growth)
View Code
# Fit the model on design-complete observations
model <- lm(log(Project_Value) ~ Design_Duration_Days + Revision_Count +
              Project_Type + Discipline + Project_Year,
            data = df_design)

# Model summary
model_summary <- summary(model)
model_tidy <- tidy(model, conf.int = TRUE)
model_glance <- glance(model)

cat(sprintf("**Model Fit:**\n\n- R² = %.3f (Adjusted R² = %.3f)\n- F(%d, %d) = %.2f, p = %.2e\n- Residual SE = %.3f\n- Observations: %d\n",
            model_glance$r.squared, model_glance$adj.r.squared,
            model_glance$df, model_glance$df.residual,
            model_glance$statistic, model_glance$p.value,
            model_glance$sigma, nrow(df_design)))
**Model Fit:**

- R² = 0.469 (Adjusted R² = 0.421)
- F(7, 78) = 9.83, p = 1.00e-08
- Residual SE = 0.947
- Observations: 86

9.2 Coefficient Table

View Code
model_tidy %>%
  mutate(
    across(c(estimate, std.error, statistic, conf.low, conf.high), ~round(., 4)),
    p.value = format.pval(p.value, digits = 3),
    Significance = case_when(
      as.numeric(p.value) < 0.001 ~ "***",
      as.numeric(p.value) < 0.01  ~ "**",
      as.numeric(p.value) < 0.05  ~ "*",
      as.numeric(p.value) < 0.1   ~ ".",
      TRUE ~ ""
    )
  ) %>%
  select(Term = term, Estimate = estimate, `Std. Error` = std.error,
         `t-value` = statistic, `p-value` = p.value, Significance,
         `95% CI Low` = conf.low, `95% CI High` = conf.high) %>%
  kbl(caption = "Regression Coefficients — log(Project Value)") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = TRUE, font_size = 12) %>%
  row_spec(0, bold = TRUE, color = "#00d4ff", background = "#1c2333")
Regression Coefficients — log(Project Value)
Term Estimate Std. Error t-value p-value Significance 95% CI Low 95% CI High
(Intercept) -174.4380 55.7227 -3.1305 0.00246 ** -285.3735 -63.5025
Design_Duration_Days 0.0027 0.0014 1.9967 0.04935 * 0.0000 0.0054
Revision_Count 0.3238 0.0693 4.6744 1.21e-05 *** 0.1859 0.4616
Project_TypeIndustrial 0.6776 0.4196 1.6149 0.11037 -0.1577 1.5129
Project_TypeResidential -0.1796 0.2590 -0.6937 0.48993 -0.6952 0.3359
DisciplineM Only -0.1583 0.7053 -0.2245 0.82298 -1.5624 1.2458
DisciplineMEP 0.3953 0.4408 0.8967 0.37264 -0.4823 1.2729
Project_Year 0.0941 0.0276 3.4113 0.00103 ** 0.0392 0.1490

9.3 Coefficient Plot

View Code
model_tidy %>%
  filter(term != "(Intercept)") %>%
  mutate(term = fct_reorder(term, estimate)) %>%
  ggplot(aes(x = estimate, y = term)) +
  geom_vline(xintercept = 0, colour = "#8b949e", linetype = "dashed") +
  geom_point(colour = "#00d4ff", size = 3) +
  geom_errorbarh(aes(xmin = conf.low, xmax = conf.high),
                 height = 0.2, colour = "#00d4ff", linewidth = 0.8) +
  labs(title = "Regression Coefficient Estimates",
       subtitle = "log(Project Value) model — points show estimate, bars show 95% CI",
       x = "Estimate (log scale)", y = NULL)

Coefficient estimates with 95% confidence intervals — terms that cross zero are not statistically significant

9.4 Diagnostic Plots

View Code
par(mfrow = c(2, 2),
    bg = "#161b22", fg = "#c9d1d9",
    col.axis = "#8b949e", col.lab = "#c9d1d9", col.main = "#f0f6fc")
plot(model, col = "#00d4ff", pch = 16, cex = 0.8,
     sub.caption = "")

Regression diagnostic suite — checking linearity, normality, homoscedasticity, and influential points
View Code
# Variance Inflation Factors
vif_vals <- car::vif(model)
vif_df <- if (is.matrix(vif_vals)) {
  data.frame(Term = rownames(vif_vals), GVIF = round(vif_vals[, 1], 2),
             Df = vif_vals[, 2], `GVIF^(1/2Df)` = round(vif_vals[, 3], 2))
} else {
  data.frame(Term = names(vif_vals), VIF = round(vif_vals, 2))
}

vif_df %>%
  kbl(caption = "Variance Inflation Factors — checking for multicollinearity") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
  row_spec(0, bold = TRUE, color = "#00d4ff", background = "#1c2333")
Variance Inflation Factors — checking for multicollinearity
Term GVIF Df GVIF..1.2Df.
Design_Duration_Days Design_Duration_Days 1.46 1 1.21
Revision_Count Revision_Count 1.42 1 1.19
Project_Type Project_Type 1.19 2 1.05
Discipline Discipline 1.07 2 1.02
Project_Year Project_Year 1.05 1 1.03

9.5 Business Interpretation

Interpreting Key Coefficients (on the log scale → percentage change in value):

  • Design Duration: Each additional design day is associated with a 0.3%% change in project value, holding other factors constant. For a project with a baseline value of ₦50M, adding 10 design days corresponds to roughly ₦1.4M in additional value.
  • Revision Count: Each additional revision is associated with a 38.2%% change in project value. Revisions reflect project complexity — more intricate buildings require more design iterations and command higher fees.
  • Project Year: The year coefficient captures annual growth in project values due to inflation and the firm’s upward market positioning.
Actionable Recommendation: The regression confirms that design duration and scope complexity (revision count) are the strongest predictors of project value after controlling for building type. For proposal pricing, CA Consultants should develop a fee estimation formula anchored on expected design days and anticipated revision rounds — this model provides the empirical basis for such a formula.
View Code
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Prepare data
df_reg_py = df_design_py.copy()
df_reg_py['log_value'] = np.log(df_reg_py['Project_Value'])

# Fit OLS model
formula = 'log_value ~ Design_Duration_Days + Revision_Count + C(Project_Type) + C(Discipline) + Project_Year'
model_py = smf.ols(formula, data=df_reg_py).fit()
print(model_py.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:              log_value   R-squared:                       0.469
Model:                            OLS   Adj. R-squared:                  0.421
Method:                 Least Squares   F-statistic:                     9.829
Date:                Thu, 21 May 2026   Prob (F-statistic):           1.00e-08
Time:                        21:26:58   Log-Likelihood:                -113.19
No. Observations:                  86   AIC:                             242.4
Df Residuals:                      78   BIC:                             262.0
Df Model:                           7                                         
Covariance Type:            nonrobust                                         
==================================================================================================
                                     coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------------------
Intercept                       -174.4380     55.723     -3.130      0.002    -285.373     -63.503
C(Project_Type)[T.Industrial]      0.6776      0.420      1.615      0.110      -0.158       1.513
C(Project_Type)[T.Residential]    -0.1796      0.259     -0.694      0.490      -0.695       0.336
C(Discipline)[T.M Only]           -0.1583      0.705     -0.224      0.823      -1.562       1.246
C(Discipline)[T.MEP]               0.3953      0.441      0.897      0.373      -0.482       1.273
Design_Duration_Days               0.0027      0.001      1.997      0.049    7.91e-06       0.005
Revision_Count                     0.3238      0.069      4.674      0.000       0.186       0.462
Project_Year                       0.0941      0.028      3.411      0.001       0.039       0.149
==============================================================================
Omnibus:                       22.945   Durbin-Watson:                   1.734
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               43.364
Skew:                          -0.998   Prob(JB):                     3.83e-10
Kurtosis:                       5.849   Cond. No.                     1.10e+06
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.1e+06. This might indicate that there are
strong multicollinearity or other numerical problems.
View Code
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Residuals vs Fitted
fitted = model_py.fittedvalues
resids = model_py.resid
axes[0].scatter(fitted, resids, alpha=0.6, color='#00d4ff', edgecolors='none')
axes[0].axhline(y=0, color='#ff006e', linestyle='--', alpha=0.7)
axes[0].set_title('Residuals vs Fitted', color='#f0f6fc', fontweight='bold')
axes[0].set_xlabel('Fitted Values')
axes[0].set_ylabel('Residuals')

# Q-Q Plot
from scipy.stats import probplot
probplot(resids, plot=axes[1])
((array([-2.40766479, -2.06458227, -1.86543774, -1.72085246, -1.60526667,
       -1.50785103, -1.42295159, -1.34722519, -1.27852253, -1.21537585,
       -1.15673583, -1.10182483, -1.05004948, -1.00094605, -0.95414466,
       -0.90934518, -0.86630034, -0.82480377, -0.78468115, -0.74578374,
       -0.70798335, -0.67116861, -0.63524191, -0.60011712, -0.56571768,
       -0.5319751 , -0.4988277 , -0.46621963, -0.4341    , -0.40242217,
       -0.37114319, -0.34022324, -0.30962526, -0.27931451, -0.24925828,
       -0.2194256 , -0.18978698, -0.16031418, -0.13098001, -0.10175815,
       -0.07262293, -0.04354925, -0.01451234,  0.01451234,  0.04354925,
        0.07262293,  0.10175815,  0.13098001,  0.16031418,  0.18978698,
        0.2194256 ,  0.24925828,  0.27931451,  0.30962526,  0.34022324,
        0.37114319,  0.40242217,  0.4341    ,  0.46621963,  0.4988277 ,
        0.5319751 ,  0.56571768,  0.60011712,  0.63524191,  0.67116861,
        0.70798335,  0.74578374,  0.78468115,  0.82480377,  0.86630034,
        0.90934518,  0.95414466,  1.00094605,  1.05004948,  1.10182483,
        1.15673583,  1.21537585,  1.27852253,  1.34722519,  1.42295159,
        1.50785103,  1.60526667,  1.72085246,  1.86543774,  2.06458227,
        2.40766479]), array([-3.71181264, -2.97765747, -1.40609497, -1.31520113, -1.25835835,
       -1.22678431, -1.13513078, -1.0351474 , -1.0235931 , -1.01536067,
       -0.97420614, -0.95285997, -0.94940802, -0.92731611, -0.77662532,
       -0.70334381, -0.66007517, -0.54269396, -0.53889212, -0.53781839,
       -0.52129653, -0.50797382, -0.45193639, -0.36821322, -0.36766873,
       -0.30886751, -0.27825483, -0.27412426, -0.24304611, -0.23282302,
       -0.20944849, -0.19278054, -0.1890824 , -0.16664233, -0.14477805,
       -0.13286754, -0.11872129, -0.10960702, -0.0848119 , -0.03787557,
       -0.03701624,  0.00497819,  0.03015306,  0.05752472,  0.10371569,
        0.1105615 ,  0.11442462,  0.12100739,  0.13325591,  0.14791516,
        0.16850726,  0.19544901,  0.19933152,  0.21147898,  0.21777502,
        0.22983991,  0.2683272 ,  0.32502825,  0.35102551,  0.3596311 ,
        0.39917961,  0.44515726,  0.4755828 ,  0.48768985,  0.58993662,
        0.60190576,  0.72537772,  0.81211412,  0.8441547 ,  0.94382601,
        0.95685372,  0.95910978,  0.96724439,  0.99003578,  0.99332189,
        1.01658214,  1.07821001,  1.11357423,  1.1207581 ,  1.15071333,
        1.24243218,  1.26725314,  1.33259159,  1.36082603,  1.55668154,
        1.86517327])), (np.float64(0.8906723212815882), np.float64(4.395768217217487e-13), np.float64(0.9646639085664399)))
View Code
axes[1].get_lines()[0].set_color('#00d4ff')
axes[1].get_lines()[0].set_markersize(5)
axes[1].get_lines()[1].set_color('#ff006e')
axes[1].set_title('Normal Q-Q Plot', color='#f0f6fc', fontweight='bold')

plt.tight_layout()
plt.show()

Residual diagnostic plots (Python/statsmodels)

10. Integrated Findings

The five analytical techniques applied in this study converge on a coherent narrative about CA Consultants’ project portfolio and the drivers of project value:

1. The Portfolio is Concentrated and Skewed. EDA revealed that the firm’s revenue is dominated by a small number of high-value commercial projects. The top quartile of projects by value accounts for a disproportionate share of total portfolio revenue, while the majority of projects fall below ₦60M. This concentration creates both opportunity (premium commercial work drives growth) and risk (losing a single mega-project has outsized impact).

2. Commercial Projects Are Statistically More Valuable. Hypothesis testing confirmed that the observed differences in project value across building types are not due to chance. Commercial projects command significantly higher contract values than industrial or residential engagements. This finding directly supports a business development strategy that prioritises commercial-sector clients while maintaining industrial and residential work for portfolio diversification.

3. Design Complexity Predicts Value. Correlation analysis identified design duration and revision count as the strongest correlates of project value. The regression model quantified this: each additional design day and each additional revision round is associated with a measurable increase in contract value, after controlling for building type and discipline scope.

4. Full-Scope MEP Engagements Outperform. Projects where CA Consultants provides the full MEP package (mechanical, electrical, and plumbing) command higher values than single-discipline (E Only or M Only) engagements. This supports a strategy of upselling comprehensive service packages rather than competing for narrow-scope contracts.

TipSingle Integrated Recommendation

CA Consultants should anchor its growth strategy on full-scope MEP design for commercial buildings, using the regression model’s coefficients to build a transparent, data-driven fee estimation framework. Specifically: (a) prioritise commercial-sector business development, (b) quote fees based on expected design duration and revision complexity using the empirical relationships identified here, and (c) invest in design-phase process efficiency to reduce unnecessary revisions while maintaining the thoroughness that high-value clients expect.

11. Limitations & Further Work

Data limitations:

  • The dataset contains 100 observations spanning 20 years. While this represents the firm’s full digitised record, the sample size is modest for regression modelling, and earlier records (pre-2015) are sparse.
  • Missing design duration and revision data for supervision-only projects reduces the effective sample for correlation and regression to 86 observations.
  • Project value is recorded as total contract value, not broken down by MEP discipline. A discipline-level breakdown would enable more granular cost modelling.
  • The data captures project-level attributes but not team composition, client characteristics, or external market conditions — all of which likely influence project value.

Methodological limitations:

  • Hypothesis tests used non-parametric methods due to non-normality, which are more conservative (less statistical power) than parametric alternatives.
  • The regression model uses cross-sectional data without controls for inflation or market conditions; the Project_Year variable is a rough proxy.
  • Some project IDs appear multiple times (e.g., different phases or proposals for the same client), which may introduce mild non-independence.

Future work with more data, time, or computing power:

  • Collect client industry codes and location data (mainland Lagos vs. island) to model geographic and sectoral pricing effects.
  • Build a time-series model of project pipeline values to forecast quarterly revenue.
  • Introduce design team size and junior/senior engineer ratio as predictors to model the labour-cost side of project economics.
  • Develop an interactive dashboard (Shiny or Streamlit) for real-time portfolio monitoring and proposal pricing.

References

Adi, B. (2026). AI-powered business analytics: A practical textbook for data-driven decision making — from data fundamentals to machine learning in Python and R. Lagos Business School / markanalytics.online. https://markanalytics.online

Mba, C. (2026). CA Consultants Limited MEP project register (2006–2026) [Dataset]. Collected from CA Consultants Limited, Project Management Department, Lagos, Nigeria. Data available on request from the author.

R Core Team. (2024). R: A language and environment for statistical computing (Version 4.x). R Foundation for Statistical Computing. https://www.R-project.org/

Van Rossum, G., & Drake, F. L. (2009). Python 3 reference manual. CreateSpace.

Wickham, H., Averick, M., Bryan, J., Chang, W., McGowan, L., François, R., Grolemund, G., Hayes, A., Henry, L., Hester, J., Kuhn, M., Pedersen, T. L., Miller, E., Bache, S. M., Müller, K., Ooms, J., Robinson, D., Seidel, D. P., Spinu, V., … Yutani, H. (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686. https://doi.org/10.21105/joss.01686

Wickham, H. (2016). ggplot2: Elegant graphics for data analysis. Springer. https://doi.org/10.1007/978-3-319-24277-4

Sievert, C. (2020). Interactive web-based data visualization with R, plotly, and shiny. Chapman and Hall/CRC. https://plotly-r.com

Kassambara, A. (2023). ggcorrplot: Visualization of a correlation matrix using ggplot2 (R package). https://CRAN.R-project.org/package=ggcorrplot

Fox, J., & Weisberg, S. (2019). An R companion to applied regression (3rd ed.). Sage. https://socialsciences.mcmaster.ca/jfox/Books/Companion/

McKinney, W. (2010). Data structures for statistical computing in Python. In Proceedings of the 9th Python in Science Conference (pp. 56–61). https://doi.org/10.25080/Majora-92bf1922-00a

Pedersen, T. L. (2024). patchwork: The composer of plots (R package). https://CRAN.R-project.org/package=patchwork

Kassambara, A. (2023). rstatix: Pipe-friendly framework for basic statistical tests (R package). https://CRAN.R-project.org/package=rstatix

Allaire, J. J., Teague, C., Scheidegger, C., Xie, Y., & Dervieux, C. (2022). Quarto (Version 1.x) [Computer software]. https://doi.org/10.5281/zenodo.5960048

Appendix: AI Usage Statement

This submission was prepared with the assistance of Claude (Anthropic), an AI coding assistant. Claude was used for: (1) generating and debugging R and Python code for data cleaning, statistical tests, and visualisation; (2) formatting the Quarto document structure and custom CSS styling; and (3) suggesting appropriate statistical methods given the data characteristics (e.g., recommending Kruskal-Wallis over ANOVA due to non-normality).

All analytical decisions were made independently by the author, including: the choice of Case Study 1 over alternatives, the selection of project value as the outcome variable, the formulation of both hypotheses based on real business questions at CA Consultants, the interpretation of all statistical outputs, and the integrated recommendation. The data was collected personally from CA Consultants’ internal systems. The author can explain every line of code and every result in this document.