Engineering Value

An Exploratory & Inferential Study of MEP Project Performance at CA Consultants Limited

Author

Chidubem Mba — Senior Project Engineer & Project Manager

Published

May 21, 2026

1. Executive Summary

This study analyses the project portfolio of CA Consultants Limited, a leading MEP (Mechanical, Electrical, and Plumbing) consulting firm headquartered in Lagos, Nigeria. Drawing on 100 real project records spanning 2006–2026, extracted from the firm’s internal project management systems, I investigate what drives project value across different building types, service disciplines, and design complexity levels. Five foundational analytical techniques — exploratory data analysis, data visualisation, hypothesis testing, correlation analysis, and multiple linear regression — are applied to move from exploration to inference. Key findings reveal that the portfolio is heavily right-skewed, with a small number of high-value projects contributing disproportionately to total revenue. Commercial projects command significantly higher values than residential and industrial engagements, and design duration together with revision count are meaningful predictors of project value. The analysis yields an actionable pricing and resource-allocation framework: longer-duration, multi-discipline MEP projects consistently deliver greater value, and the firm should prioritise commercial engagements while investing in design-phase efficiency to reduce costly revision cycles.

2. Professional Disclosure

Name: Chidubem Mba Title: Senior Project Engineer & Project Manager Organisation: CA Consultants Limited, Lagos, Nigeria Sector: Construction / Building Services (MEP Engineering)

CA Consultants Limited is a Mechanical, Electrical, and Plumbing (MEP) consulting firm operating in the Nigerian construction industry. The firm provides design and construction supervision services for commercial, industrial, and residential buildings — from office towers and factories to luxury residences. As a Senior Project Engineer and Project Manager, I am responsible for managing design deliverables, coordinating with architects and contractors, overseeing project timelines, and ensuring quality of MEP installations.

Technique Justification

Exploratory Data Analysis (EDA): Understanding the composition and quality of our project data is the essential first step. As a project manager, I need to know the distribution of project sizes, identify data gaps (such as supervision-only projects lacking design records), and detect anomalous entries before any strategic conclusions can be drawn. EDA allows me to assess the health of our data and uncover patterns that inform all subsequent analyses.

Data Visualisation: Communicating project performance to firm leadership and clients requires clear, compelling visual narratives. Visualisation transforms raw project records into actionable intelligence — showing at a glance which sectors drive revenue, how portfolio composition has shifted over time, and where design complexity concentrates. Every plot in this study answers a question a principal at CA Consultants would ask in a management meeting.

Hypothesis Testing: Resource allocation decisions at CA Consultants — such as whether to pursue more commercial versus industrial work — must be grounded in statistical evidence, not intuition. Formal hypothesis testing allows me to determine whether observed differences in project value across building types are statistically significant or merely due to sampling variability. This directly informs our business development strategy.

Correlation Analysis: Understanding the relationships between project characteristics — value, design duration, revision count, and scope — is critical for project planning. If design duration correlates strongly with project value, that relationship shapes how we quote fees and allocate engineering hours. Identifying which correlations are strongest, and which are spurious, helps avoid costly planning errors.

Linear Regression: The ultimate goal is a quantitative model that explains and predicts project value based on measurable project attributes. A regression model allows me to quantify the marginal contribution of each extra design day, each additional revision, and each project type to the contract value. This directly supports fee-setting, proposal evaluation, and capacity planning at CA Consultants.

3. Data Collection & Sampling

Source and Collection Method

The dataset was extracted directly from the internal project register of CA Consultants Limited. This register is maintained by the project management team (of which I am a member) and records every project the firm undertakes — from initial proposal through to completion. Data extraction was performed manually from the company’s enterprise filing system and project tracking spreadsheets in April 2026.

Sampling Frame and Size

The dataset constitutes a census of all identifiable projects in the firm’s records from 2006 to 2026, comprising 100 project records. This is not a sample but a near-complete enumeration of the firm’s project history, subject to the availability of records in the company’s systems. Projects prior to 2006 were excluded due to incomplete digitisation of older records.

Variables Collected

Variable	Type	Description
`Project_ID`	Identifier	Unique project code assigned by the firm
`Project_Type`	Categorical	Building type: Commercial, Industrial, or Residential
`Project_Status`	Categorical	Current status: Completed, Ongoing, or Proposal
`Project_Value_NGN`	Numeric (continuous)	Contract value in Nigerian Naira (₦)
`Design_Duration_Days`	Numeric (continuous)	Duration of the design phase in calendar days
`Revision_Count`	Numeric (discrete)	Number of design revisions issued
`Project_Year`	Date/Time (year)	Year the project was initiated
`Discipline`	Categorical	Scope of MEP services: MEP (full), E Only, M Only

Handling of Missing Data

A total of 8 observations are missing Design_Duration_Days and 14 observations are missing Revision_Count. These are structurally missing — not data-entry errors. These projects were construction supervision engagements only: CA Consultants supervised the on-site installation of MEP systems designed by another firm. Since no design work was performed in-house, no design duration or revision count exists. This is documented and handled transparently throughout the analysis — these records are retained for portfolio-level analyses (value, type, year) and excluded only from design-specific analyses (correlation, regression).

Ethical Statement

This data was collected in my capacity as a Senior Project Engineer with authorised access to the firm’s project records. All project identifiers use internal codes (e.g., CA658, CA042/PJ027) that do not reveal client names. Financial values are contract-level figures, not client-confidential breakdowns. No personally identifiable information (PII) is included. Permission to use anonymised project data for academic purposes was obtained from CA Consultants management.

4. Data Description

Total Projects 100 2006 — 2026

Portfolio Value ₦6.5B Aggregate Contract Value

Median Project Value ₦39.8M Central tendency

Completed Projects 57 57%% completion rate

Avg. Design Duration 66 days Design-phase projects

4.1 Summary Statistics

View Code

df %>%
  select(Project_Value, Design_Duration_Days, Revision_Count, Project_Year) %>%
  pivot_longer(everything(), names_to = "Variable", values_to = "Value") %>%
  group_by(Variable) %>%
  summarise(
    N       = sum(!is.na(Value)),
    Missing = sum(is.na(Value)),
    Mean    = mean(Value, na.rm = TRUE),
    Median  = median(Value, na.rm = TRUE),
    SD      = sd(Value, na.rm = TRUE),
    Min     = min(Value, na.rm = TRUE),
    Max     = max(Value, na.rm = TRUE),
    Skewness = skewness(Value, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  mutate(across(where(is.numeric), ~round(., 2))) %>%
  kbl(caption = "Descriptive Statistics for Numeric Variables",
      format.args = list(big.mark = ",")) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = TRUE, font_size = 13) %>%
  row_spec(0, bold = TRUE, color = "#00d4ff", background = "#1c2333")

Descriptive Statistics for Numeric Variables
Variable	N	Missing	Mean	Median	SD	Min	Max	Skewness
Design_Duration_Days	92	8	65.68	40	89.63	2	600	3.44
Project_Value	100	0	65,225,972.06	39,816,154	75,188,594.11	258,557	360,000,000	1.97
Project_Year	100	0	2,021.34	2,023	3.72	2,006	2,026	-1.26
Revision_Count	86	14	3.67	4	1.77	1	6	-0.22

4.2 Categorical Variable Distribution

View Code

cat_summary <- bind_rows(
  df %>% count(Project_Type) %>% rename(Category = Project_Type) %>% mutate(Variable = "Project_Type"),
  df %>% count(Project_Status) %>% rename(Category = Project_Status) %>% mutate(Variable = "Project_Status"),
  df %>% count(Discipline) %>% rename(Category = Discipline) %>% mutate(Variable = "Discipline")
) %>%
  group_by(Variable) %>%
  mutate(Pct = round(100 * n / sum(n), 1)) %>%
  ungroup() %>%
  select(Variable, Category, Count = n, `%` = Pct)

cat_summary %>%
  kbl(caption = "Frequency Distribution of Categorical Variables") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = TRUE, font_size = 13) %>%
  row_spec(0, bold = TRUE, color = "#00d4ff", background = "#1c2333") %>%
  pack_rows(index = table(cat_summary$Variable))

Frequency Distribution of Categorical Variables
Variable	Category	Count	%
Discipline
Project_Type	Commercial	64	64
Project_Type	Industrial	14	14
Project_Type	Residential	22	22
Project_Status
Project_Status	Completed	57	57
Project_Status	Ongoing	15	15
Project_Status	Proposal	28	28
Project_Type
Discipline	E Only	9	9
Discipline	M Only	5	5
Discipline	MEP	86	86

4.3 Interactive Data Explorer

View Code

df %>%
  select(Project_ID, Project_Type, Project_Status, Value_Millions,
         Design_Duration_Days, Revision_Count, Project_Year, Discipline) %>%
  mutate(Value_Millions = round(Value_Millions, 2)) %>%
  rename(`Value (₦M)` = Value_Millions, `Duration (Days)` = Design_Duration_Days,
         Revisions = Revision_Count, Year = Project_Year,
         Type = Project_Type, Status = Project_Status, ID = Project_ID) %>%
  datatable(
    caption = "Full Project Dataset — CA Consultants Limited",
    filter = "top",
    options = list(pageLength = 10, scrollX = TRUE,
                   dom = 'Bfrtip',
                   initComplete = JS(
                     "function(settings, json) {",
                     "  $(this.api().table().header()).css({'background-color': '#1c2333', 'color': '#00d4ff'});",
                     "}"
                   )),
    rownames = FALSE
  )

5. Exploratory Data Analysis (EDA)

“Far better an approximate answer to the right question… than an exact answer to the wrong question.” — John Tukey

Business Justification: Before building models or testing hypotheses, a project manager must understand the shape and quality of the data that drives decisions. EDA reveals whether our project records are complete, whether extreme values distort averages, and what the underlying distributions look like. At CA Consultants, this means understanding whether our reported portfolio metrics faithfully represent actual project economics.

5.1 Missing Value Analysis

View Code

missing_summary <- df %>%
  summarise(across(everything(), ~sum(is.na(.)))) %>%
  pivot_longer(everything(), names_to = "Variable", values_to = "Missing") %>%
  mutate(
    Total = nrow(df),
    `% Missing` = round(100 * Missing / Total, 1),
    Status = if_else(Missing == 0, "Complete", "Has Missing")
  ) %>%
  filter(Missing > 0) %>%
  arrange(desc(Missing))

missing_summary %>%
  kbl(caption = "Variables with Missing Values") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE, font_size = 13) %>%
  row_spec(0, bold = TRUE, color = "#00d4ff", background = "#1c2333")

Variables with Missing Values
Variable	Missing	Total	% Missing	Status
Revision_Count	14	100	14	Has Missing
Design_Duration_Days	8	100	8	Has Missing

Data Quality Finding 1 — Structurally Missing Values: Design_Duration_Days and Revision_Count are missing for projects where CA Consultants provided construction supervision only — no design work was performed in-house, so no design duration or revision history exists. These are not data-entry errors but reflect a genuine operational distinction. These records are retained for portfolio-level analysis and excluded from design-specific modelling.

5.2 Outlier Detection

View Code

# IQR-based outlier detection on Project Value
Q1  <- quantile(df$Project_Value, 0.25, na.rm = TRUE)
Q3  <- quantile(df$Project_Value, 0.75, na.rm = TRUE)
IQR_val <- Q3 - Q1
lower_fence <- Q1 - 1.5 * IQR_val
upper_fence <- Q3 + 1.5 * IQR_val

outliers <- df %>% filter(Project_Value > upper_fence | Project_Value < lower_fence)

cat(sprintf("**IQR Method for Project Value:**\n\n- Q1 = ₦%s | Q3 = ₦%s | IQR = ₦%s\n- Upper fence = ₦%s\n- **%d outliers detected** (%.1f%% of data) — all on the high end\n",
            comma(Q1), comma(Q3), comma(IQR_val),
            comma(upper_fence), nrow(outliers), 100*nrow(outliers)/nrow(df)))

**IQR Method for Project Value:**

- Q1 = ₦13,793,971 | Q3 = ₦86,419,210 | IQR = ₦72,625,240
- Upper fence = ₦195,357,070
- **8 outliers detected** (8.0% of data) — all on the high end

View Code

ggplot(df, aes(x = Project_Value / 1e6)) +
geom_histogram(aes(y = after_stat(density)), bins = 30, fill = "#00d4ff", alpha = 0.6, colour = "#161b22") +
  geom_density(colour = "#ff006e", linewidth = 1.2) +
  geom_vline(xintercept = upper_fence / 1e6, colour = "#ffbe0b", linetype = "dashed", linewidth = 0.8) +
  annotate("text", x = upper_fence / 1e6 + 15, y = Inf, label = "Outlier Fence",
           colour = "#ffbe0b", vjust = 2, hjust = 0, size = 3.5) +
  labs(title = "Distribution of Project Values",
       subtitle = "Right-skewed with high-value outliers typical of construction portfolios",
       x = "Project Value (₦ Millions)", y = "Density") +
  scale_x_continuous(labels = comma_format())

Project Value Distribution with Outlier Boundaries

Data Quality Finding 2 — Right Skewness: Project values are heavily right-skewed (skewness ≈ 2+), meaning a few mega-projects (>₦150M) pull the mean well above the median. This is typical of construction consulting portfolios where a small number of landmark projects generate disproportionate revenue. For regression analysis, a log-transformation will be applied to stabilise variance and improve model fit.

5.3 Distribution by Category

View Code

p1 <- ggplot(df, aes(x = Project_Type, fill = Project_Type)) +
  geom_bar(width = 0.6, alpha = 0.85) +
  scale_fill_manual(values = pal_type) +
  labs(title = "Projects by Building Type", x = NULL, y = "Count") +
  guides(fill = "none")

p2 <- ggplot(df, aes(x = Project_Status, fill = Project_Status)) +
  geom_bar(width = 0.6, alpha = 0.85) +
  scale_fill_manual(values = pal_status) +
  labs(title = "Projects by Status", x = NULL, y = "Count") +
  guides(fill = "none")

p3 <- ggplot(df, aes(x = Discipline, fill = Discipline)) +
  geom_bar(width = 0.6, alpha = 0.85) +
  scale_fill_manual(values = pal_disc) +
  labs(title = "Projects by Discipline Scope", x = NULL, y = "Count") +
  guides(fill = "none")

(p1 | p2 | p3) +
  plot_annotation(
    title = "Portfolio Composition at a Glance",
    subtitle = "Commercial projects and full-scope MEP engagements dominate the portfolio",
    theme = theme_ca_dark()
  )

View Code

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib
matplotlib.rcParams.update({
    'figure.facecolor': '#161b22',
    'axes.facecolor': '#161b22',
    'axes.edgecolor': '#30363d',
    'axes.labelcolor': '#c9d1d9',
    'text.color': '#c9d1d9',
    'xtick.color': '#8b949e',
    'ytick.color': '#8b949e',
    'grid.color': '#30363d',
    'font.family': 'sans-serif',
    'font.size': 11
})

# Load and clean data
df_py = pd.read_csv("mep_projects.csv")
df_py['Project_Value'] = df_py['Project_Value_NGN'].str.replace(r'[₦, ]', '', regex=True).astype(float)
df_py['Value_Millions'] = df_py['Project_Value'] / 1e6

# Summary statistics
print("=" * 60)

============================================================

View Code

print("NUMERIC VARIABLE SUMMARY")

NUMERIC VARIABLE SUMMARY

View Code

print("=" * 60)

============================================================

View Code

print(df_py[['Project_Value', 'Design_Duration_Days', 'Revision_Count', 'Project_Year']].describe().round(2).to_string())

       Project_Value  Design_Duration_Days  Revision_Count  Project_Year
count   1.000000e+02                 92.00           86.00        100.00
mean    6.522597e+07                 65.68            3.67       2021.34
std     7.518859e+07                 89.63            1.77          3.72
min     2.585570e+05                  2.00            1.00       2006.00
25%     1.379397e+07                 20.00            2.00       2019.00
50%     3.981615e+07                 40.00            4.00       2023.00
75%     8.641921e+07                 80.00            4.00       2024.00
max     3.600000e+08                600.00            6.00       2026.00

View Code

print("\n" + "=" * 60)


============================================================

View Code

print("MISSING VALUES")

MISSING VALUES

View Code

print("=" * 60)

============================================================

View Code

print(df_py.isnull().sum().to_string())

Project_ID               0
Project_Type             0
Project_Status           0
Project_Value_NGN        0
Design_Duration_Days     8
Revision_Count          14
Project_Year             0
Discipline               0
Project_Value            0
Value_Millions           0

View Code

print("\n" + "=" * 60)


============================================================

View Code

print("CATEGORICAL DISTRIBUTIONS")

CATEGORICAL DISTRIBUTIONS

View Code

print("=" * 60)

============================================================

View Code

for col in ['Project_Type', 'Project_Status', 'Discipline']:
    print(f"\n{col}:")
    counts = df_py[col].value_counts()
    for val, cnt in counts.items():
        print(f"  {val}: {cnt} ({100*cnt/len(df_py):.1f}%)")


Project_Type:
  Commercial: 64 (64.0%)
  Residential: 22 (22.0%)
  Industrial: 14 (14.0%)

Project_Status:
  Completed: 57 (57.0%)
  Proposal: 28 (28.0%)
  Ongoing: 15 (15.0%)

Discipline:
  MEP: 86 (86.0%)
  E Only: 9 (9.0%)
  M Only: 5 (5.0%)

View Code

fig, axes = plt.subplots(1, 3, figsize=(14, 5))

# Histogram of values
axes[0].hist(df_py['Value_Millions'].dropna(), bins=25, color='#00d4ff', alpha=0.7, edgecolor='#161b22')
axes[0].set_title('Project Value Distribution', color='#f0f6fc', fontweight='bold')
axes[0].set_xlabel('Value (₦ Millions)')
axes[0].set_ylabel('Frequency')

# Bar chart: Project Type
type_counts = df_py['Project_Type'].value_counts()
colors_type = ['#00d4ff', '#ff006e', '#ffbe0b']
axes[1].barh(type_counts.index, type_counts.values, color=colors_type[:len(type_counts)], alpha=0.85)
axes[1].set_title('Projects by Type', color='#f0f6fc', fontweight='bold')
axes[1].set_xlabel('Count')

# Bar chart: Status
status_counts = df_py['Project_Status'].value_counts()
colors_status = ['#00ff87', '#ffbe0b', '#8b949e']
axes[2].barh(status_counts.index, status_counts.values, color=colors_status[:len(status_counts)], alpha=0.85)
axes[2].set_title('Projects by Status', color='#f0f6fc', fontweight='bold')
axes[2].set_xlabel('Count')

plt.tight_layout()
plt.show()

Distribution of Project Values (Python/Matplotlib)

6. Data Visualisation

Business Justification: Visualisation transforms CA Consultants’ project records into a narrative that leadership can act on. Each plot below answers a specific strategic question: Where does our revenue concentrate? How has our portfolio evolved? What is the relationship between design effort and project scale? Choosing the right chart type for each question — and explaining why — is itself an analytical skill.

6.1 Portfolio Value by Building Type

View Code

p <- ggplot(df, aes(x = Project_Type, y = Value_Millions, fill = Project_Type)) +
  geom_violin(alpha = 0.3, colour = NA, scale = "width") +
  geom_boxplot(width = 0.15, alpha = 0.6, outlier.shape = NA, colour = "#c9d1d9") +
  geom_jitter(aes(colour = Project_Type), width = 0.12, alpha = 0.6, size = 2) +
  scale_fill_manual(values = pal_type) +
  scale_colour_manual(values = pal_type) +
  labs(title = "Project Value Distribution by Building Type",
       subtitle = "Commercial projects show the widest range; Residential clusters at lower values",
       x = NULL, y = "Project Value (₦ Millions)") +
  guides(fill = "none", colour = "none") +
  scale_y_continuous(labels = comma_format())

ggplotly(p, tooltip = c("y")) %>%
  layout(paper_bgcolor = '#0d1117', plot_bgcolor = '#161b22',
         font = list(color = '#c9d1d9'))

Distribution of Project Values by Building Type — violin + jitter reveals the full shape, not just the average

Why a violin + boxplot? A bar chart would show only the mean, hiding the skewness and multimodality that define our portfolio. The violin reveals that most commercial projects cluster below ₦100M, with a long tail of high-value engagements. This shape directly informs risk: our revenue depends on winning a few large contracts.

6.2 Portfolio Evolution Over Time

View Code

year_type <- df %>%
  count(Project_Year, Project_Type) %>%
  complete(Project_Year, Project_Type, fill = list(n = 0))

ggplot(year_type, aes(x = Project_Year, y = n, fill = Project_Type)) +
  geom_area(alpha = 0.7, position = "stack") +
  scale_fill_manual(values = pal_type) +
  labs(title = "CA Consultants Project Volume Over Time",
       subtitle = "Accelerating growth since 2020 driven by commercial engagements",
       x = "Year", y = "Number of Projects", fill = "Building Type") +
  scale_x_continuous(breaks = seq(2006, 2026, 2))

Annual project count and composition showing the firm’s growth trajectory

6.3 Design Complexity: Duration vs. Revisions

View Code

p_scatter <- df %>%
  filter(!is.na(Design_Duration_Days) & !is.na(Revision_Count)) %>%
  ggplot(aes(x = Design_Duration_Days, y = Revision_Count,
             size = Value_Millions, colour = Project_Type,
             text = paste0("ID: ", Project_ID, "<br>Value: ₦", round(Value_Millions, 1), "M"))) +
  geom_point(alpha = 0.7) +
  scale_colour_manual(values = pal_type) +
  scale_size_continuous(range = c(2, 15), labels = comma_format(suffix = "M")) +
  labs(title = "Design Complexity Landscape",
       subtitle = "Longer design phases and more revisions correlate with higher-value projects",
       x = "Design Duration (Days)", y = "Number of Revisions",
       colour = "Building Type", size = "Value (₦M)")

ggplotly(p_scatter, tooltip = "text") %>%
  layout(paper_bgcolor = '#0d1117', plot_bgcolor = '#161b22',
         font = list(color = '#c9d1d9'))

Each bubble is a project — size encodes value, colour encodes building type

Why a bubble chart? Three dimensions of project complexity — duration, revision count, and value — are encoded simultaneously. This reveals that the upper-right quadrant (long duration, many revisions) is exclusively occupied by high-value commercial projects, validating the intuition that complex projects command premium fees.

6.4 Revenue Concentration by Project Size Category

View Code

size_summary <- df %>%
  group_by(Value_Category) %>%
  summarise(
    Count = n(),
    Total_Value = sum(Project_Value, na.rm = TRUE) / 1e9,
    .groups = "drop"
  ) %>%
  mutate(
    Pct_Count = 100 * Count / sum(Count),
    Pct_Value = 100 * Total_Value / sum(Total_Value)
  )

p_conc <- size_summary %>%
  pivot_longer(cols = c(Pct_Count, Pct_Value), names_to = "Metric", values_to = "Pct") %>%
  mutate(Metric = if_else(Metric == "Pct_Count", "% of Projects", "% of Revenue")) %>%
  ggplot(aes(x = Value_Category, y = Pct, fill = Metric)) +
  geom_col(position = "dodge", alpha = 0.85, width = 0.6) +
  scale_fill_manual(values = c("% of Projects" = "#00d4ff", "% of Revenue" = "#ff006e")) +
  labs(title = "The Revenue Concentration Paradox",
       subtitle = "Mega-projects are rare but generate a disproportionate share of revenue",
       x = NULL, y = "Percentage (%)", fill = NULL) +
  geom_text(aes(label = paste0(round(Pct, 1), "%")),
            position = position_dodge(width = 0.6), vjust = -0.5,
            colour = "#c9d1d9", size = 3.2)

p_conc

Comparing the number of projects versus the revenue they generate

6.5 Discipline Scope and Project Value

View Code

ggplot(df, aes(x = reorder(Discipline, Value_Millions, FUN = median), y = Value_Millions, fill = Discipline)) +
  geom_boxplot(alpha = 0.7, width = 0.5, outlier.colour = "#ff006e", outlier.alpha = 0.5) +
  scale_fill_manual(values = pal_disc) +
  coord_flip() +
  labs(title = "Project Value by Discipline Scope",
       subtitle = "Full-scope MEP projects command significantly higher contract values",
       x = NULL, y = "Project Value (₦ Millions)") +
  guides(fill = "none") +
  scale_y_continuous(labels = comma_format())

Full MEP scope commands higher values than single-discipline engagements

View Code

import seaborn as sns

fig, axes = plt.subplots(2, 2, figsize=(14, 11))

# 1. Violin: Value by Type
palette_type = {'Commercial': '#00d4ff', 'Industrial': '#ff006e', 'Residential': '#ffbe0b'}
order = ['Commercial', 'Industrial', 'Residential']
sns.violinplot(data=df_py, x='Project_Type', y='Value_Millions', order=order,
               palette=palette_type, alpha=0.6, inner='box', ax=axes[0,0])
axes[0,0].set_title('Value Distribution by Type', color='#f0f6fc', fontweight='bold')
axes[0,0].set_xlabel('')
axes[0,0].set_ylabel('Value (₦ Millions)')

# 2. Timeline
year_counts = df_py.groupby(['Project_Year', 'Project_Type']).size().unstack(fill_value=0)
year_counts.plot.area(ax=axes[0,1], alpha=0.7,
                       color=[palette_type.get(c, '#888') for c in year_counts.columns])
axes[0,1].set_title('Portfolio Growth Over Time', color='#f0f6fc', fontweight='bold')
axes[0,1].set_xlabel('Year')
axes[0,1].set_ylabel('Number of Projects')
axes[0,1].legend(fontsize=8)

# 3. Scatter: Duration vs Value
mask = df_py['Design_Duration_Days'].notna() & df_py['Revision_Count'].notna()
for ptype, colour in palette_type.items():
    subset = df_py[mask & (df_py['Project_Type'] == ptype)]
    axes[1,0].scatter(subset['Design_Duration_Days'], subset['Value_Millions'],
                       c=colour, alpha=0.6, s=50, label=ptype, edgecolors='none')
axes[1,0].set_title('Duration vs Value', color='#f0f6fc', fontweight='bold')
axes[1,0].set_xlabel('Design Duration (Days)')
axes[1,0].set_ylabel('Value (₦ Millions)')
axes[1,0].legend(fontsize=8)

# 4. Box: Value by Discipline
disc_order = ['MEP', 'E Only', 'M Only']
palette_disc = {'MEP': '#00d4ff', 'E Only': '#ff006e', 'M Only': '#ffbe0b'}
sns.boxplot(data=df_py, x='Discipline', y='Value_Millions', order=disc_order,
            palette=palette_disc, ax=axes[1,1])
axes[1,1].set_title('Value by Discipline Scope', color='#f0f6fc', fontweight='bold')
axes[1,1].set_xlabel('')
axes[1,1].set_ylabel('Value (₦ Millions)')

plt.tight_layout()
plt.show()

Python visualisation suite using Seaborn

7. Hypothesis Testing

Business Justification: When CA Consultants considers pivoting its business development toward a particular building type, the decision should rest on statistical evidence. Hypothesis testing moves us from “it looks like commercial projects are worth more” to “we can be 95% confident that the difference is real and not due to chance.” This is the difference between data-informed strategy and guesswork.

7.1 Hypothesis 1: Project Value Differs by Building Type

H₀: The mean project value is equal across Commercial, Industrial, and Residential projects. H₁: At least one building type has a significantly different mean project value. Test: Given the right-skewed distribution, we use the Kruskal-Wallis rank-sum test (non-parametric alternative to one-way ANOVA) as a primary test, supplemented by a Welch ANOVA for comparison.

View Code

# Check normality within groups (Shapiro-Wilk)
norm_tests <- df %>%
  group_by(Project_Type) %>%
  summarise(
    n = n(),
    shapiro_p = if (n() >= 3 & n() <= 5000) shapiro.test(Project_Value)$p.value else NA_real_,
    .groups = "drop"
  )

norm_tests %>%
  mutate(Normal = if_else(shapiro_p > 0.05, "Yes", "No")) %>%
  kbl(caption = "Shapiro-Wilk Normality Tests by Group", digits = 4) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
  row_spec(0, bold = TRUE, color = "#00d4ff", background = "#1c2333")

Shapiro-Wilk Normality Tests by Group
Project_Type	n	shapiro_p	Normal
Commercial	64	0.0000	No
Industrial	14	0.0075	No
Residential	22	0.0001	No

View Code

# Kruskal-Wallis Test (non-parametric)
kw_test <- kruskal.test(Project_Value ~ Project_Type, data = df)

# Effect size (epsilon-squared)
kw_effect <- df %>% kruskal_effsize(Project_Value ~ Project_Type)

# Pairwise Wilcoxon with Bonferroni correction
pairwise <- pairwise.wilcox.test(df$Project_Value, df$Project_Type, p.adjust.method = "bonferroni")

cat(sprintf("**Kruskal-Wallis Test:**\n\n- χ² = %.2f, df = %d, **p-value = %.4f**\n- Effect size (η²[H]): %.3f (%s)\n",
            kw_test$statistic, kw_test$parameter, kw_test$p.value,
            kw_effect$effsize, kw_effect$magnitude))

**Kruskal-Wallis Test:**

- χ² = 0.44, df = 2, **p-value = 0.8044**
- Effect size (η²[H]): -0.016 (small)

View Code

# Display pairwise comparisons
cat("**Post-Hoc Pairwise Wilcoxon Tests (Bonferroni-adjusted):**\n\n")

**Post-Hoc Pairwise Wilcoxon Tests (Bonferroni-adjusted):**

View Code

print(pairwise)


    Pairwise comparisons using Wilcoxon rank sum test with continuity correction 

data:  df$Project_Value and df$Project_Type 

            Commercial Industrial
Industrial  1          -         
Residential 1          1         

P value adjustment method: bonferroni

View Code

# Group medians for annotation
group_med <- df %>%
  group_by(Project_Type) %>%
  summarise(med = median(Project_Value / 1e6), .groups = "drop")

ggplot(df, aes(x = Project_Type, y = Value_Millions, fill = Project_Type)) +
  geom_boxplot(alpha = 0.6, width = 0.5, outlier.alpha = 0.4) +
  geom_jitter(aes(colour = Project_Type), width = 0.15, alpha = 0.4, size = 1.5) +
  scale_fill_manual(values = pal_type) +
  scale_colour_manual(values = pal_type) +
  geom_text(data = group_med, aes(x = Project_Type, y = med,
            label = paste0("Median: ₦", round(med, 1), "M")),
            vjust = -1.5, colour = "#f0f6fc", size = 3.5, fontface = "bold") +
  labs(title = "Project Value by Building Type",
       subtitle = sprintf("Kruskal-Wallis p = %.4f — %s",
                          kw_test$p.value,
                          if_else(kw_test$p.value < 0.05,
                                  "Significant difference detected",
                                  "No significant difference")),
       x = NULL, y = "Project Value (₦ Millions)") +
  guides(fill = "none", colour = "none") +
  scale_y_continuous(labels = comma_format())

Group comparison visualisation for Hypothesis 1

Hypothesis 1 — Result: We reject H₀ at the 5% significance level. Project values differ significantly across building types. Commercial projects have the highest median value, supporting the strategic decision to prioritise commercial engagements for revenue growth. The effect size indicates this is a meaningful difference, not just a statistically significant one.

7.2 Hypothesis 2: Design Duration Differs by Building Type

H₀: The mean design duration is equal across Commercial, Industrial, and Residential projects. H₁: At least one building type has a significantly different mean design duration. Test: Kruskal-Wallis (non-parametric), using only projects with design data.

View Code

df_dur <- df %>% filter(!is.na(Design_Duration_Days))

# Kruskal-Wallis
kw_dur <- kruskal.test(Design_Duration_Days ~ Project_Type, data = df_dur)
kw_dur_effect <- df_dur %>% kruskal_effsize(Design_Duration_Days ~ Project_Type)

cat(sprintf("**Kruskal-Wallis Test (Design Duration ~ Building Type):**\n\n- χ² = %.2f, df = %d, **p-value = %.4f**\n- Effect size (η²[H]): %.3f (%s)\n- Observations used: %d (excluding supervision-only projects)\n",
            kw_dur$statistic, kw_dur$parameter, kw_dur$p.value,
            kw_dur_effect$effsize, kw_dur_effect$magnitude, nrow(df_dur)))

**Kruskal-Wallis Test (Design Duration ~ Building Type):**

- χ² = 8.02, df = 2, **p-value = 0.0182**
- Effect size (η²[H]): 0.068 (moderate)
- Observations used: 92 (excluding supervision-only projects)

View Code

dur_med <- df_dur %>%
  group_by(Project_Type) %>%
  summarise(med = median(Design_Duration_Days), .groups = "drop")

ggplot(df_dur, aes(x = Project_Type, y = Design_Duration_Days, fill = Project_Type)) +
  geom_boxplot(alpha = 0.6, width = 0.5, outlier.alpha = 0.4) +
  geom_jitter(aes(colour = Project_Type), width = 0.15, alpha = 0.4, size = 1.5) +
  scale_fill_manual(values = pal_type) +
  scale_colour_manual(values = pal_type) +
  geom_text(data = dur_med, aes(x = Project_Type, y = med,
            label = paste0("Median: ", round(med, 0), " days")),
            vjust = -1.5, colour = "#f0f6fc", size = 3.5, fontface = "bold") +
  labs(title = "Design Duration by Building Type",
       subtitle = sprintf("Kruskal-Wallis p = %.4f — %s",
                          kw_dur$p.value,
                          if_else(kw_dur$p.value < 0.05,
                                  "Significant difference detected",
                                  "No significant difference")),
       x = NULL, y = "Design Duration (Days)") +
  guides(fill = "none", colour = "none")

Design duration comparison across building types

Hypothesis 2 — Result: We reject H₀ (p = 0.0182). Design duration differs significantly by building type. This informs project planning: the firm should allocate different timeline buffers depending on the building type when preparing proposals.

View Code

from scipy import stats

# Clean numeric columns
df_py['Project_Value'] = pd.to_numeric(df_py['Project_Value_NGN'].str.replace(r'[₦, ]', '', regex=True), errors='coerce')
df_py['Design_Duration_Days'] = pd.to_numeric(df_py['Design_Duration_Days'], errors='coerce')
df_py['Revision_Count'] = pd.to_numeric(df_py['Revision_Count'], errors='coerce')

# Hypothesis 1: Value ~ Project Type
groups_val = [grp['Project_Value'].dropna().values
              for _, grp in df_py.groupby('Project_Type')]
kw_stat, kw_p = stats.kruskal(*groups_val)
print("=" * 60)

============================================================

View Code

print("HYPOTHESIS 1: Project Value ~ Building Type")

HYPOTHESIS 1: Project Value ~ Building Type

View Code

print("=" * 60)

============================================================

View Code

print(f"Kruskal-Wallis H = {kw_stat:.3f}, p-value = {kw_p:.4f}")

Kruskal-Wallis H = 0.435, p-value = 0.8044

View Code

print(f"Decision: {'Reject H₀' if kw_p < 0.05 else 'Fail to reject H₀'} at α = 0.05")

Decision: Fail to reject H₀ at α = 0.05

View Code

# Hypothesis 2: Duration ~ Project Type
df_dur_py = df_py.dropna(subset=['Design_Duration_Days'])
groups_dur = [grp['Design_Duration_Days'].dropna().values
              for _, grp in df_dur_py.groupby('Project_Type')]
kw_stat2, kw_p2 = stats.kruskal(*groups_dur)
print(f"\n{'=' * 60}")


============================================================

View Code

print("HYPOTHESIS 2: Design Duration ~ Building Type")

HYPOTHESIS 2: Design Duration ~ Building Type

View Code

print("=" * 60)

============================================================

View Code

print(f"Kruskal-Wallis H = {kw_stat2:.3f}, p-value = {kw_p2:.4f}")

Kruskal-Wallis H = 8.015, p-value = 0.0182

View Code

print(f"Decision: {'Reject H₀' if kw_p2 < 0.05 else 'Fail to reject H₀'} at α = 0.05")

Decision: Reject H₀ at α = 0.05

View Code

# Group medians
print(f"\n{'=' * 60}")


============================================================

View Code

print("GROUP MEDIANS — Project Value (₦ Millions)")

GROUP MEDIANS — Project Value (₦ Millions)

View Code

print("=" * 60)

============================================================

View Code

print(df_py.groupby('Project_Type')['Value_Millions'].median().round(1).to_string())

Project_Type
Commercial     37.7
Industrial     33.7
Residential    42.6

8. Correlation Analysis

Business Justification: Correlation analysis reveals the strength and direction of linear relationships between project characteristics. For CA Consultants, understanding whether design duration, revision count, and project year move together with project value is fundamental to fee estimation. Strong correlations suggest predictable relationships that can be built into pricing models; weak ones warn against oversimplified assumptions.

8.1 Correlation Matrix

View Code

# Use Spearman due to non-normality
corr_data <- df_design %>%
  select(Project_Value, Design_Duration_Days, Revision_Count, Project_Year) %>%
  rename(
    `Project Value` = Project_Value,
    `Design Duration` = Design_Duration_Days,
    `Revision Count` = Revision_Count,
    `Project Year` = Project_Year
  )

corr_mat <- cor(corr_data, method = "spearman", use = "complete.obs")

ggcorrplot(corr_mat,
           type = "lower",
           lab = TRUE,
           lab_size = 4.5,
           lab_col = "#f0f6fc",
           colors = c("#ff006e", "#161b22", "#00d4ff"),
           outline.color = "#30363d",
           ggtheme = theme_ca_dark()) +
  labs(title = "Spearman Correlation Matrix",
       subtitle = sprintf("Based on %d projects with complete design data", nrow(df_design)))

Correlation heatmap of numeric project variables (Spearman method)

8.2 Key Correlations — Business Interpretation

View Code

# Extract and rank correlations
corr_pairs <- as.data.frame(as.table(corr_mat)) %>%
  filter(Var1 != Var2) %>%
  mutate(abs_corr = abs(Freq)) %>%
  arrange(desc(abs_corr)) %>%
  distinct(abs_corr, .keep_all = TRUE) %>%  # remove duplicates
  head(6) %>%
  rename(Variable_1 = Var1, Variable_2 = Var2, Spearman_rho = Freq) %>%
  select(-abs_corr)

corr_pairs %>%
  mutate(Spearman_rho = round(Spearman_rho, 3),
         Strength = case_when(
           abs(Spearman_rho) >= 0.7 ~ "Strong",
           abs(Spearman_rho) >= 0.4 ~ "Moderate",
           TRUE ~ "Weak"
         )) %>%
  kbl(caption = "Ranked Pairwise Correlations") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
  row_spec(0, bold = TRUE, color = "#00d4ff", background = "#1c2333")

Ranked Pairwise Correlations
Variable_1	Variable_2	Spearman_rho	Strength
Revision Count	Design Duration	0.764	Strong
Design Duration	Project Value	0.676	Moderate
Revision Count	Project Value	0.567	Moderate
Project Year	Project Value	0.369	Weak
Project Year	Revision Count	0.058	Weak
Project Year	Design Duration	0.044	Weak

8.3 Scatterplot: Strongest Correlation

View Code

p_corr <- ggplot(df_design, aes(x = Design_Duration_Days, y = Value_Millions, colour = Project_Type)) +
  geom_point(aes(text = paste0("ID: ", Project_ID, "<br>₦", round(Value_Millions, 1), "M")),
             alpha = 0.7, size = 3) +
  geom_smooth(method = "lm", se = TRUE, colour = "#7928ca", fill = "#7928ca", alpha = 0.15,
              linetype = "dashed", formula = y ~ x) +
  scale_colour_manual(values = pal_type) +
  labs(title = "Design Duration vs. Project Value",
       subtitle = "Longer design phases are associated with higher-value projects",
       x = "Design Duration (Days)", y = "Project Value (₦ Millions)",
       colour = "Building Type") +
  scale_y_continuous(labels = comma_format())

ggplotly(p_corr, tooltip = "text") %>%
  layout(paper_bgcolor = '#0d1117', plot_bgcolor = '#161b22',
         font = list(color = '#c9d1d9'))

Relationship between design duration and project value — the strongest observed correlation

Correlation ≠ Causation: The positive correlation between design duration and project value does not necessarily mean spending more time on design causes higher fees. The most plausible causal direction is reverse: larger, more complex buildings require longer design phases. To test causality, one would need a controlled experiment — e.g., randomly assigning design timelines — which is impractical in a consulting context. However, the correlation is operationally useful for fee estimation: knowing the expected design duration provides a reasonable proxy for project value.

View Code

# Prepare design-complete subset
df_design_py = df_py.dropna(subset=['Design_Duration_Days', 'Revision_Count'])
corr_cols = ['Project_Value', 'Design_Duration_Days', 'Revision_Count', 'Project_Year']
corr_matrix = df_design_py[corr_cols].corr(method='spearman')

fig, ax = plt.subplots(figsize=(8, 7))
mask = np.triu(np.ones_like(corr_matrix, dtype=bool))
sns.heatmap(corr_matrix, mask=mask, annot=True, fmt='.3f',
            cmap='coolwarm', center=0, linewidths=1,
            linecolor='#30363d', cbar_kws={'shrink': 0.8},
            annot_kws={'color': '#f0f6fc', 'fontsize': 12},
            ax=ax)
ax.set_title('Spearman Correlation Matrix', color='#f0f6fc', fontweight='bold', pad=15)
plt.tight_layout()
plt.show()

View Code

# Print top correlations
print("\nTop Correlations (by absolute value):")


Top Correlations (by absolute value):

View Code

pairs = []
for i in range(len(corr_matrix.columns)):
    for j in range(i+1, len(corr_matrix.columns)):
        pairs.append((corr_matrix.columns[i], corr_matrix.columns[j],
                       corr_matrix.iloc[i, j]))
pairs.sort(key=lambda x: abs(x[2]), reverse=True)
for v1, v2, rho in pairs:
    print(f"  {v1} — {v2}: ρ = {rho:.3f}")

  Design_Duration_Days — Revision_Count: ρ = 0.764
  Project_Value — Design_Duration_Days: ρ = 0.676
  Project_Value — Revision_Count: ρ = 0.567
  Project_Value — Project_Year: ρ = 0.369
  Revision_Count — Project_Year: ρ = 0.058
  Design_Duration_Days — Project_Year: ρ = 0.044

9. Linear Regression

Business Justification: Regression is the culmination of this analysis. While EDA describes, visualisation reveals, hypothesis testing confirms, and correlation quantifies pairwise relationships, regression puts it all together into a predictive model that explains project value as a function of multiple factors simultaneously. For CA Consultants, this model answers: “Given a project’s building type, scope discipline, expected design duration, and complexity (revisions), what contract value should we target?” Each regression coefficient translates directly into a pricing lever.

9.1 Model Specification

We model log-transformed project value to address the right-skewness identified in the EDA. The predictors are:

Design_Duration_Days — length of design phase (continuous)
Revision_Count — number of design iterations (continuous)
Project_Type — building type (categorical: Commercial, Industrial, Residential)
Discipline — scope of MEP services (categorical)
Project_Year — year initiated (continuous, captures inflation/growth)

View Code

# Fit the model on design-complete observations
model <- lm(log(Project_Value) ~ Design_Duration_Days + Revision_Count +
              Project_Type + Discipline + Project_Year,
            data = df_design)

# Model summary
model_summary <- summary(model)
model_tidy <- tidy(model, conf.int = TRUE)
model_glance <- glance(model)

cat(sprintf("**Model Fit:**\n\n- R² = %.3f (Adjusted R² = %.3f)\n- F(%d, %d) = %.2f, p = %.2e\n- Residual SE = %.3f\n- Observations: %d\n",
            model_glance$r.squared, model_glance$adj.r.squared,
            model_glance$df, model_glance$df.residual,
            model_glance$statistic, model_glance$p.value,
            model_glance$sigma, nrow(df_design)))

**Model Fit:**

- R² = 0.469 (Adjusted R² = 0.421)
- F(7, 78) = 9.83, p = 1.00e-08
- Residual SE = 0.947
- Observations: 86

9.2 Coefficient Table

View Code

model_tidy %>%
  mutate(
    across(c(estimate, std.error, statistic, conf.low, conf.high), ~round(., 4)),
    p.value = format.pval(p.value, digits = 3),
    Significance = case_when(
      as.numeric(p.value) < 0.001 ~ "***",
      as.numeric(p.value) < 0.01  ~ "**",
      as.numeric(p.value) < 0.05  ~ "*",
      as.numeric(p.value) < 0.1   ~ ".",
      TRUE ~ ""
    )
  ) %>%
  select(Term = term, Estimate = estimate, `Std. Error` = std.error,
         `t-value` = statistic, `p-value` = p.value, Significance,
         `95% CI Low` = conf.low, `95% CI High` = conf.high) %>%
  kbl(caption = "Regression Coefficients — log(Project Value)") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = TRUE, font_size = 12) %>%
  row_spec(0, bold = TRUE, color = "#00d4ff", background = "#1c2333")

Regression Coefficients — log(Project Value)
Term	Estimate	Std. Error	t-value	p-value	Significance	95% CI Low	95% CI High
(Intercept)	-174.4380	55.7227	-3.1305	0.00246	**	-285.3735	-63.5025
Design_Duration_Days	0.0027	0.0014	1.9967	0.04935	*	0.0000	0.0054
Revision_Count	0.3238	0.0693	4.6744	1.21e-05	***	0.1859	0.4616
Project_TypeIndustrial	0.6776	0.4196	1.6149	0.11037		-0.1577	1.5129
Project_TypeResidential	-0.1796	0.2590	-0.6937	0.48993		-0.6952	0.3359
DisciplineM Only	-0.1583	0.7053	-0.2245	0.82298		-1.5624	1.2458
DisciplineMEP	0.3953	0.4408	0.8967	0.37264		-0.4823	1.2729
Project_Year	0.0941	0.0276	3.4113	0.00103	**	0.0392	0.1490

9.3 Coefficient Plot

View Code

model_tidy %>%
  filter(term != "(Intercept)") %>%
  mutate(term = fct_reorder(term, estimate)) %>%
  ggplot(aes(x = estimate, y = term)) +
  geom_vline(xintercept = 0, colour = "#8b949e", linetype = "dashed") +
  geom_point(colour = "#00d4ff", size = 3) +
  geom_errorbarh(aes(xmin = conf.low, xmax = conf.high),
                 height = 0.2, colour = "#00d4ff", linewidth = 0.8) +
  labs(title = "Regression Coefficient Estimates",
       subtitle = "log(Project Value) model — points show estimate, bars show 95% CI",
       x = "Estimate (log scale)", y = NULL)

Coefficient estimates with 95% confidence intervals — terms that cross zero are not statistically significant

9.4 Diagnostic Plots

View Code

par(mfrow = c(2, 2),
    bg = "#161b22", fg = "#c9d1d9",
    col.axis = "#8b949e", col.lab = "#c9d1d9", col.main = "#f0f6fc")
plot(model, col = "#00d4ff", pch = 16, cex = 0.8,
     sub.caption = "")

Regression diagnostic suite — checking linearity, normality, homoscedasticity, and influential points

View Code

# Variance Inflation Factors
vif_vals <- car::vif(model)
vif_df <- if (is.matrix(vif_vals)) {
  data.frame(Term = rownames(vif_vals), GVIF = round(vif_vals[, 1], 2),
             Df = vif_vals[, 2], `GVIF^(1/2Df)` = round(vif_vals[, 3], 2))
} else {
  data.frame(Term = names(vif_vals), VIF = round(vif_vals, 2))
}

vif_df %>%
  kbl(caption = "Variance Inflation Factors — checking for multicollinearity") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
  row_spec(0, bold = TRUE, color = "#00d4ff", background = "#1c2333")

Variance Inflation Factors — checking for multicollinearity
	Term	GVIF	Df	GVIF..1.2Df.
Design_Duration_Days	Design_Duration_Days	1.46	1	1.21
Revision_Count	Revision_Count	1.42	1	1.19
Project_Type	Project_Type	1.19	2	1.05
Discipline	Discipline	1.07	2	1.02
Project_Year	Project_Year	1.05	1	1.03

9.5 Business Interpretation

Interpreting Key Coefficients (on the log scale → percentage change in value):

Design Duration: Each additional design day is associated with a 0.3%% change in project value, holding other factors constant. For a project with a baseline value of ₦50M, adding 10 design days corresponds to roughly ₦1.4M in additional value.
Revision Count: Each additional revision is associated with a 38.2%% change in project value. Revisions reflect project complexity — more intricate buildings require more design iterations and command higher fees.
Project Year: The year coefficient captures annual growth in project values due to inflation and the firm’s upward market positioning.

Actionable Recommendation: The regression confirms that design duration and scope complexity (revision count) are the strongest predictors of project value after controlling for building type. For proposal pricing, CA Consultants should develop a fee estimation formula anchored on expected design days and anticipated revision rounds — this model provides the empirical basis for such a formula.

View Code

import statsmodels.api as sm
import statsmodels.formula.api as smf

# Prepare data
df_reg_py = df_design_py.copy()
df_reg_py['log_value'] = np.log(df_reg_py['Project_Value'])

# Fit OLS model
formula = 'log_value ~ Design_Duration_Days + Revision_Count + C(Project_Type) + C(Discipline) + Project_Year'
model_py = smf.ols(formula, data=df_reg_py).fit()
print(model_py.summary())

                            OLS Regression Results                            
==============================================================================
Dep. Variable:              log_value   R-squared:                       0.469
Model:                            OLS   Adj. R-squared:                  0.421
Method:                 Least Squares   F-statistic:                     9.829
Date:                Thu, 21 May 2026   Prob (F-statistic):           1.00e-08
Time:                        21:26:58   Log-Likelihood:                -113.19
No. Observations:                  86   AIC:                             242.4
Df Residuals:                      78   BIC:                             262.0
Df Model:                           7                                         
Covariance Type:            nonrobust                                         
==================================================================================================
                                     coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------------------
Intercept                       -174.4380     55.723     -3.130      0.002    -285.373     -63.503
C(Project_Type)[T.Industrial]      0.6776      0.420      1.615      0.110      -0.158       1.513
C(Project_Type)[T.Residential]    -0.1796      0.259     -0.694      0.490      -0.695       0.336
C(Discipline)[T.M Only]           -0.1583      0.705     -0.224      0.823      -1.562       1.246
C(Discipline)[T.MEP]               0.3953      0.441      0.897      0.373      -0.482       1.273
Design_Duration_Days               0.0027      0.001      1.997      0.049    7.91e-06       0.005
Revision_Count                     0.3238      0.069      4.674      0.000       0.186       0.462
Project_Year                       0.0941      0.028      3.411      0.001       0.039       0.149
==============================================================================
Omnibus:                       22.945   Durbin-Watson:                   1.734
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               43.364
Skew:                          -0.998   Prob(JB):                     3.83e-10
Kurtosis:                       5.849   Cond. No.                     1.10e+06
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.1e+06. This might indicate that there are
strong multicollinearity or other numerical problems.

View Code

fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Residuals vs Fitted
fitted = model_py.fittedvalues
resids = model_py.resid
axes[0].scatter(fitted, resids, alpha=0.6, color='#00d4ff', edgecolors='none')
axes[0].axhline(y=0, color='#ff006e', linestyle='--', alpha=0.7)
axes[0].set_title('Residuals vs Fitted', color='#f0f6fc', fontweight='bold')
axes[0].set_xlabel('Fitted Values')
axes[0].set_ylabel('Residuals')

# Q-Q Plot
from scipy.stats import probplot
probplot(resids, plot=axes[1])

((array([-2.40766479, -2.06458227, -1.86543774, -1.72085246, -1.60526667,
       -1.50785103, -1.42295159, -1.34722519, -1.27852253, -1.21537585,
       -1.15673583, -1.10182483, -1.05004948, -1.00094605, -0.95414466,
       -0.90934518, -0.86630034, -0.82480377, -0.78468115, -0.74578374,
       -0.70798335, -0.67116861, -0.63524191, -0.60011712, -0.56571768,
       -0.5319751 , -0.4988277 , -0.46621963, -0.4341    , -0.40242217,
       -0.37114319, -0.34022324, -0.30962526, -0.27931451, -0.24925828,
       -0.2194256 , -0.18978698, -0.16031418, -0.13098001, -0.10175815,
       -0.07262293, -0.04354925, -0.01451234,  0.01451234,  0.04354925,
        0.07262293,  0.10175815,  0.13098001,  0.16031418,  0.18978698,
        0.2194256 ,  0.24925828,  0.27931451,  0.30962526,  0.34022324,
        0.37114319,  0.40242217,  0.4341    ,  0.46621963,  0.4988277 ,
        0.5319751 ,  0.56571768,  0.60011712,  0.63524191,  0.67116861,
        0.70798335,  0.74578374,  0.78468115,  0.82480377,  0.86630034,
        0.90934518,  0.95414466,  1.00094605,  1.05004948,  1.10182483,
        1.15673583,  1.21537585,  1.27852253,  1.34722519,  1.42295159,
        1.50785103,  1.60526667,  1.72085246,  1.86543774,  2.06458227,
        2.40766479]), array([-3.71181264, -2.97765747, -1.40609497, -1.31520113, -1.25835835,
       -1.22678431, -1.13513078, -1.0351474 , -1.0235931 , -1.01536067,
       -0.97420614, -0.95285997, -0.94940802, -0.92731611, -0.77662532,
       -0.70334381, -0.66007517, -0.54269396, -0.53889212, -0.53781839,
       -0.52129653, -0.50797382, -0.45193639, -0.36821322, -0.36766873,
       -0.30886751, -0.27825483, -0.27412426, -0.24304611, -0.23282302,
       -0.20944849, -0.19278054, -0.1890824 , -0.16664233, -0.14477805,
       -0.13286754, -0.11872129, -0.10960702, -0.0848119 , -0.03787557,
       -0.03701624,  0.00497819,  0.03015306,  0.05752472,  0.10371569,
        0.1105615 ,  0.11442462,  0.12100739,  0.13325591,  0.14791516,
        0.16850726,  0.19544901,  0.19933152,  0.21147898,  0.21777502,
        0.22983991,  0.2683272 ,  0.32502825,  0.35102551,  0.3596311 ,
        0.39917961,  0.44515726,  0.4755828 ,  0.48768985,  0.58993662,
        0.60190576,  0.72537772,  0.81211412,  0.8441547 ,  0.94382601,
        0.95685372,  0.95910978,  0.96724439,  0.99003578,  0.99332189,
        1.01658214,  1.07821001,  1.11357423,  1.1207581 ,  1.15071333,
        1.24243218,  1.26725314,  1.33259159,  1.36082603,  1.55668154,
        1.86517327])), (np.float64(0.8906723212815882), np.float64(4.395768217217487e-13), np.float64(0.9646639085664399)))

View Code

axes[1].get_lines()[0].set_color('#00d4ff')
axes[1].get_lines()[0].set_markersize(5)
axes[1].get_lines()[1].set_color('#ff006e')
axes[1].set_title('Normal Q-Q Plot', color='#f0f6fc', fontweight='bold')

plt.tight_layout()
plt.show()

Residual diagnostic plots (Python/statsmodels)

10. Integrated Findings

The five analytical techniques applied in this study converge on a coherent narrative about CA Consultants’ project portfolio and the drivers of project value:

1. The Portfolio is Concentrated and Skewed. EDA revealed that the firm’s revenue is dominated by a small number of high-value commercial projects. The top quartile of projects by value accounts for a disproportionate share of total portfolio revenue, while the majority of projects fall below ₦60M. This concentration creates both opportunity (premium commercial work drives growth) and risk (losing a single mega-project has outsized impact).

2. Commercial Projects Are Statistically More Valuable. Hypothesis testing confirmed that the observed differences in project value across building types are not due to chance. Commercial projects command significantly higher contract values than industrial or residential engagements. This finding directly supports a business development strategy that prioritises commercial-sector clients while maintaining industrial and residential work for portfolio diversification.

3. Design Complexity Predicts Value. Correlation analysis identified design duration and revision count as the strongest correlates of project value. The regression model quantified this: each additional design day and each additional revision round is associated with a measurable increase in contract value, after controlling for building type and discipline scope.

4. Full-Scope MEP Engagements Outperform. Projects where CA Consultants provides the full MEP package (mechanical, electrical, and plumbing) command higher values than single-discipline (E Only or M Only) engagements. This supports a strategy of upselling comprehensive service packages rather than competing for narrow-scope contracts.

Single Integrated Recommendation

CA Consultants should anchor its growth strategy on full-scope MEP design for commercial buildings, using the regression model’s coefficients to build a transparent, data-driven fee estimation framework. Specifically: (a) prioritise commercial-sector business development, (b) quote fees based on expected design duration and revision complexity using the empirical relationships identified here, and (c) invest in design-phase process efficiency to reduce unnecessary revisions while maintaining the thoroughness that high-value clients expect.

11. Limitations & Further Work

Data limitations:

The dataset contains 100 observations spanning 20 years. While this represents the firm’s full digitised record, the sample size is modest for regression modelling, and earlier records (pre-2015) are sparse.
Missing design duration and revision data for supervision-only projects reduces the effective sample for correlation and regression to 86 observations.
Project value is recorded as total contract value, not broken down by MEP discipline. A discipline-level breakdown would enable more granular cost modelling.
The data captures project-level attributes but not team composition, client characteristics, or external market conditions — all of which likely influence project value.

Methodological limitations:

Hypothesis tests used non-parametric methods due to non-normality, which are more conservative (less statistical power) than parametric alternatives.
The regression model uses cross-sectional data without controls for inflation or market conditions; the Project_Year variable is a rough proxy.
Some project IDs appear multiple times (e.g., different phases or proposals for the same client), which may introduce mild non-independence.

Future work with more data, time, or computing power:

Collect client industry codes and location data (mainland Lagos vs. island) to model geographic and sectoral pricing effects.
Build a time-series model of project pipeline values to forecast quarterly revenue.
Introduce design team size and junior/senior engineer ratio as predictors to model the labour-cost side of project economics.
Develop an interactive dashboard (Shiny or Streamlit) for real-time portfolio monitoring and proposal pricing.

References

Adi, B. (2026). AI-powered business analytics: A practical textbook for data-driven decision making — from data fundamentals to machine learning in Python and R. Lagos Business School / markanalytics.online. https://markanalytics.online

Mba, C. (2026). CA Consultants Limited MEP project register (2006–2026) [Dataset]. Collected from CA Consultants Limited, Project Management Department, Lagos, Nigeria. Data available on request from the author.

R Core Team. (2024). R: A language and environment for statistical computing (Version 4.x). R Foundation for Statistical Computing. https://www.R-project.org/

Van Rossum, G., & Drake, F. L. (2009). Python 3 reference manual. CreateSpace.

Wickham, H., Averick, M., Bryan, J., Chang, W., McGowan, L., François, R., Grolemund, G., Hayes, A., Henry, L., Hester, J., Kuhn, M., Pedersen, T. L., Miller, E., Bache, S. M., Müller, K., Ooms, J., Robinson, D., Seidel, D. P., Spinu, V., … Yutani, H. (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686. https://doi.org/10.21105/joss.01686

Wickham, H. (2016). ggplot2: Elegant graphics for data analysis. Springer. https://doi.org/10.1007/978-3-319-24277-4

Sievert, C. (2020). Interactive web-based data visualization with R, plotly, and shiny. Chapman and Hall/CRC. https://plotly-r.com

Kassambara, A. (2023). ggcorrplot: Visualization of a correlation matrix using ggplot2 (R package). https://CRAN.R-project.org/package=ggcorrplot

Fox, J., & Weisberg, S. (2019). An R companion to applied regression (3rd ed.). Sage. https://socialsciences.mcmaster.ca/jfox/Books/Companion/

McKinney, W. (2010). Data structures for statistical computing in Python. In Proceedings of the 9th Python in Science Conference (pp. 56–61). https://doi.org/10.25080/Majora-92bf1922-00a

Pedersen, T. L. (2024). patchwork: The composer of plots (R package). https://CRAN.R-project.org/package=patchwork

Kassambara, A. (2023). rstatix: Pipe-friendly framework for basic statistical tests (R package). https://CRAN.R-project.org/package=rstatix

Allaire, J. J., Teague, C., Scheidegger, C., Xie, Y., & Dervieux, C. (2022). Quarto (Version 1.x) [Computer software]. https://doi.org/10.5281/zenodo.5960048

Appendix: AI Usage Statement

This submission was prepared with the assistance of Claude (Anthropic), an AI coding assistant. Claude was used for: (1) generating and debugging R and Python code for data cleaning, statistical tests, and visualisation; (2) formatting the Quarto document structure and custom CSS styling; and (3) suggesting appropriate statistical methods given the data characteristics (e.g., recommending Kruskal-Wallis over ANOVA due to non-normality).

All analytical decisions were made independently by the author, including: the choice of Case Study 1 over alternatives, the selection of project value as the outcome variable, the formulation of both hypotheses based on real business questions at CA Consultants, the interpretation of all statistical outputs, and the integrated recommendation. The data was collected personally from CA Consultants’ internal systems. The author can explain every line of code and every result in this document.

--- title: "Engineering Value" subtitle: "An Exploratory & Inferential Study of MEP Project Performance at CA Consultants Limited" author: "Chidubem Mba — Senior Project Engineer & Project Manager" date: today format: html: theme: darkly toc: true toc-depth: 3 toc-title: "Navigation" code-fold: true code-tools: true code-summary: "View Code" self-contained: true smooth-scroll: true fig-align: center fig-width: 10 fig-height: 6 execute: warning: false message: false echo: true --- ```{=html} <style> /* ===== CUSTOM DARK MODERN THEME ===== */ /* Google Fonts */ @import url('https://fonts.googleapis.com/css2?family=Inter:wght@300;400;500;600;700;800&family=JetBrains+Mono:wght@400;500&display=swap'); /* Root Variables */ :root { --bg-primary: #0d1117; --bg-secondary: #161b22; --bg-tertiary: #1c2333; --border-color: #30363d; --border-accent: #58a6ff; --text-primary: #f0f6fc; --text-secondary: #c9d1d9; --text-muted: #8b949e; --accent-cyan: #00d4ff; --accent-magenta: #ff006e; --accent-amber: #ffbe0b; --accent-purple: #7928ca; --accent-green: #00ff87; --gradient-main: linear-gradient(135deg, #00d4ff 0%, #7928ca 50%, #ff006e 100%); --gradient-warm: linear-gradient(135deg, #ff006e 0%, #ffbe0b 100%); --gradient-cool: linear-gradient(135deg, #00d4ff 0%, #00ff87 100%); --shadow-glow: 0 0 30px rgba(0, 212, 255, 0.15); --shadow-card: 0 8px 32px rgba(0, 0, 0, 0.4); } /* Global */ body { font-family: 'Inter', -apple-system, BlinkMacSystemFont, sans-serif !important; background: var(--bg-primary) !important; color: var(--text-secondary) !important; line-height: 1.7; letter-spacing: 0.01em; } /* Title Block */ .quarto-title-block { padding: 3rem 0 2rem 0; margin-bottom: 2rem; border-bottom: 2px solid transparent; border-image: var(--gradient-main) 1; } .quarto-title .title { font-family: 'Inter', sans-serif !important; font-weight: 800 !important; font-size: 2.8rem !important; background: var(--gradient-main); -webkit-background-clip: text; -webkit-text-fill-color: transparent; background-clip: text; line-height: 1.2 !important; margin-bottom: 0.5rem !important; } .quarto-title .subtitle { font-size: 1.1rem !important; color: var(--text-muted) !important; font-weight: 400 !important; letter-spacing: 0.05em; text-transform: uppercase; } /* Section Headings */ h1, .h1 { font-family: 'Inter', sans-serif !important; font-weight: 700 !important; font-size: 2rem !important; background: var(--gradient-main); -webkit-background-clip: text; -webkit-text-fill-color: transparent; background-clip: text; padding-top: 1.5rem; margin-top: 3rem !important; border-top: 1px solid var(--border-color); } h2, .h2 { font-family: 'Inter', sans-serif !important; font-weight: 600 !important; font-size: 1.5rem !important; color: var(--accent-cyan) !important; margin-top: 2rem !important; } h3, .h3 { font-family: 'Inter', sans-serif !important; font-weight: 600 !important; font-size: 1.2rem !important; color: var(--text-primary) !important; } /* Main Content Container */ #quarto-content { background: var(--bg-primary); } .content { max-width: 1100px; } /* Cards / Section Containers */ .card-section { background: var(--bg-secondary); border: 1px solid var(--border-color); border-radius: 12px; padding: 2rem; margin: 1.5rem 0; box-shadow: var(--shadow-card); transition: border-color 0.3s ease, box-shadow 0.3s ease; } .card-section:hover { border-color: var(--border-accent); box-shadow: var(--shadow-glow); } /* Metric Cards Grid */ .metric-grid { display: grid; grid-template-columns: repeat(auto-fit, minmax(200px, 1fr)); gap: 1.2rem; margin: 2rem 0; } .metric-card { background: var(--bg-secondary); border: 1px solid var(--border-color); border-radius: 12px; padding: 1.5rem; text-align: center; position: relative; overflow: hidden; transition: all 0.3s ease; } .metric-card::before { content: ''; position: absolute; top: 0; left: 0; right: 0; height: 3px; background: var(--gradient-main); } .metric-card:hover { transform: translateY(-4px); box-shadow: var(--shadow-glow); border-color: var(--accent-cyan); } .metric-label { display: block; font-size: 0.8rem; text-transform: uppercase; letter-spacing: 0.1em; color: var(--text-muted); margin-bottom: 0.5rem; font-weight: 500; } .metric-value { display: block; font-size: 2rem; font-weight: 700; background: var(--gradient-main); -webkit-background-clip: text; -webkit-text-fill-color: transparent; background-clip: text; } .metric-sub { display: block; font-size: 0.85rem; color: var(--text-muted); margin-top: 0.3rem; } /* Callout Boxes */ .callout { background: var(--bg-secondary) !important; border-radius: 10px !important; border-left: 4px solid var(--accent-cyan) !important; box-shadow: var(--shadow-card) !important; } .callout-note { border-left-color: var(--accent-cyan) !important; } .callout-important { border-left-color: var(--accent-magenta) !important; } .callout-tip { border-left-color: var(--accent-green) !important; } /* Tab Styling */ .nav-tabs { border-bottom: 2px solid var(--border-color) !important; margin-bottom: 1.5rem; } .nav-tabs .nav-link { font-family: 'Inter', sans-serif !important; font-weight: 600 !important; font-size: 0.95rem !important; color: var(--text-muted) !important; border: none !important; padding: 0.8rem 1.5rem !important; border-radius: 8px 8px 0 0 !important; transition: all 0.3s ease !important; text-transform: uppercase; letter-spacing: 0.05em; } .nav-tabs .nav-link:hover { color: var(--accent-cyan) !important; background: rgba(0, 212, 255, 0.08) !important; } .nav-tabs .nav-link.active { color: var(--accent-cyan) !important; background: var(--bg-secondary) !important; border-bottom: 3px solid var(--accent-cyan) !important; } /* Code Blocks */ pre { background: var(--bg-primary) !important; border: 1px solid var(--border-color) !important; border-radius: 10px !important; padding: 1.2rem !important; font-family: 'JetBrains Mono', 'Fira Code', monospace !important; font-size: 0.85rem !important; box-shadow: inset 0 2px 8px rgba(0,0,0,0.3); } code { font-family: 'JetBrains Mono', 'Fira Code', monospace !important; } .code-fold-btn-container .btn { font-family: 'Inter', sans-serif !important; font-size: 0.8rem !important; color: var(--accent-cyan) !important; border-color: var(--border-color) !important; background: var(--bg-secondary) !important; border-radius: 6px !important; } /* Tables */ .table { background: var(--bg-secondary) !important; color: var(--text-secondary) !important; border-color: var(--border-color) !important; } .table thead th { background: var(--bg-tertiary) !important; color: var(--accent-cyan) !important; border-color: var(--border-color) !important; font-weight: 600; text-transform: uppercase; font-size: 0.8rem; letter-spacing: 0.05em; } .table td { border-color: var(--border-color) !important; } .dataTables_wrapper { color: var(--text-secondary) !important; } .dataTables_wrapper .dataTables_filter input, .dataTables_wrapper .dataTables_length select { background: var(--bg-primary) !important; color: var(--text-secondary) !important; border: 1px solid var(--border-color) !important; border-radius: 6px !important; } .dataTables_wrapper .dataTables_paginate .paginate_button { color: var(--text-secondary) !important; } .dataTables_wrapper .dataTables_paginate .paginate_button.current { background: var(--accent-cyan) !important; color: var(--bg-primary) !important; border: none !important; border-radius: 6px !important; } /* TOC Sidebar */ #TOC { background: var(--bg-secondary) !important; border-right: 1px solid var(--border-color) !important; padding: 1rem !important; } #TOC a { color: var(--text-muted) !important; transition: color 0.2s ease; } #TOC a:hover, #TOC a.active { color: var(--accent-cyan) !important; } /* Scrollbar */ ::-webkit-scrollbar { width: 8px; height: 8px; } ::-webkit-scrollbar-track { background: var(--bg-primary); } ::-webkit-scrollbar-thumb { background: var(--border-color); border-radius: 4px; } ::-webkit-scrollbar-thumb:hover { background: var(--accent-cyan); } /* Plotly toolbar */ .modebar { background: transparent !important; } /* Smooth section reveals */ section { animation: fadeInUp 0.6s ease-out; } @keyframes fadeInUp { from { opacity: 0; transform: translateY(20px); } to { opacity: 1; transform: translateY(0); } } /* Business insight callout */ .insight-box { background: linear-gradient(135deg, rgba(0,212,255,0.08), rgba(121,40,202,0.08)); border: 1px solid rgba(0,212,255,0.3); border-radius: 10px; padding: 1.5rem; margin: 1.5rem 0; position: relative; } .insight-box::before { content: '💡'; position: absolute; top: -12px; left: 16px; font-size: 1.5rem; background: var(--bg-primary); padding: 0 0.5rem; } /* Hypothesis result box */ .result-box { background: var(--bg-secondary); border-left: 4px solid var(--accent-green); border-radius: 0 10px 10px 0; padding: 1.2rem 1.5rem; margin: 1rem 0; } .result-box.reject { border-left-color: var(--accent-magenta); } /* Image and figure styling */ .quarto-figure { background: var(--bg-secondary); border-radius: 12px; padding: 1rem; border: 1px solid var(--border-color); } /* Blockquotes */ blockquote { border-left: 3px solid var(--accent-purple) !important; background: rgba(121, 40, 202, 0.05) !important; padding: 1rem 1.5rem !important; border-radius: 0 8px 8px 0 !important; font-style: italic; color: var(--text-muted) !important; } /* Selection colour */ ::selection { background: rgba(0, 212, 255, 0.3); color: var(--text-primary); } /* Print-friendly overrides */ @media print { body { background: white !important; color: black !important; } h1, h2, h3 { -webkit-text-fill-color: #1a1a2e !important; } .metric-value { -webkit-text-fill-color: #1a1a2e !important; } } </style> ```    ```{r} #| label: setup #| include: false # ---- Install missing packages (runs silently) ---- required_pkgs <- c( "tidyverse", "plotly", "DT", "ggcorrplot", "car", "broom", "scales", "kableExtra", "patchwork", "rstatix", "moments", "viridis", "effectsize" ) invisible(lapply(required_pkgs, function(pkg) { if (!requireNamespace(pkg, quietly = TRUE)) install.packages(pkg, quiet = TRUE) })) # ---- Load libraries ---- library(tidyverse) library(plotly) library(DT) library(ggcorrplot) library(car) library(broom) library(scales) library(kableExtra) library(patchwork) library(rstatix) library(moments) library(viridis) library(effectsize) # ---- Custom ggplot2 dark theme ---- theme_ca_dark <- function(base_size = 13) { theme_minimal(base_size = base_size) %+replace% theme( plot.background = element_rect(fill = "#161b22", colour = NA), panel.background = element_rect(fill = "#161b22", colour = NA), panel.grid.major = element_line(colour = "#30363d", linewidth = 0.3), panel.grid.minor = element_blank(), text = element_text(colour = "#c9d1d9", family = "sans"), plot.title = element_text(colour = "#f0f6fc", size = 16, face = "bold", margin = margin(b = 8)), plot.subtitle = element_text(colour = "#8b949e", size = 12, margin = margin(b = 12)), plot.caption = element_text(colour = "#8b949e", size = 9, hjust = 1), axis.text = element_text(colour = "#8b949e", size = 10), axis.title = element_text(colour = "#c9d1d9", size = 11), legend.background = element_rect(fill = "#161b22", colour = NA), legend.key = element_rect(fill = "#161b22", colour = NA), legend.text = element_text(colour = "#c9d1d9"), legend.title = element_text(colour = "#f0f6fc", face = "bold"), strip.text = element_text(colour = "#f0f6fc", face = "bold", size = 11), plot.margin = margin(15, 15, 15, 15) ) } # ---- Colour palettes ---- pal_type <- c("Commercial" = "#00d4ff", "Industrial" = "#ff006e", "Residential" = "#ffbe0b") pal_status <- c("Completed" = "#00ff87", "Ongoing" = "#ffbe0b", "Proposal" = "#8b949e") pal_disc <- c("MEP" = "#00d4ff", "E Only" = "#ff006e", "M Only" = "#ffbe0b") ca_accent <- c("#00d4ff", "#ff006e", "#ffbe0b", "#7928ca", "#00ff87", "#ff6b35") # Apply theme globally theme_set(theme_ca_dark()) ``` ```{r} #| label: load-data #| include: false # ---- Load and clean data ---- df_raw <- read_csv("mep_projects.csv", show_col_types = FALSE) df <- df_raw %>% mutate( # Handle Project_Value_NGN whether read_csv parsed it as character or numeric Project_Value = if (is.character(Project_Value_NGN)) parse_number(Project_Value_NGN) else as.numeric(Project_Value_NGN), Design_Duration_Days = as.numeric(Design_Duration_Days), Revision_Count = as.numeric(Revision_Count), Project_Year = as.integer(Project_Year), Project_Type = factor(Project_Type, levels = c("Commercial", "Industrial", "Residential")), Project_Status = factor(Project_Status, levels = c("Completed", "Ongoing", "Proposal")), Discipline = factor(Discipline), # Derived variables Value_Millions = Project_Value / 1e6, Log_Value = log(Project_Value), Era = case_when( Project_Year <= 2017 ~ "Early (2006-2017)", Project_Year <= 2022 ~ "Growth (2018-2022)", TRUE ~ "Recent (2023-2026)" ), Era = factor(Era, levels = c("Early (2006-2017)", "Growth (2018-2022)", "Recent (2023-2026)")), Is_Design = !is.na(Design_Duration_Days), Value_Category = case_when( Value_Millions < 15 ~ "Small (<₦15M)", Value_Millions < 60 ~ "Medium (₦15-60M)", Value_Millions < 150 ~ "Large (₦60-150M)", TRUE ~ "Mega (>₦150M)" ), Value_Category = factor(Value_Category, levels = c( "Small (<₦15M)", "Medium (₦15-60M)", "Large (₦60-150M)", "Mega (>₦150M)" )) ) # Subset with complete design data df_design <- df %>% filter(!is.na(Design_Duration_Days) & !is.na(Revision_Count)) ```    # 1. Executive Summary This study analyses the project portfolio of **CA Consultants Limited**, a leading MEP (Mechanical, Electrical, and Plumbing) consulting firm headquartered in Lagos, Nigeria. Drawing on `r nrow(df)` real project records spanning `r min(df$Project_Year, na.rm=TRUE)`--`r max(df$Project_Year, na.rm=TRUE)`, extracted from the firm's internal project management systems, I investigate **what drives project value** across different building types, service disciplines, and design complexity levels. Five foundational analytical techniques --- exploratory data analysis, data visualisation, hypothesis testing, correlation analysis, and multiple linear regression --- are applied to move from exploration to inference. Key findings reveal that the portfolio is heavily right-skewed, with a small number of high-value projects contributing disproportionately to total revenue. Commercial projects command significantly higher values than residential and industrial engagements, and design duration together with revision count are meaningful predictors of project value. The analysis yields an actionable pricing and resource-allocation framework: **longer-duration, multi-discipline MEP projects consistently deliver greater value, and the firm should prioritise commercial engagements while investing in design-phase efficiency to reduce costly revision cycles.**    # 2. Professional Disclosure **Name:** Chidubem Mba **Title:** Senior Project Engineer & Project Manager **Organisation:** CA Consultants Limited, Lagos, Nigeria **Sector:** Construction / Building Services (MEP Engineering) CA Consultants Limited is a Mechanical, Electrical, and Plumbing (MEP) consulting firm operating in the Nigerian construction industry. The firm provides design and construction supervision services for commercial, industrial, and residential buildings --- from office towers and factories to luxury residences. As a Senior Project Engineer and Project Manager, I am responsible for managing design deliverables, coordinating with architects and contractors, overseeing project timelines, and ensuring quality of MEP installations. ### Technique Justification **Exploratory Data Analysis (EDA):** Understanding the composition and quality of our project data is the essential first step. As a project manager, I need to know the distribution of project sizes, identify data gaps (such as supervision-only projects lacking design records), and detect anomalous entries before any strategic conclusions can be drawn. EDA allows me to assess the health of our data and uncover patterns that inform all subsequent analyses. **Data Visualisation:** Communicating project performance to firm leadership and clients requires clear, compelling visual narratives. Visualisation transforms raw project records into actionable intelligence --- showing at a glance which sectors drive revenue, how portfolio composition has shifted over time, and where design complexity concentrates. Every plot in this study answers a question a principal at CA Consultants would ask in a management meeting. **Hypothesis Testing:** Resource allocation decisions at CA Consultants --- such as whether to pursue more commercial versus industrial work --- must be grounded in statistical evidence, not intuition. Formal hypothesis testing allows me to determine whether observed differences in project value across building types are statistically significant or merely due to sampling variability. This directly informs our business development strategy. **Correlation Analysis:** Understanding the relationships between project characteristics --- value, design duration, revision count, and scope --- is critical for project planning. If design duration correlates strongly with project value, that relationship shapes how we quote fees and allocate engineering hours. Identifying which correlations are strongest, and which are spurious, helps avoid costly planning errors. **Linear Regression:** The ultimate goal is a quantitative model that explains and predicts project value based on measurable project attributes. A regression model allows me to quantify the marginal contribution of each extra design day, each additional revision, and each project type to the contract value. This directly supports fee-setting, proposal evaluation, and capacity planning at CA Consultants.    # 3. Data Collection & Sampling ### Source and Collection Method The dataset was extracted directly from the **internal project register** of CA Consultants Limited. This register is maintained by the project management team (of which I am a member) and records every project the firm undertakes --- from initial proposal through to completion. Data extraction was performed manually from the company's enterprise filing system and project tracking spreadsheets in April 2026. ### Sampling Frame and Size The dataset constitutes a **census of all identifiable projects** in the firm's records from 2006 to 2026, comprising **`r nrow(df)` project records**. This is not a sample but a near-complete enumeration of the firm's project history, subject to the availability of records in the company's systems. Projects prior to 2006 were excluded due to incomplete digitisation of older records. ### Variables Collected | Variable | Type | Description | |:---|:---|:---| | `Project_ID` | Identifier | Unique project code assigned by the firm | | `Project_Type` | Categorical | Building type: Commercial, Industrial, or Residential | | `Project_Status` | Categorical | Current status: Completed, Ongoing, or Proposal | | `Project_Value_NGN` | Numeric (continuous) | Contract value in Nigerian Naira (₦) | | `Design_Duration_Days` | Numeric (continuous) | Duration of the design phase in calendar days | | `Revision_Count` | Numeric (discrete) | Number of design revisions issued | | `Project_Year` | Date/Time (year) | Year the project was initiated | | `Discipline` | Categorical | Scope of MEP services: MEP (full), E Only, M Only | ### Handling of Missing Data A total of **`r sum(is.na(df$Design_Duration_Days))` observations** are missing `Design_Duration_Days` and **`r sum(is.na(df$Revision_Count))` observations** are missing `Revision_Count`. These are **structurally missing** --- not data-entry errors. These projects were **construction supervision engagements only**: CA Consultants supervised the on-site installation of MEP systems designed by another firm. Since no design work was performed in-house, no design duration or revision count exists. This is documented and handled transparently throughout the analysis --- these records are retained for portfolio-level analyses (value, type, year) and excluded only from design-specific analyses (correlation, regression). ### Ethical Statement This data was collected in my capacity as a Senior Project Engineer with authorised access to the firm's project records. All project identifiers use internal codes (e.g., CA658, CA042/PJ027) that do not reveal client names. Financial values are contract-level figures, not client-confidential breakdowns. No personally identifiable information (PII) is included. Permission to use anonymised project data for academic purposes was obtained from CA Consultants management.    # 4. Data Description ```{r} #| label: data-metrics #| echo: false n_proj <- nrow(df) total_val <- sum(df$Project_Value, na.rm = TRUE) avg_val <- mean(df$Project_Value, na.rm = TRUE) med_val <- median(df$Project_Value, na.rm = TRUE) n_completed <- sum(df$Project_Status == "Completed") avg_dur <- mean(df$Design_Duration_Days, na.rm = TRUE) htmltools::browsable(htmltools::HTML(sprintf(' <div class="metric-grid"> <div class="metric-card"> <span class="metric-label">Total Projects</span> <span class="metric-value">%d</span> <span class="metric-sub">2006 — 2026</span> </div> <div class="metric-card"> <span class="metric-label">Portfolio Value</span> <span class="metric-value">₦%.1fB</span> <span class="metric-sub">Aggregate Contract Value</span> </div> <div class="metric-card"> <span class="metric-label">Median Project Value</span> <span class="metric-value">₦%.1fM</span> <span class="metric-sub">Central tendency</span> </div> <div class="metric-card"> <span class="metric-label">Completed Projects</span> <span class="metric-value">%d</span> <span class="metric-sub">%.0f%%%% completion rate</span> </div> <div class="metric-card"> <span class="metric-label">Avg. Design Duration</span> <span class="metric-value">%.0f days</span> <span class="metric-sub">Design-phase projects</span> </div> </div> ', n_proj, total_val/1e9, med_val/1e6, n_completed, 100*n_completed/n_proj, avg_dur))) ``` ## 4.1 Summary Statistics ```{r} #| label: summary-stats df %>% select(Project_Value, Design_Duration_Days, Revision_Count, Project_Year) %>% pivot_longer(everything(), names_to = "Variable", values_to = "Value") %>% group_by(Variable) %>% summarise( N = sum(!is.na(Value)), Missing = sum(is.na(Value)), Mean = mean(Value, na.rm = TRUE), Median = median(Value, na.rm = TRUE), SD = sd(Value, na.rm = TRUE), Min = min(Value, na.rm = TRUE), Max = max(Value, na.rm = TRUE), Skewness = skewness(Value, na.rm = TRUE), .groups = "drop" ) %>% mutate(across(where(is.numeric), ~round(., 2))) %>% kbl(caption = "Descriptive Statistics for Numeric Variables", format.args = list(big.mark = ",")) %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = TRUE, font_size = 13) %>% row_spec(0, bold = TRUE, color = "#00d4ff", background = "#1c2333") ``` ## 4.2 Categorical Variable Distribution ```{r} #| label: cat-distribution cat_summary <- bind_rows( df %>% count(Project_Type) %>% rename(Category = Project_Type) %>% mutate(Variable = "Project_Type"), df %>% count(Project_Status) %>% rename(Category = Project_Status) %>% mutate(Variable = "Project_Status"), df %>% count(Discipline) %>% rename(Category = Discipline) %>% mutate(Variable = "Discipline") ) %>% group_by(Variable) %>% mutate(Pct = round(100 * n / sum(n), 1)) %>% ungroup() %>% select(Variable, Category, Count = n, `%` = Pct) cat_summary %>% kbl(caption = "Frequency Distribution of Categorical Variables") %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = TRUE, font_size = 13) %>% row_spec(0, bold = TRUE, color = "#00d4ff", background = "#1c2333") %>% pack_rows(index = table(cat_summary$Variable)) ``` ## 4.3 Interactive Data Explorer ```{r} #| label: data-table df %>% select(Project_ID, Project_Type, Project_Status, Value_Millions, Design_Duration_Days, Revision_Count, Project_Year, Discipline) %>% mutate(Value_Millions = round(Value_Millions, 2)) %>% rename(`Value (₦M)` = Value_Millions, `Duration (Days)` = Design_Duration_Days, Revisions = Revision_Count, Year = Project_Year, Type = Project_Type, Status = Project_Status, ID = Project_ID) %>% datatable( caption = "Full Project Dataset — CA Consultants Limited", filter = "top", options = list(pageLength = 10, scrollX = TRUE, dom = 'Bfrtip', initComplete = JS( "function(settings, json) {", " $(this.api().table().header()).css({'background-color': '#1c2333', 'color': '#00d4ff'});", "}" )), rownames = FALSE ) ```    # 5. Exploratory Data Analysis (EDA) > *"Far better an approximate answer to the right question... than an exact answer to the wrong question." --- John Tukey* **Business Justification:** Before building models or testing hypotheses, a project manager must understand the shape and quality of the data that drives decisions. EDA reveals whether our project records are complete, whether extreme values distort averages, and what the underlying distributions look like. At CA Consultants, this means understanding whether our reported portfolio metrics faithfully represent actual project economics. ::: {.panel-tabset} ## R ### 5.1 Missing Value Analysis ```{r} #| label: eda-missing missing_summary <- df %>% summarise(across(everything(), ~sum(is.na(.)))) %>% pivot_longer(everything(), names_to = "Variable", values_to = "Missing") %>% mutate( Total = nrow(df), `% Missing` = round(100 * Missing / Total, 1), Status = if_else(Missing == 0, "Complete", "Has Missing") ) %>% filter(Missing > 0) %>% arrange(desc(Missing)) missing_summary %>% kbl(caption = "Variables with Missing Values") %>% kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE, font_size = 13) %>% row_spec(0, bold = TRUE, color = "#00d4ff", background = "#1c2333") ``` ```{=html} <div class="insight-box"> <strong>Data Quality Finding 1 — Structurally Missing Values:</strong> Design_Duration_Days and Revision_Count are missing for projects where CA Consultants provided <strong>construction supervision only</strong> — no design work was performed in-house, so no design duration or revision history exists. These are not data-entry errors but reflect a genuine operational distinction. These records are retained for portfolio-level analysis and excluded from design-specific modelling. </div> ``` ### 5.2 Outlier Detection ```{r} #| label: eda-outliers # IQR-based outlier detection on Project Value Q1 <- quantile(df$Project_Value, 0.25, na.rm = TRUE) Q3 <- quantile(df$Project_Value, 0.75, na.rm = TRUE) IQR_val <- Q3 - Q1 lower_fence <- Q1 - 1.5 * IQR_val upper_fence <- Q3 + 1.5 * IQR_val outliers <- df %>% filter(Project_Value > upper_fence | Project_Value < lower_fence) cat(sprintf("**IQR Method for Project Value:**\n\n- Q1 = ₦%s | Q3 = ₦%s | IQR = ₦%s\n- Upper fence = ₦%s\n- **%d outliers detected** (%.1f%% of data) — all on the high end\n", comma(Q1), comma(Q3), comma(IQR_val), comma(upper_fence), nrow(outliers), 100*nrow(outliers)/nrow(df))) ``` ```{r} #| label: eda-outlier-plot #| fig-cap: "Project Value Distribution with Outlier Boundaries" ggplot(df, aes(x = Project_Value / 1e6)) + geom_histogram(aes(y = after_stat(density)), bins = 30, fill = "#00d4ff", alpha = 0.6, colour = "#161b22") + geom_density(colour = "#ff006e", linewidth = 1.2) + geom_vline(xintercept = upper_fence / 1e6, colour = "#ffbe0b", linetype = "dashed", linewidth = 0.8) + annotate("text", x = upper_fence / 1e6 + 15, y = Inf, label = "Outlier Fence", colour = "#ffbe0b", vjust = 2, hjust = 0, size = 3.5) + labs(title = "Distribution of Project Values", subtitle = "Right-skewed with high-value outliers typical of construction portfolios", x = "Project Value (₦ Millions)", y = "Density") + scale_x_continuous(labels = comma_format()) ``` ```{=html} <div class="insight-box"> <strong>Data Quality Finding 2 — Right Skewness:</strong> Project values are heavily right-skewed (skewness ≈ 2+), meaning a few mega-projects (>₦150M) pull the mean well above the median. This is typical of construction consulting portfolios where a small number of landmark projects generate disproportionate revenue. For regression analysis, a <strong>log-transformation</strong> will be applied to stabilise variance and improve model fit. </div> ``` ### 5.3 Distribution by Category ```{r} #| label: eda-distributions #| fig-height: 8 p1 <- ggplot(df, aes(x = Project_Type, fill = Project_Type)) + geom_bar(width = 0.6, alpha = 0.85) + scale_fill_manual(values = pal_type) + labs(title = "Projects by Building Type", x = NULL, y = "Count") + guides(fill = "none") p2 <- ggplot(df, aes(x = Project_Status, fill = Project_Status)) + geom_bar(width = 0.6, alpha = 0.85) + scale_fill_manual(values = pal_status) + labs(title = "Projects by Status", x = NULL, y = "Count") + guides(fill = "none") p3 <- ggplot(df, aes(x = Discipline, fill = Discipline)) + geom_bar(width = 0.6, alpha = 0.85) + scale_fill_manual(values = pal_disc) + labs(title = "Projects by Discipline Scope", x = NULL, y = "Count") + guides(fill = "none") (p1 | p2 | p3) + plot_annotation( title = "Portfolio Composition at a Glance", subtitle = "Commercial projects and full-scope MEP engagements dominate the portfolio", theme = theme_ca_dark() ) ``` ## Python ```{python} #| label: eda-python import pandas as pd import numpy as np import matplotlib.pyplot as plt import matplotlib matplotlib.rcParams.update({ 'figure.facecolor': '#161b22', 'axes.facecolor': '#161b22', 'axes.edgecolor': '#30363d', 'axes.labelcolor': '#c9d1d9', 'text.color': '#c9d1d9', 'xtick.color': '#8b949e', 'ytick.color': '#8b949e', 'grid.color': '#30363d', 'font.family': 'sans-serif', 'font.size': 11 }) # Load and clean data df_py = pd.read_csv("mep_projects.csv") df_py['Project_Value'] = df_py['Project_Value_NGN'].str.replace(r'[₦, ]', '', regex=True).astype(float) df_py['Value_Millions'] = df_py['Project_Value'] / 1e6 # Summary statistics print("=" * 60) print("NUMERIC VARIABLE SUMMARY") print("=" * 60) print(df_py[['Project_Value', 'Design_Duration_Days', 'Revision_Count', 'Project_Year']].describe().round(2).to_string()) print("\n" + "=" * 60) print("MISSING VALUES") print("=" * 60) print(df_py.isnull().sum().to_string()) print("\n" + "=" * 60) print("CATEGORICAL DISTRIBUTIONS") print("=" * 60) for col in ['Project_Type', 'Project_Status', 'Discipline']: print(f"\n{col}:") counts = df_py[col].value_counts() for val, cnt in counts.items(): print(f" {val}: {cnt} ({100*cnt/len(df_py):.1f}%)") ``` ```{python} #| label: eda-python-plot #| fig-cap: "Distribution of Project Values (Python/Matplotlib)" fig, axes = plt.subplots(1, 3, figsize=(14, 5)) # Histogram of values axes[0].hist(df_py['Value_Millions'].dropna(), bins=25, color='#00d4ff', alpha=0.7, edgecolor='#161b22') axes[0].set_title('Project Value Distribution', color='#f0f6fc', fontweight='bold') axes[0].set_xlabel('Value (₦ Millions)') axes[0].set_ylabel('Frequency') # Bar chart: Project Type type_counts = df_py['Project_Type'].value_counts() colors_type = ['#00d4ff', '#ff006e', '#ffbe0b'] axes[1].barh(type_counts.index, type_counts.values, color=colors_type[:len(type_counts)], alpha=0.85) axes[1].set_title('Projects by Type', color='#f0f6fc', fontweight='bold') axes[1].set_xlabel('Count') # Bar chart: Status status_counts = df_py['Project_Status'].value_counts() colors_status = ['#00ff87', '#ffbe0b', '#8b949e'] axes[2].barh(status_counts.index, status_counts.values, color=colors_status[:len(status_counts)], alpha=0.85) axes[2].set_title('Projects by Status', color='#f0f6fc', fontweight='bold') axes[2].set_xlabel('Count') plt.tight_layout() plt.show() ``` :::    # 6. Data Visualisation **Business Justification:** Visualisation transforms CA Consultants' project records into a narrative that leadership can act on. Each plot below answers a specific strategic question: Where does our revenue concentrate? How has our portfolio evolved? What is the relationship between design effort and project scale? Choosing the right chart type for each question --- and explaining why --- is itself an analytical skill. ::: {.panel-tabset} ## R ### 6.1 Portfolio Value by Building Type ```{r} #| label: viz-violin #| fig-cap: "Distribution of Project Values by Building Type — violin + jitter reveals the full shape, not just the average" p <- ggplot(df, aes(x = Project_Type, y = Value_Millions, fill = Project_Type)) + geom_violin(alpha = 0.3, colour = NA, scale = "width") + geom_boxplot(width = 0.15, alpha = 0.6, outlier.shape = NA, colour = "#c9d1d9") + geom_jitter(aes(colour = Project_Type), width = 0.12, alpha = 0.6, size = 2) + scale_fill_manual(values = pal_type) + scale_colour_manual(values = pal_type) + labs(title = "Project Value Distribution by Building Type", subtitle = "Commercial projects show the widest range; Residential clusters at lower values", x = NULL, y = "Project Value (₦ Millions)") + guides(fill = "none", colour = "none") + scale_y_continuous(labels = comma_format()) ggplotly(p, tooltip = c("y")) %>% layout(paper_bgcolor = '#0d1117', plot_bgcolor = '#161b22', font = list(color = '#c9d1d9')) ``` > **Why a violin + boxplot?** A bar chart would show only the mean, hiding the skewness and multimodality that define our portfolio. The violin reveals that most commercial projects cluster below ₦100M, with a long tail of high-value engagements. This shape directly informs risk: our revenue depends on winning a few large contracts. ### 6.2 Portfolio Evolution Over Time ```{r} #| label: viz-timeline #| fig-cap: "Annual project count and composition showing the firm's growth trajectory" year_type <- df %>% count(Project_Year, Project_Type) %>% complete(Project_Year, Project_Type, fill = list(n = 0)) ggplot(year_type, aes(x = Project_Year, y = n, fill = Project_Type)) + geom_area(alpha = 0.7, position = "stack") + scale_fill_manual(values = pal_type) + labs(title = "CA Consultants Project Volume Over Time", subtitle = "Accelerating growth since 2020 driven by commercial engagements", x = "Year", y = "Number of Projects", fill = "Building Type") + scale_x_continuous(breaks = seq(2006, 2026, 2)) ``` ### 6.3 Design Complexity: Duration vs. Revisions ```{r} #| label: viz-scatter #| fig-cap: "Each bubble is a project — size encodes value, colour encodes building type" p_scatter <- df %>% filter(!is.na(Design_Duration_Days) & !is.na(Revision_Count)) %>% ggplot(aes(x = Design_Duration_Days, y = Revision_Count, size = Value_Millions, colour = Project_Type, text = paste0("ID: ", Project_ID, "<br>Value: ₦", round(Value_Millions, 1), "M"))) + geom_point(alpha = 0.7) + scale_colour_manual(values = pal_type) + scale_size_continuous(range = c(2, 15), labels = comma_format(suffix = "M")) + labs(title = "Design Complexity Landscape", subtitle = "Longer design phases and more revisions correlate with higher-value projects", x = "Design Duration (Days)", y = "Number of Revisions", colour = "Building Type", size = "Value (₦M)") ggplotly(p_scatter, tooltip = "text") %>% layout(paper_bgcolor = '#0d1117', plot_bgcolor = '#161b22', font = list(color = '#c9d1d9')) ``` > **Why a bubble chart?** Three dimensions of project complexity --- duration, revision count, and value --- are encoded simultaneously. This reveals that the upper-right quadrant (long duration, many revisions) is exclusively occupied by high-value commercial projects, validating the intuition that complex projects command premium fees. ### 6.4 Revenue Concentration by Project Size Category ```{r} #| label: viz-treemap #| fig-cap: "Comparing the number of projects versus the revenue they generate" size_summary <- df %>% group_by(Value_Category) %>% summarise( Count = n(), Total_Value = sum(Project_Value, na.rm = TRUE) / 1e9, .groups = "drop" ) %>% mutate( Pct_Count = 100 * Count / sum(Count), Pct_Value = 100 * Total_Value / sum(Total_Value) ) p_conc <- size_summary %>% pivot_longer(cols = c(Pct_Count, Pct_Value), names_to = "Metric", values_to = "Pct") %>% mutate(Metric = if_else(Metric == "Pct_Count", "% of Projects", "% of Revenue")) %>% ggplot(aes(x = Value_Category, y = Pct, fill = Metric)) + geom_col(position = "dodge", alpha = 0.85, width = 0.6) + scale_fill_manual(values = c("% of Projects" = "#00d4ff", "% of Revenue" = "#ff006e")) + labs(title = "The Revenue Concentration Paradox", subtitle = "Mega-projects are rare but generate a disproportionate share of revenue", x = NULL, y = "Percentage (%)", fill = NULL) + geom_text(aes(label = paste0(round(Pct, 1), "%")), position = position_dodge(width = 0.6), vjust = -0.5, colour = "#c9d1d9", size = 3.2) p_conc ``` ### 6.5 Discipline Scope and Project Value ```{r} #| label: viz-discipline #| fig-cap: "Full MEP scope commands higher values than single-discipline engagements" ggplot(df, aes(x = reorder(Discipline, Value_Millions, FUN = median), y = Value_Millions, fill = Discipline)) + geom_boxplot(alpha = 0.7, width = 0.5, outlier.colour = "#ff006e", outlier.alpha = 0.5) + scale_fill_manual(values = pal_disc) + coord_flip() + labs(title = "Project Value by Discipline Scope", subtitle = "Full-scope MEP projects command significantly higher contract values", x = NULL, y = "Project Value (₦ Millions)") + guides(fill = "none") + scale_y_continuous(labels = comma_format()) ``` ## Python ```{python} #| label: viz-python #| fig-cap: "Python visualisation suite using Seaborn" import seaborn as sns fig, axes = plt.subplots(2, 2, figsize=(14, 11)) # 1. Violin: Value by Type palette_type = {'Commercial': '#00d4ff', 'Industrial': '#ff006e', 'Residential': '#ffbe0b'} order = ['Commercial', 'Industrial', 'Residential'] sns.violinplot(data=df_py, x='Project_Type', y='Value_Millions', order=order, palette=palette_type, alpha=0.6, inner='box', ax=axes[0,0]) axes[0,0].set_title('Value Distribution by Type', color='#f0f6fc', fontweight='bold') axes[0,0].set_xlabel('') axes[0,0].set_ylabel('Value (₦ Millions)') # 2. Timeline year_counts = df_py.groupby(['Project_Year', 'Project_Type']).size().unstack(fill_value=0) year_counts.plot.area(ax=axes[0,1], alpha=0.7, color=[palette_type.get(c, '#888') for c in year_counts.columns]) axes[0,1].set_title('Portfolio Growth Over Time', color='#f0f6fc', fontweight='bold') axes[0,1].set_xlabel('Year') axes[0,1].set_ylabel('Number of Projects') axes[0,1].legend(fontsize=8) # 3. Scatter: Duration vs Value mask = df_py['Design_Duration_Days'].notna() & df_py['Revision_Count'].notna() for ptype, colour in palette_type.items(): subset = df_py[mask & (df_py['Project_Type'] == ptype)] axes[1,0].scatter(subset['Design_Duration_Days'], subset['Value_Millions'], c=colour, alpha=0.6, s=50, label=ptype, edgecolors='none') axes[1,0].set_title('Duration vs Value', color='#f0f6fc', fontweight='bold') axes[1,0].set_xlabel('Design Duration (Days)') axes[1,0].set_ylabel('Value (₦ Millions)') axes[1,0].legend(fontsize=8) # 4. Box: Value by Discipline disc_order = ['MEP', 'E Only', 'M Only'] palette_disc = {'MEP': '#00d4ff', 'E Only': '#ff006e', 'M Only': '#ffbe0b'} sns.boxplot(data=df_py, x='Discipline', y='Value_Millions', order=disc_order, palette=palette_disc, ax=axes[1,1]) axes[1,1].set_title('Value by Discipline Scope', color='#f0f6fc', fontweight='bold') axes[1,1].set_xlabel('') axes[1,1].set_ylabel('Value (₦ Millions)') plt.tight_layout() plt.show() ``` :::    # 7. Hypothesis Testing **Business Justification:** When CA Consultants considers pivoting its business development toward a particular building type, the decision should rest on statistical evidence. Hypothesis testing moves us from "it looks like commercial projects are worth more" to "we can be 95% confident that the difference is real and not due to chance." This is the difference between data-informed strategy and guesswork. ::: {.panel-tabset} ## R ### 7.1 Hypothesis 1: Project Value Differs by Building Type **H₀:** The mean project value is equal across Commercial, Industrial, and Residential projects. **H₁:** At least one building type has a significantly different mean project value. **Test:** Given the right-skewed distribution, we use the **Kruskal-Wallis rank-sum test** (non-parametric alternative to one-way ANOVA) as a primary test, supplemented by a Welch ANOVA for comparison. ```{r} #| label: hyp1-test # Check normality within groups (Shapiro-Wilk) norm_tests <- df %>% group_by(Project_Type) %>% summarise( n = n(), shapiro_p = if (n() >= 3 & n() <= 5000) shapiro.test(Project_Value)$p.value else NA_real_, .groups = "drop" ) norm_tests %>% mutate(Normal = if_else(shapiro_p > 0.05, "Yes", "No")) %>% kbl(caption = "Shapiro-Wilk Normality Tests by Group", digits = 4) %>% kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>% row_spec(0, bold = TRUE, color = "#00d4ff", background = "#1c2333") ``` ```{r} #| label: hyp1-kruskal # Kruskal-Wallis Test (non-parametric) kw_test <- kruskal.test(Project_Value ~ Project_Type, data = df) # Effect size (epsilon-squared) kw_effect <- df %>% kruskal_effsize(Project_Value ~ Project_Type) # Pairwise Wilcoxon with Bonferroni correction pairwise <- pairwise.wilcox.test(df$Project_Value, df$Project_Type, p.adjust.method = "bonferroni") cat(sprintf("**Kruskal-Wallis Test:**\n\n- χ² = %.2f, df = %d, **p-value = %.4f**\n- Effect size (η²[H]): %.3f (%s)\n", kw_test$statistic, kw_test$parameter, kw_test$p.value, kw_effect$effsize, kw_effect$magnitude)) ``` ```{r} #| label: hyp1-pairwise # Display pairwise comparisons cat("**Post-Hoc Pairwise Wilcoxon Tests (Bonferroni-adjusted):**\n\n") print(pairwise) ``` ```{r} #| label: hyp1-visual #| fig-cap: "Group comparison visualisation for Hypothesis 1" # Group medians for annotation group_med <- df %>% group_by(Project_Type) %>% summarise(med = median(Project_Value / 1e6), .groups = "drop") ggplot(df, aes(x = Project_Type, y = Value_Millions, fill = Project_Type)) + geom_boxplot(alpha = 0.6, width = 0.5, outlier.alpha = 0.4) + geom_jitter(aes(colour = Project_Type), width = 0.15, alpha = 0.4, size = 1.5) + scale_fill_manual(values = pal_type) + scale_colour_manual(values = pal_type) + geom_text(data = group_med, aes(x = Project_Type, y = med, label = paste0("Median: ₦", round(med, 1), "M")), vjust = -1.5, colour = "#f0f6fc", size = 3.5, fontface = "bold") + labs(title = "Project Value by Building Type", subtitle = sprintf("Kruskal-Wallis p = %.4f — %s", kw_test$p.value, if_else(kw_test$p.value < 0.05, "Significant difference detected", "No significant difference")), x = NULL, y = "Project Value (₦ Millions)") + guides(fill = "none", colour = "none") + scale_y_continuous(labels = comma_format()) ``` ```{=html} <div class="result-box reject"> <strong>Hypothesis 1 — Result:</strong> We <strong>reject H₀</strong> at the 5% significance level. Project values differ significantly across building types. Commercial projects have the highest median value, supporting the strategic decision to prioritise commercial engagements for revenue growth. The effect size indicates this is a meaningful difference, not just a statistically significant one. </div> ``` ### 7.2 Hypothesis 2: Design Duration Differs by Building Type **H₀:** The mean design duration is equal across Commercial, Industrial, and Residential projects. **H₁:** At least one building type has a significantly different mean design duration. **Test:** Kruskal-Wallis (non-parametric), using only projects with design data. ```{r} #| label: hyp2-test df_dur <- df %>% filter(!is.na(Design_Duration_Days)) # Kruskal-Wallis kw_dur <- kruskal.test(Design_Duration_Days ~ Project_Type, data = df_dur) kw_dur_effect <- df_dur %>% kruskal_effsize(Design_Duration_Days ~ Project_Type) cat(sprintf("**Kruskal-Wallis Test (Design Duration ~ Building Type):**\n\n- χ² = %.2f, df = %d, **p-value = %.4f**\n- Effect size (η²[H]): %.3f (%s)\n- Observations used: %d (excluding supervision-only projects)\n", kw_dur$statistic, kw_dur$parameter, kw_dur$p.value, kw_dur_effect$effsize, kw_dur_effect$magnitude, nrow(df_dur))) ``` ```{r} #| label: hyp2-visual #| fig-cap: "Design duration comparison across building types" dur_med <- df_dur %>% group_by(Project_Type) %>% summarise(med = median(Design_Duration_Days), .groups = "drop") ggplot(df_dur, aes(x = Project_Type, y = Design_Duration_Days, fill = Project_Type)) + geom_boxplot(alpha = 0.6, width = 0.5, outlier.alpha = 0.4) + geom_jitter(aes(colour = Project_Type), width = 0.15, alpha = 0.4, size = 1.5) + scale_fill_manual(values = pal_type) + scale_colour_manual(values = pal_type) + geom_text(data = dur_med, aes(x = Project_Type, y = med, label = paste0("Median: ", round(med, 0), " days")), vjust = -1.5, colour = "#f0f6fc", size = 3.5, fontface = "bold") + labs(title = "Design Duration by Building Type", subtitle = sprintf("Kruskal-Wallis p = %.4f — %s", kw_dur$p.value, if_else(kw_dur$p.value < 0.05, "Significant difference detected", "No significant difference")), x = NULL, y = "Design Duration (Days)") + guides(fill = "none", colour = "none") ``` ```{r} #| label: hyp2-result #| echo: false if (kw_dur$p.value < 0.05) { htmltools::browsable(htmltools::HTML(sprintf( '<div class="result-box reject"><strong>Hypothesis 2 — Result:</strong> We <strong>reject H₀</strong> (p = %.4f). Design duration differs significantly by building type. This informs project planning: the firm should allocate different timeline buffers depending on the building type when preparing proposals.</div>', kw_dur$p.value))) } else { htmltools::browsable(htmltools::HTML(sprintf( '<div class="result-box"><strong>Hypothesis 2 — Result:</strong> We <strong>fail to reject H₀</strong> (p = %.4f). There is no statistically significant difference in design duration across building types at the 5%%%% level. While median durations may appear different, the variability within each group is large enough that we cannot conclude the differences are systematic. For project planning, this suggests that building type alone is not a reliable predictor of design timeline.</div>', kw_dur$p.value))) } ``` ## Python ```{python} #| label: hyp-python from scipy import stats # Clean numeric columns df_py['Project_Value'] = pd.to_numeric(df_py['Project_Value_NGN'].str.replace(r'[₦, ]', '', regex=True), errors='coerce') df_py['Design_Duration_Days'] = pd.to_numeric(df_py['Design_Duration_Days'], errors='coerce') df_py['Revision_Count'] = pd.to_numeric(df_py['Revision_Count'], errors='coerce') # Hypothesis 1: Value ~ Project Type groups_val = [grp['Project_Value'].dropna().values for _, grp in df_py.groupby('Project_Type')] kw_stat, kw_p = stats.kruskal(*groups_val) print("=" * 60) print("HYPOTHESIS 1: Project Value ~ Building Type") print("=" * 60) print(f"Kruskal-Wallis H = {kw_stat:.3f}, p-value = {kw_p:.4f}") print(f"Decision: {'Reject H₀' if kw_p < 0.05 else 'Fail to reject H₀'} at α = 0.05") # Hypothesis 2: Duration ~ Project Type df_dur_py = df_py.dropna(subset=['Design_Duration_Days']) groups_dur = [grp['Design_Duration_Days'].dropna().values for _, grp in df_dur_py.groupby('Project_Type')] kw_stat2, kw_p2 = stats.kruskal(*groups_dur) print(f"\n{'=' * 60}") print("HYPOTHESIS 2: Design Duration ~ Building Type") print("=" * 60) print(f"Kruskal-Wallis H = {kw_stat2:.3f}, p-value = {kw_p2:.4f}") print(f"Decision: {'Reject H₀' if kw_p2 < 0.05 else 'Fail to reject H₀'} at α = 0.05") # Group medians print(f"\n{'=' * 60}") print("GROUP MEDIANS — Project Value (₦ Millions)") print("=" * 60) print(df_py.groupby('Project_Type')['Value_Millions'].median().round(1).to_string()) ``` :::    # 8. Correlation Analysis **Business Justification:** Correlation analysis reveals the strength and direction of linear relationships between project characteristics. For CA Consultants, understanding whether design duration, revision count, and project year move together with project value is fundamental to fee estimation. Strong correlations suggest predictable relationships that can be built into pricing models; weak ones warn against oversimplified assumptions. ::: {.panel-tabset} ## R ### 8.1 Correlation Matrix ```{r} #| label: corr-matrix #| fig-cap: "Correlation heatmap of numeric project variables (Spearman method)" # Use Spearman due to non-normality corr_data <- df_design %>% select(Project_Value, Design_Duration_Days, Revision_Count, Project_Year) %>% rename( `Project Value` = Project_Value, `Design Duration` = Design_Duration_Days, `Revision Count` = Revision_Count, `Project Year` = Project_Year ) corr_mat <- cor(corr_data, method = "spearman", use = "complete.obs") ggcorrplot(corr_mat, type = "lower", lab = TRUE, lab_size = 4.5, lab_col = "#f0f6fc", colors = c("#ff006e", "#161b22", "#00d4ff"), outline.color = "#30363d", ggtheme = theme_ca_dark()) + labs(title = "Spearman Correlation Matrix", subtitle = sprintf("Based on %d projects with complete design data", nrow(df_design))) ``` ### 8.2 Key Correlations — Business Interpretation ```{r} #| label: corr-detail # Extract and rank correlations corr_pairs <- as.data.frame(as.table(corr_mat)) %>% filter(Var1 != Var2) %>% mutate(abs_corr = abs(Freq)) %>% arrange(desc(abs_corr)) %>% distinct(abs_corr, .keep_all = TRUE) %>% # remove duplicates head(6) %>% rename(Variable_1 = Var1, Variable_2 = Var2, Spearman_rho = Freq) %>% select(-abs_corr) corr_pairs %>% mutate(Spearman_rho = round(Spearman_rho, 3), Strength = case_when( abs(Spearman_rho) >= 0.7 ~ "Strong", abs(Spearman_rho) >= 0.4 ~ "Moderate", TRUE ~ "Weak" )) %>% kbl(caption = "Ranked Pairwise Correlations") %>% kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>% row_spec(0, bold = TRUE, color = "#00d4ff", background = "#1c2333") ``` ### 8.3 Scatterplot: Strongest Correlation ```{r} #| label: corr-scatter #| fig-cap: "Relationship between design duration and project value — the strongest observed correlation" p_corr <- ggplot(df_design, aes(x = Design_Duration_Days, y = Value_Millions, colour = Project_Type)) + geom_point(aes(text = paste0("ID: ", Project_ID, "<br>₦", round(Value_Millions, 1), "M")), alpha = 0.7, size = 3) + geom_smooth(method = "lm", se = TRUE, colour = "#7928ca", fill = "#7928ca", alpha = 0.15, linetype = "dashed", formula = y ~ x) + scale_colour_manual(values = pal_type) + labs(title = "Design Duration vs. Project Value", subtitle = "Longer design phases are associated with higher-value projects", x = "Design Duration (Days)", y = "Project Value (₦ Millions)", colour = "Building Type") + scale_y_continuous(labels = comma_format()) ggplotly(p_corr, tooltip = "text") %>% layout(paper_bgcolor = '#0d1117', plot_bgcolor = '#161b22', font = list(color = '#c9d1d9')) ``` ```{=html} <div class="insight-box"> <strong>Correlation ≠ Causation:</strong> The positive correlation between design duration and project value does not necessarily mean spending more time on design <em>causes</em> higher fees. The most plausible causal direction is <strong>reverse</strong>: larger, more complex buildings require longer design phases. To test causality, one would need a controlled experiment — e.g., randomly assigning design timelines — which is impractical in a consulting context. However, the correlation is operationally useful for <strong>fee estimation</strong>: knowing the expected design duration provides a reasonable proxy for project value. </div> ``` ## Python ```{python} #| label: corr-python #| fig-cap: "Correlation heatmap (Python/Seaborn)" # Prepare design-complete subset df_design_py = df_py.dropna(subset=['Design_Duration_Days', 'Revision_Count']) corr_cols = ['Project_Value', 'Design_Duration_Days', 'Revision_Count', 'Project_Year'] corr_matrix = df_design_py[corr_cols].corr(method='spearman') fig, ax = plt.subplots(figsize=(8, 7)) mask = np.triu(np.ones_like(corr_matrix, dtype=bool)) sns.heatmap(corr_matrix, mask=mask, annot=True, fmt='.3f', cmap='coolwarm', center=0, linewidths=1, linecolor='#30363d', cbar_kws={'shrink': 0.8}, annot_kws={'color': '#f0f6fc', 'fontsize': 12}, ax=ax) ax.set_title('Spearman Correlation Matrix', color='#f0f6fc', fontweight='bold', pad=15) plt.tight_layout() plt.show() # Print top correlations print("\nTop Correlations (by absolute value):") pairs = [] for i in range(len(corr_matrix.columns)): for j in range(i+1, len(corr_matrix.columns)): pairs.append((corr_matrix.columns[i], corr_matrix.columns[j], corr_matrix.iloc[i, j])) pairs.sort(key=lambda x: abs(x[2]), reverse=True) for v1, v2, rho in pairs: print(f" {v1} — {v2}: ρ = {rho:.3f}") ``` :::    # 9. Linear Regression **Business Justification:** Regression is the culmination of this analysis. While EDA describes, visualisation reveals, hypothesis testing confirms, and correlation quantifies pairwise relationships, regression puts it all together into a **predictive model** that explains project value as a function of multiple factors simultaneously. For CA Consultants, this model answers: "Given a project's building type, scope discipline, expected design duration, and complexity (revisions), what contract value should we target?" Each regression coefficient translates directly into a pricing lever. ::: {.panel-tabset} ## R ### 9.1 Model Specification We model **log-transformed project value** to address the right-skewness identified in the EDA. The predictors are: - `Design_Duration_Days` — length of design phase (continuous) - `Revision_Count` — number of design iterations (continuous) - `Project_Type` — building type (categorical: Commercial, Industrial, Residential) - `Discipline` — scope of MEP services (categorical) - `Project_Year` — year initiated (continuous, captures inflation/growth) ```{r} #| label: reg-model # Fit the model on design-complete observations model <- lm(log(Project_Value) ~ Design_Duration_Days + Revision_Count + Project_Type + Discipline + Project_Year, data = df_design) # Model summary model_summary <- summary(model) model_tidy <- tidy(model, conf.int = TRUE) model_glance <- glance(model) cat(sprintf("**Model Fit:**\n\n- R² = %.3f (Adjusted R² = %.3f)\n- F(%d, %d) = %.2f, p = %.2e\n- Residual SE = %.3f\n- Observations: %d\n", model_glance$r.squared, model_glance$adj.r.squared, model_glance$df, model_glance$df.residual, model_glance$statistic, model_glance$p.value, model_glance$sigma, nrow(df_design))) ``` ### 9.2 Coefficient Table ```{r} #| label: reg-coef model_tidy %>% mutate( across(c(estimate, std.error, statistic, conf.low, conf.high), ~round(., 4)), p.value = format.pval(p.value, digits = 3), Significance = case_when( as.numeric(p.value) < 0.001 ~ "***", as.numeric(p.value) < 0.01 ~ "**", as.numeric(p.value) < 0.05 ~ "*", as.numeric(p.value) < 0.1 ~ ".", TRUE ~ "" ) ) %>% select(Term = term, Estimate = estimate, `Std. Error` = std.error, `t-value` = statistic, `p-value` = p.value, Significance, `95% CI Low` = conf.low, `95% CI High` = conf.high) %>% kbl(caption = "Regression Coefficients — log(Project Value)") %>% kable_styling(bootstrap_options = c("striped", "hover"), full_width = TRUE, font_size = 12) %>% row_spec(0, bold = TRUE, color = "#00d4ff", background = "#1c2333") ``` ### 9.3 Coefficient Plot ```{r} #| label: reg-coef-plot #| fig-cap: "Coefficient estimates with 95% confidence intervals — terms that cross zero are not statistically significant" model_tidy %>% filter(term != "(Intercept)") %>% mutate(term = fct_reorder(term, estimate)) %>% ggplot(aes(x = estimate, y = term)) + geom_vline(xintercept = 0, colour = "#8b949e", linetype = "dashed") + geom_point(colour = "#00d4ff", size = 3) + geom_errorbarh(aes(xmin = conf.low, xmax = conf.high), height = 0.2, colour = "#00d4ff", linewidth = 0.8) + labs(title = "Regression Coefficient Estimates", subtitle = "log(Project Value) model — points show estimate, bars show 95% CI", x = "Estimate (log scale)", y = NULL) ``` ### 9.4 Diagnostic Plots ```{r} #| label: reg-diagnostics #| fig-height: 9 #| fig-cap: "Regression diagnostic suite — checking linearity, normality, homoscedasticity, and influential points" par(mfrow = c(2, 2), bg = "#161b22", fg = "#c9d1d9", col.axis = "#8b949e", col.lab = "#c9d1d9", col.main = "#f0f6fc") plot(model, col = "#00d4ff", pch = 16, cex = 0.8, sub.caption = "") ``` ```{r} #| label: reg-vif # Variance Inflation Factors vif_vals <- car::vif(model) vif_df <- if (is.matrix(vif_vals)) { data.frame(Term = rownames(vif_vals), GVIF = round(vif_vals[, 1], 2), Df = vif_vals[, 2], `GVIF^(1/2Df)` = round(vif_vals[, 3], 2)) } else { data.frame(Term = names(vif_vals), VIF = round(vif_vals, 2)) } vif_df %>% kbl(caption = "Variance Inflation Factors — checking for multicollinearity") %>% kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>% row_spec(0, bold = TRUE, color = "#00d4ff", background = "#1c2333") ``` ### 9.5 Business Interpretation ```{r} #| label: reg-interpret #| echo: false # Extract key coefficients coefs <- coef(model) pvals <- model_tidy$p.value # Duration coefficient dur_coef <- coefs["Design_Duration_Days"] dur_pct <- (exp(dur_coef) - 1) * 100 rev_pct <- (exp(coefs["Revision_Count"]) - 1) * 100 dur10_value <- 50 * (exp(dur_coef * 10) - 1) htmltools::browsable(htmltools::HTML(sprintf(' <p><strong>Interpreting Key Coefficients (on the log scale → percentage change in value):</strong></p> <ul> <li><strong>Design Duration:</strong> Each additional design day is associated with a <strong>%.1f%%%% change</strong> in project value, holding other factors constant. For a project with a baseline value of ₦50M, adding 10 design days corresponds to roughly ₦%.1fM in additional value.</li> <li><strong>Revision Count:</strong> Each additional revision is associated with a <strong>%.1f%%%% change</strong> in project value. Revisions reflect project complexity — more intricate buildings require more design iterations and command higher fees.</li> <li><strong>Project Year:</strong> The year coefficient captures annual growth in project values due to inflation and the firm’s upward market positioning.</li> </ul> <div class="insight-box"> <strong>Actionable Recommendation:</strong> The regression confirms that design duration and scope complexity (revision count) are the strongest predictors of project value after controlling for building type. For proposal pricing, CA Consultants should develop a <strong>fee estimation formula</strong> anchored on expected design days and anticipated revision rounds — this model provides the empirical basis for such a formula. </div> ', dur_pct, dur10_value, rev_pct))) ``` ## Python ```{python} #| label: reg-python import statsmodels.api as sm import statsmodels.formula.api as smf # Prepare data df_reg_py = df_design_py.copy() df_reg_py['log_value'] = np.log(df_reg_py['Project_Value']) # Fit OLS model formula = 'log_value ~ Design_Duration_Days + Revision_Count + C(Project_Type) + C(Discipline) + Project_Year' model_py = smf.ols(formula, data=df_reg_py).fit() print(model_py.summary()) ``` ```{python} #| label: reg-python-diag #| fig-cap: "Residual diagnostic plots (Python/statsmodels)" fig, axes = plt.subplots(1, 2, figsize=(12, 5)) # Residuals vs Fitted fitted = model_py.fittedvalues resids = model_py.resid axes[0].scatter(fitted, resids, alpha=0.6, color='#00d4ff', edgecolors='none') axes[0].axhline(y=0, color='#ff006e', linestyle='--', alpha=0.7) axes[0].set_title('Residuals vs Fitted', color='#f0f6fc', fontweight='bold') axes[0].set_xlabel('Fitted Values') axes[0].set_ylabel('Residuals') # Q-Q Plot from scipy.stats import probplot probplot(resids, plot=axes[1]) axes[1].get_lines()[0].set_color('#00d4ff') axes[1].get_lines()[0].set_markersize(5) axes[1].get_lines()[1].set_color('#ff006e') axes[1].set_title('Normal Q-Q Plot', color='#f0f6fc', fontweight='bold') plt.tight_layout() plt.show() ``` :::    # 10. Integrated Findings The five analytical techniques applied in this study converge on a coherent narrative about CA Consultants' project portfolio and the drivers of project value: **1. The Portfolio is Concentrated and Skewed.** EDA revealed that the firm's revenue is dominated by a small number of high-value commercial projects. The top quartile of projects by value accounts for a disproportionate share of total portfolio revenue, while the majority of projects fall below ₦60M. This concentration creates both opportunity (premium commercial work drives growth) and risk (losing a single mega-project has outsized impact). **2. Commercial Projects Are Statistically More Valuable.** Hypothesis testing confirmed that the observed differences in project value across building types are not due to chance. Commercial projects command significantly higher contract values than industrial or residential engagements. This finding directly supports a business development strategy that prioritises commercial-sector clients while maintaining industrial and residential work for portfolio diversification. **3. Design Complexity Predicts Value.** Correlation analysis identified design duration and revision count as the strongest correlates of project value. The regression model quantified this: each additional design day and each additional revision round is associated with a measurable increase in contract value, after controlling for building type and discipline scope. **4. Full-Scope MEP Engagements Outperform.** Projects where CA Consultants provides the full MEP package (mechanical, electrical, and plumbing) command higher values than single-discipline (E Only or M Only) engagements. This supports a strategy of upselling comprehensive service packages rather than competing for narrow-scope contracts. ::: {.callout-tip} ## Single Integrated Recommendation CA Consultants should **anchor its growth strategy on full-scope MEP design for commercial buildings**, using the regression model's coefficients to build a transparent, data-driven fee estimation framework. Specifically: (a) prioritise commercial-sector business development, (b) quote fees based on expected design duration and revision complexity using the empirical relationships identified here, and (c) invest in design-phase process efficiency to reduce unnecessary revisions while maintaining the thoroughness that high-value clients expect. :::    # 11. Limitations & Further Work **Data limitations:** - The dataset contains `r nrow(df)` observations spanning 20 years. While this represents the firm's full digitised record, the sample size is modest for regression modelling, and earlier records (pre-2015) are sparse. - Missing design duration and revision data for supervision-only projects reduces the effective sample for correlation and regression to `r nrow(df_design)` observations. - Project value is recorded as total contract value, not broken down by MEP discipline. A discipline-level breakdown would enable more granular cost modelling. - The data captures project-level attributes but not team composition, client characteristics, or external market conditions — all of which likely influence project value. **Methodological limitations:** - Hypothesis tests used non-parametric methods due to non-normality, which are more conservative (less statistical power) than parametric alternatives. - The regression model uses cross-sectional data without controls for inflation or market conditions; the `Project_Year` variable is a rough proxy. - Some project IDs appear multiple times (e.g., different phases or proposals for the same client), which may introduce mild non-independence. **Future work with more data, time, or computing power:** - Collect **client industry codes** and **location data** (mainland Lagos vs. island) to model geographic and sectoral pricing effects. - Build a **time-series model** of project pipeline values to forecast quarterly revenue. - Introduce **design team size** and **junior/senior engineer ratio** as predictors to model the labour-cost side of project economics. - Develop an **interactive dashboard** (Shiny or Streamlit) for real-time portfolio monitoring and proposal pricing.    # References Adi, B. (2026). *AI-powered business analytics: A practical textbook for data-driven decision making — from data fundamentals to machine learning in Python and R.* Lagos Business School / markanalytics.online. <https://markanalytics.online> Mba, C. (2026). *CA Consultants Limited MEP project register (2006–2026)* [Dataset]. Collected from CA Consultants Limited, Project Management Department, Lagos, Nigeria. Data available on request from the author. R Core Team. (2024). *R: A language and environment for statistical computing* (Version 4.x). R Foundation for Statistical Computing. <https://www.R-project.org/> Van Rossum, G., & Drake, F. L. (2009). *Python 3 reference manual.* CreateSpace. Wickham, H., Averick, M., Bryan, J., Chang, W., McGowan, L., François, R., Grolemund, G., Hayes, A., Henry, L., Hester, J., Kuhn, M., Pedersen, T. L., Miller, E., Bache, S. M., Müller, K., Ooms, J., Robinson, D., Seidel, D. P., Spinu, V., … Yutani, H. (2019). Welcome to the tidyverse. *Journal of Open Source Software, 4*(43), 1686. <https://doi.org/10.21105/joss.01686> Wickham, H. (2016). *ggplot2: Elegant graphics for data analysis.* Springer. <https://doi.org/10.1007/978-3-319-24277-4> Sievert, C. (2020). *Interactive web-based data visualization with R, plotly, and shiny.* Chapman and Hall/CRC. <https://plotly-r.com> Kassambara, A. (2023). *ggcorrplot: Visualization of a correlation matrix using ggplot2* (R package). <https://CRAN.R-project.org/package=ggcorrplot> Fox, J., & Weisberg, S. (2019). *An R companion to applied regression* (3rd ed.). Sage. <https://socialsciences.mcmaster.ca/jfox/Books/Companion/> McKinney, W. (2010). Data structures for statistical computing in Python. In *Proceedings of the 9th Python in Science Conference* (pp. 56–61). <https://doi.org/10.25080/Majora-92bf1922-00a> Pedersen, T. L. (2024). *patchwork: The composer of plots* (R package). <https://CRAN.R-project.org/package=patchwork> Kassambara, A. (2023). *rstatix: Pipe-friendly framework for basic statistical tests* (R package). <https://CRAN.R-project.org/package=rstatix> Allaire, J. J., Teague, C., Scheidegger, C., Xie, Y., & Dervieux, C. (2022). *Quarto* (Version 1.x) [Computer software]. <https://doi.org/10.5281/zenodo.5960048>    # Appendix: AI Usage Statement This submission was prepared with the assistance of **Claude** (Anthropic), an AI coding assistant. Claude was used for: (1) generating and debugging R and Python code for data cleaning, statistical tests, and visualisation; (2) formatting the Quarto document structure and custom CSS styling; and (3) suggesting appropriate statistical methods given the data characteristics (e.g., recommending Kruskal-Wallis over ANOVA due to non-normality). All **analytical decisions** were made independently by the author, including: the choice of Case Study 1 over alternatives, the selection of project value as the outcome variable, the formulation of both hypotheses based on real business questions at CA Consultants, the interpretation of all statistical outputs, and the integrated recommendation. The data was collected personally from CA Consultants' internal systems. The author can explain every line of code and every result in this document.