Building By The Numbers

What 101 Projects Reveal About Design, Complexity, and Revenue at CA Consultants Limited

Author

Chidubem Mba — Senior Project Engineer & Project Manager

Published

May 25, 2026

1. Executive Summary

This study analyses the project portfolio of CA Consultants Limited, a leading MEP (Mechanical, Electrical, and Plumbing) consulting firm headquartered in Lagos, Nigeria. Drawing on 100 real project records spanning 2006–2026, extracted from the firm’s internal project management systems, I investigate what drives project value across different building types, service disciplines, and design complexity levels. Five foundational analytical techniques — exploratory data analysis, data visualisation, hypothesis testing, correlation analysis, and multiple linear regression — are applied to move from exploration to inference. Key findings reveal that the portfolio is heavily right-skewed, with a small number of high-value projects contributing disproportionately to total revenue. Commercial projects command significantly higher values than residential and industrial engagements, and design duration together with revision count are meaningful predictors of project value. The analysis yields an actionable pricing and resource-allocation framework: longer-duration, multi-discipline MEP projects consistently deliver greater value, and the firm should prioritise commercial engagements while investing in design-phase efficiency to reduce costly revision cycles.

2. Professional Disclosure

Name: Chidubem Mba Title: Senior Project Engineer & Project Manager Organisation: CA Consultants Limited, Lagos, Nigeria Sector: Construction / Building Services (MEP Engineering)

CA Consultants Limited is a Mechanical, Electrical, and Plumbing (MEP) consulting firm operating in the Nigerian construction industry. The firm provides design and construction supervision services for commercial, industrial, and residential buildings — from office towers and factories to luxury residences. As a Senior Project Engineer and Project Manager, I am responsible for managing design deliverables, coordinating with architects and contractors, overseeing project timelines, and ensuring quality of MEP installations.

Technique Justification

Exploratory Data Analysis (EDA): Understanding the composition and quality of our project data is the essential first step. As a project manager, I need to know the distribution of project sizes, identify data gaps (such as supervision-only projects lacking design records), and detect anomalous entries before any strategic conclusions can be drawn. EDA allows me to assess the health of our data and uncover patterns that inform all subsequent analyses.

Data Visualisation: Communicating project performance to firm leadership and clients requires clear, compelling visual narratives. Visualisation transforms raw project records into actionable intelligence — showing at a glance which sectors drive revenue, how portfolio composition has shifted over time, and where design complexity concentrates. Every plot in this study answers a question a principal at CA Consultants would ask in a management meeting.

Hypothesis Testing: Resource allocation decisions at CA Consultants — such as whether to pursue more commercial versus industrial work — must be grounded in statistical evidence, not intuition. Formal hypothesis testing allows me to determine whether observed differences in project value across building types are statistically significant or merely due to sampling variability. This directly informs our business development strategy.

Correlation Analysis: Understanding the relationships between project characteristics — value, design duration, revision count, and scope — is critical for project planning. If design duration correlates strongly with project value, that relationship shapes how we quote fees and allocate engineering hours. Identifying which correlations are strongest, and which are spurious, helps avoid costly planning errors.

Linear Regression: The ultimate goal is a quantitative model that explains and predicts project value based on measurable project attributes. A regression model allows me to quantify the marginal contribution of each extra design day, each additional revision, and each project type to the contract value. This directly supports fee-setting, proposal evaluation, and capacity planning at CA Consultants.

3. Data Collection & Sampling

Source and Collection Method

The dataset was extracted directly from the internal project register of CA Consultants Limited. This register is maintained by the project management team (of which I am a member) and records every project the firm undertakes — from initial proposal through to completion. Data extraction was performed manually from the company’s enterprise filing system and project tracking spreadsheets in April 2026.

Sampling Frame and Size

The dataset constitutes a census of all identifiable projects in the firm’s records from 2006 to 2026, comprising 100 project records. This is not a sample but a near-complete enumeration of the firm’s project history, subject to the availability of records in the company’s systems. Projects prior to 2006 were excluded due to incomplete digitisation of older records.

Variables Collected

Variable	Type	Description
`Project_ID`	Identifier	Unique project code assigned by the firm
`Project_Type`	Categorical	Building type: Commercial, Industrial, or Residential
`Project_Status`	Categorical	Current status: Completed, Ongoing, or Proposal
`Project_Value_NGN`	Numeric (continuous)	Contract value in Nigerian Naira (₦)
`Design_Duration_Days`	Numeric (continuous)	Duration of the design phase in calendar days
`Revision_Count`	Numeric (discrete)	Number of design revisions issued
`Project_Year`	Date/Time (year)	Year the project was initiated
`Discipline`	Categorical	Scope of MEP services: MEP (full), E Only, M Only

Handling of Missing Data

A total of 8 observations are missing Design_Duration_Days and 14 observations are missing Revision_Count. These are structurally missing — not data-entry errors. These projects were construction supervision engagements only: CA Consultants supervised the on-site installation of MEP systems designed by another firm. Since no design work was performed in-house, no design duration or revision count exists. This is documented and handled transparently throughout the analysis — these records are retained for portfolio-level analyses (value, type, year) and excluded only from design-specific analyses (correlation, regression).

Ethical Statement

This data was collected in my capacity as a Senior Project Engineer with authorised access to the firm’s project records. All project identifiers use internal codes (e.g., CA658, CA042/PJ027) that do not reveal client names. Financial values are contract-level figures, not client-confidential breakdowns. No personally identifiable information (PII) is included. Permission to use anonymised project data for academic purposes was obtained from CA Consultants management.

4. Data Description

Download Dataset (CSV)

4.1 Summary Statistics

View Code

df %>%
  select(Project_Value, Design_Duration_Days, Revision_Count, Project_Year) %>%
  pivot_longer(everything(), names_to = "Variable", values_to = "Value") %>%
  group_by(Variable) %>%
  summarise(
    N       = sum(!is.na(Value)),
    Missing = sum(is.na(Value)),
    Mean    = mean(Value, na.rm = TRUE),
    Median  = median(Value, na.rm = TRUE),
    SD      = sd(Value, na.rm = TRUE),
    Min     = min(Value, na.rm = TRUE),
    Max     = max(Value, na.rm = TRUE),
    Skewness = skewness(Value, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  mutate(across(where(is.numeric), ~round(., 2))) %>%
  kbl(caption = "Descriptive Statistics for Numeric Variables",
      format.args = list(big.mark = ",")) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = TRUE, font_size = 13) %>%
  row_spec(0, bold = TRUE, color = "#00d4ff", background = "#1c2333")

Descriptive Statistics for Numeric Variables
Variable	N	Missing	Mean	Median	SD	Min	Max	Skewness
Design_Duration_Days	92	8	65.68	40	89.63	2	600	3.44
Project_Value	100	0	65,225,972.06	39,816,154	75,188,594.11	258,557	360,000,000	1.97
Project_Year	100	0	2,021.34	2,023	3.72	2,006	2,026	-1.26
Revision_Count	86	14	3.67	4	1.77	1	6	-0.22

4.2 Categorical Variable Distribution

View Code

cat_summary <- bind_rows(
  df %>% count(Project_Type) %>% rename(Category = Project_Type) %>% mutate(Variable = "Project_Type"),
  df %>% count(Project_Status) %>% rename(Category = Project_Status) %>% mutate(Variable = "Project_Status"),
  df %>% count(Discipline) %>% rename(Category = Discipline) %>% mutate(Variable = "Discipline")
) %>%
  group_by(Variable) %>%
  mutate(Pct = round(100 * n / sum(n), 1)) %>%
  ungroup() %>%
  select(Variable, Category, Count = n, `%` = Pct)

cat_summary %>%
  kbl(caption = "Frequency Distribution of Categorical Variables") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = TRUE, font_size = 13) %>%
  row_spec(0, bold = TRUE, color = "#00d4ff", background = "#1c2333") %>%
  pack_rows(index = table(cat_summary$Variable))

Frequency Distribution of Categorical Variables
Variable	Category	Count	%
Discipline
Project_Type	Commercial	64	64
Project_Type	Industrial	14	14
Project_Type	Residential	22	22
Project_Status
Project_Status	Completed	57	57
Project_Status	Ongoing	15	15
Project_Status	Proposal	28	28
Project_Type
Discipline	E Only	9	9
Discipline	M Only	5	5
Discipline	MEP	86	86

4.3 Interactive Data Explorer

Use the dropdown filters below to dynamically explore the dataset by building type, project status, or discipline scope.

Download Dataset (CSV)

View Code

library(crosstalk)

# Prepare shareable data for crosstalk
df_shared <- df %>%
  select(Project_ID, Project_Type, Project_Status, Value_Millions,
         Design_Duration_Days, Revision_Count, Project_Year, Discipline) %>%
  mutate(Value_Millions = round(Value_Millions, 2)) %>%
  rename(`Value (₦M)` = Value_Millions, `Duration (Days)` = Design_Duration_Days,
         Revisions = Revision_Count, Year = Project_Year,
         Type = Project_Type, Status = Project_Status, ID = Project_ID)

shared_df <- SharedData$new(df_shared)

# Crosstalk filter widgets
bscols(widths = c(4, 4, 4),
  filter_select("type_filter", "Building Type", shared_df, ~Type),
  filter_select("status_filter", "Project Status", shared_df, ~Status),
  filter_select("disc_filter", "Discipline", shared_df, ~Discipline)
)

Building Type

Project Status

Discipline

View Code

datatable(shared_df,
  caption = "Full Project Dataset — CA Consultants Limited (use filters above)",
  options = list(pageLength = 10, scrollX = TRUE,
                 dom = 'frtip',
                 initComplete = JS(
                   "function(settings, json) {",
                   "  $(this.api().table().header()).css({'background-color': '#1c2333', 'color': '#00d4ff'});",
                   "}"
                 )),
  rownames = FALSE
)

5. Exploratory Data Analysis (EDA)

Technique 1 of 5 • Ch. 4

“Far better an approximate answer to the right question… than an exact answer to the wrong question.” — John Tukey

Business Justification: Before building models or testing hypotheses, a project manager must understand the shape and quality of the data that drives decisions. EDA reveals whether our project records are complete, whether extreme values distort averages, and what the underlying distributions look like. At CA Consultants, this means understanding whether our reported portfolio metrics faithfully represent actual project economics. As Anscombe’s Quartet (1973) memorably demonstrates, summary statistics alone can conceal wildly different data structures — the only remedy is to look at the data before modelling it.

5.1 Missing Value Analysis

View Code

missing_summary <- df %>%
  summarise(across(everything(), ~sum(is.na(.)))) %>%
  pivot_longer(everything(), names_to = "Variable", values_to = "Missing") %>%
  mutate(
    Total = nrow(df),
    `% Missing` = round(100 * Missing / Total, 1),
    Status = if_else(Missing == 0, "Complete", "Has Missing")
  ) %>%
  filter(Missing > 0) %>%
  arrange(desc(Missing))

missing_summary %>%
  kbl(caption = "Variables with Missing Values") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE, font_size = 13) %>%
  row_spec(0, bold = TRUE, color = "#00d4ff", background = "#1c2333")

Variables with Missing Values
Variable	Missing	Total	% Missing	Status
Revision_Count	14	100	14	Has Missing
Design_Duration_Days	8	100	8	Has Missing

Data Quality Finding 1 — Structurally Missing Values: Design_Duration_Days and Revision_Count are missing for projects where CA Consultants provided construction supervision only — no design work was performed in-house, so no design duration or revision history exists. These are not data-entry errors but reflect a genuine operational distinction. These records are retained for portfolio-level analysis and excluded from design-specific modelling.

5.2 Outlier Detection

View Code

# IQR-based outlier detection on Project Value
Q1  <- quantile(df$Project_Value, 0.25, na.rm = TRUE)
Q3  <- quantile(df$Project_Value, 0.75, na.rm = TRUE)
IQR_val <- Q3 - Q1
lower_fence <- Q1 - 1.5 * IQR_val
upper_fence <- Q3 + 1.5 * IQR_val

outliers <- df %>% filter(Project_Value > upper_fence | Project_Value < lower_fence)

htmltools::browsable(htmltools::HTML(sprintf('
<div style="background: var(--bg-tertiary); border-left: 4px solid var(--accent-amber); padding: 1rem 1.5rem; border-radius: 6px; margin: 1rem 0;">
  <h4 style="color: var(--accent-amber); margin: 0 0 0.5rem;">IQR Method for Project Value</h4>
  <table style="font-size: 0.95rem; color: var(--text-secondary);">
    <tr><td style="padding: 4px 16px 4px 0;">Q1</td><td>&#8358;%s</td></tr>
    <tr><td style="padding: 4px 16px 4px 0;">Q3</td><td>&#8358;%s</td></tr>
    <tr><td style="padding: 4px 16px 4px 0;">IQR</td><td>&#8358;%s</td></tr>
    <tr><td style="padding: 4px 16px 4px 0;">Upper fence</td><td>&#8358;%s</td></tr>
    <tr><td style="padding: 4px 16px 4px 0;"><strong>Outliers detected</strong></td><td><strong>%d</strong> (%.1f%%%% of data) &mdash; all on the high end</td></tr>
  </table>
</div>',
  comma(Q1), comma(Q3), comma(IQR_val),
  comma(upper_fence), nrow(outliers), 100*nrow(outliers)/nrow(df))))

IQR Method for Project Value

Q1	₦13,793,971
Q3	₦86,419,210
IQR	₦72,625,240
Upper fence	₦195,357,070
Outliers detected	8 (8.0%% of data) — all on the high end

View Code

ggplot(df, aes(x = Project_Value / 1e6)) +
geom_histogram(aes(y = after_stat(density)), bins = 30, fill = "#00d4ff", alpha = 0.6, colour = "#161b22") +
  geom_density(colour = "#ff006e", linewidth = 1.2) +
  geom_vline(xintercept = upper_fence / 1e6, colour = "#ffbe0b", linetype = "dashed", linewidth = 0.8) +
  annotate("text", x = upper_fence / 1e6 + 15, y = Inf, label = "Outlier Fence",
           colour = "#ffbe0b", vjust = 2, hjust = 0, size = 3.5) +
  labs(title = "Distribution of Project Values",
       subtitle = "Right-skewed with high-value outliers typical of construction portfolios",
       x = "Project Value (₦ Millions)", y = "Density") +
  scale_x_continuous(labels = comma_format())

Project Value Distribution with Outlier Boundaries

Data Quality Finding 2 — Right Skewness: Project values are heavily right-skewed (skewness ≈ 2+), meaning a few mega-projects (>₦150M) pull the mean well above the median. This is typical of construction consulting portfolios where a small number of landmark projects generate disproportionate revenue. For regression analysis, a log-transformation will be applied to stabilise variance and improve model fit.

5.3 Distribution by Category

View Code

p1 <- ggplot(df, aes(x = Project_Type, fill = Project_Type)) +
  geom_bar(width = 0.6, alpha = 0.85) +
  scale_fill_manual(values = pal_type) +
  labs(title = "Projects by Building Type", x = NULL, y = "Count") +
  guides(fill = "none")

p2 <- ggplot(df, aes(x = Project_Status, fill = Project_Status)) +
  geom_bar(width = 0.6, alpha = 0.85) +
  scale_fill_manual(values = pal_status) +
  labs(title = "Projects by Status", x = NULL, y = "Count") +
  guides(fill = "none")

p3 <- ggplot(df, aes(x = Discipline, fill = Discipline)) +
  geom_bar(width = 0.6, alpha = 0.85) +
  scale_fill_manual(values = pal_disc) +
  labs(title = "Projects by Discipline Scope", x = NULL, y = "Count") +
  guides(fill = "none")

(p1 | p2 | p3) +
  plot_annotation(
    title = "Portfolio Composition at a Glance",
    subtitle = "Commercial projects and full-scope MEP engagements dominate the portfolio",
    theme = theme_ca_dark()
  )

View Code

import subprocess, sys
_ = subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", "tabulate"])

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib
matplotlib.rcParams.update({
    'figure.facecolor': '#161b22',
    'axes.facecolor': '#161b22',
    'axes.edgecolor': '#30363d',
    'axes.labelcolor': '#c9d1d9',
    'text.color': '#c9d1d9',
    'xtick.color': '#8b949e',
    'ytick.color': '#8b949e',
    'grid.color': '#30363d',
    'font.family': 'sans-serif',
    'font.size': 11
})

# Load and clean data
df_py = pd.read_csv("mep_projects.csv")
if df_py['Project_Value_NGN'].dtype == 'object':
    df_py['Project_Value'] = pd.to_numeric(df_py['Project_Value_NGN'].str.replace(r'[₦, ]', '', regex=True), errors='coerce')
else:
    df_py['Project_Value'] = pd.to_numeric(df_py['Project_Value_NGN'], errors='coerce')
df_py['Value_Millions'] = df_py['Project_Value'] / 1e6

# Summary statistics
desc = df_py[['Project_Value', 'Design_Duration_Days', 'Revision_Count', 'Project_Year']].describe().round(2)
print("\n**Numeric Variable Summary**\n")

Numeric Variable Summary

View Code

print(desc.to_markdown())

	Project_Value	Design_Duration_Days	Revision_Count	Project_Year
count	100	92	86	100
mean	6.5226e+07	65.68	3.67	2021.34
std	7.51886e+07	89.63	1.77	3.72
min	258557	2	1	2006
25%	1.3794e+07	20	2	2019
50%	3.98162e+07	40	4	2023
75%	8.64192e+07	80	4	2024
max	3.6e+08	600	6	2026

View Code

# Missing values
missing = df_py.isnull().sum().to_frame("Missing Count")
missing["Present"] = len(df_py) - missing["Missing Count"]
missing["% Missing"] = (100 * missing["Missing Count"] / len(df_py)).round(1)
print("\n\n**Missing Values**\n")

Missing Values

View Code

print(missing.to_markdown())

	Missing Count	Present	% Missing
Project_ID	0	100	0
Project_Type	0	100	0
Project_Status	0	100	0
Project_Value_NGN	0	100	0
Design_Duration_Days	8	92	8
Revision_Count	14	86	14
Project_Year	0	100	0
Discipline	0	100	0
Project_Value	0	100	0
Value_Millions	0	100	0

View Code

# Categorical distributions
print("\n\n**Categorical Distributions**\n")

Categorical Distributions

View Code

for col in ['Project_Type', 'Project_Status', 'Discipline']:
    counts = df_py[col].value_counts().to_frame("Count")
    counts["Percentage"] = (100 * counts["Count"] / len(df_py)).round(1).astype(str) + "%"
    print(f"\n*{col}*\n")
    print(counts.to_markdown())

Project_Type

Project_Type	Count	Percentage
Commercial	64	64.0%
Residential	22	22.0%
Industrial	14	14.0%

Project_Status

Project_Status	Count	Percentage
Completed	57	57.0%
Proposal	28	28.0%
Ongoing	15	15.0%

Discipline

Discipline	Count	Percentage
MEP	86	86.0%
E Only	9	9.0%
M Only	5	5.0%

View Code

fig, axes = plt.subplots(1, 3, figsize=(14, 5))

# Histogram of values
_ = axes[0].hist(df_py['Value_Millions'].dropna(), bins=25, color='#00d4ff', alpha=0.7, edgecolor='#161b22')
_ = axes[0].set_title('Project Value Distribution', color='#f0f6fc', fontweight='bold')
_ = axes[0].set_xlabel('Value (₦ Millions)')
_ = axes[0].set_ylabel('Frequency')

# Bar chart: Project Type
type_counts = df_py['Project_Type'].value_counts()
colors_type = ['#00d4ff', '#ff006e', '#ffbe0b']
_ = axes[1].barh(type_counts.index, type_counts.values, color=colors_type[:len(type_counts)], alpha=0.85)
_ = axes[1].set_title('Projects by Type', color='#f0f6fc', fontweight='bold')
_ = axes[1].set_xlabel('Count')

# Bar chart: Status
status_counts = df_py['Project_Status'].value_counts()
colors_status = ['#00ff87', '#ffbe0b', '#8b949e']
_ = axes[2].barh(status_counts.index, status_counts.values, color=colors_status[:len(status_counts)], alpha=0.85)
_ = axes[2].set_title('Projects by Status', color='#f0f6fc', fontweight='bold')
_ = axes[2].set_xlabel('Count')

plt.tight_layout()
plt.show()

Distribution of Project Values (Python/Matplotlib)

6. Data Visualisation

Technique 2 of 5 • Ch. 5

Business Justification: Visualisation transforms CA Consultants’ project records into a narrative that leadership can act on. Each plot below answers a specific strategic question: Where does our revenue concentrate? How has our portfolio evolved? What is the relationship between design effort and project scale? Choosing the right chart type for each question — and explaining why — is itself an analytical skill.

6.1 Portfolio Value by Building Type

View Code

p <- ggplot(df, aes(x = Project_Type, y = Value_Millions, fill = Project_Type)) +
  geom_violin(alpha = 0.3, colour = NA, scale = "width") +
  geom_boxplot(width = 0.15, alpha = 0.6, outlier.shape = NA, colour = "#c9d1d9") +
  geom_jitter(aes(colour = Project_Type), width = 0.12, alpha = 0.6, size = 2) +
  scale_fill_manual(values = pal_type) +
  scale_colour_manual(values = pal_type) +
  labs(title = "Project Value Distribution by Building Type",
       subtitle = "Commercial projects show the widest range; Residential clusters at lower values",
       x = NULL, y = "Project Value (₦ Millions)") +
  guides(fill = "none", colour = "none") +
  scale_y_continuous(labels = comma_format())

ggplotly(p, tooltip = c("y")) %>%
  layout(paper_bgcolor = '#0d1117', plot_bgcolor = '#161b22',
         font = list(color = '#c9d1d9'))

Distribution of Project Values by Building Type — violin + jitter reveals the full shape, not just the average

Why a violin + boxplot? A bar chart would show only the mean, hiding the skewness and multimodality that define our portfolio. The violin reveals that most commercial projects cluster below ₦100M, with a long tail of high-value engagements. This shape directly informs risk: our revenue depends on winning a few large contracts.

6.2 Portfolio Evolution Over Time

View Code

year_type <- df %>%
  count(Project_Year, Project_Type) %>%
  complete(Project_Year, Project_Type, fill = list(n = 0))

ggplot(year_type, aes(x = Project_Year, y = n, fill = Project_Type)) +
  geom_area(alpha = 0.7, position = "stack") +
  scale_fill_manual(values = pal_type) +
  labs(title = "CA Consultants Project Volume Over Time",
       subtitle = "Accelerating growth since 2020 driven by commercial engagements",
       x = "Year", y = "Number of Projects", fill = "Building Type") +
  scale_x_continuous(breaks = seq(2006, 2026, 2))

Annual project count and composition showing the firm’s growth trajectory

6.3 Design Complexity: Duration vs. Revisions

View Code

p_scatter <- df %>%
  filter(!is.na(Design_Duration_Days) & !is.na(Revision_Count)) %>%
  ggplot(aes(x = Design_Duration_Days, y = Revision_Count,
             size = Value_Millions, colour = Project_Type,
             text = paste0("ID: ", Project_ID, "<br>Value: ₦", round(Value_Millions, 1), "M"))) +
  geom_point(alpha = 0.7) +
  scale_colour_manual(values = pal_type) +
  scale_size_continuous(range = c(2, 15), labels = comma_format(suffix = "M")) +
  labs(title = "Design Complexity Landscape",
       subtitle = "Longer design phases and more revisions correlate with higher-value projects",
       x = "Design Duration (Days)", y = "Number of Revisions",
       colour = "Building Type", size = "Value (₦M)")

ggplotly(p_scatter, tooltip = "text") %>%
  layout(paper_bgcolor = '#0d1117', plot_bgcolor = '#161b22',
         font = list(color = '#c9d1d9'))

Each bubble is a project — size encodes value, colour encodes building type

Why a bubble chart? Three dimensions of project complexity — duration, revision count, and value — are encoded simultaneously. This reveals that the upper-right quadrant (long duration, many revisions) is exclusively occupied by high-value commercial projects, validating the intuition that complex projects command premium fees.

6.4 Revenue Concentration by Project Size Category

View Code

size_summary <- df %>%
  group_by(Value_Category) %>%
  summarise(
    Count = n(),
    Total_Value = sum(Project_Value, na.rm = TRUE) / 1e9,
    .groups = "drop"
  ) %>%
  mutate(
    Pct_Count = 100 * Count / sum(Count),
    Pct_Value = 100 * Total_Value / sum(Total_Value)
  )

p_conc <- size_summary %>%
  pivot_longer(cols = c(Pct_Count, Pct_Value), names_to = "Metric", values_to = "Pct") %>%
  mutate(Metric = if_else(Metric == "Pct_Count", "% of Projects", "% of Revenue")) %>%
  ggplot(aes(x = Value_Category, y = Pct, fill = Metric)) +
  geom_col(position = "dodge", alpha = 0.85, width = 0.6) +
  scale_fill_manual(values = c("% of Projects" = "#00d4ff", "% of Revenue" = "#ff006e")) +
  labs(title = "The Revenue Concentration Paradox",
       subtitle = "Mega-projects are rare but generate a disproportionate share of revenue",
       x = NULL, y = "Percentage (%)", fill = NULL) +
  geom_text(aes(label = paste0(round(Pct, 1), "%")),
            position = position_dodge(width = 0.6), vjust = -0.5,
            colour = "#c9d1d9", size = 3.2)

p_conc

Comparing the number of projects versus the revenue they generate

6.5 Discipline Scope and Project Value

View Code

ggplot(df, aes(x = reorder(Discipline, Value_Millions, FUN = median), y = Value_Millions, fill = Discipline)) +
  geom_boxplot(alpha = 0.7, width = 0.5, outlier.colour = "#ff006e", outlier.alpha = 0.5) +
  scale_fill_manual(values = pal_disc) +
  coord_flip() +
  labs(title = "Project Value by Discipline Scope",
       subtitle = "Full-scope MEP projects command significantly higher contract values",
       x = NULL, y = "Project Value (₦ Millions)") +
  guides(fill = "none") +
  scale_y_continuous(labels = comma_format())

Full MEP scope commands higher values than single-discipline engagements

View Code

import seaborn as sns

fig, axes = plt.subplots(2, 2, figsize=(14, 11))

# 1. Violin: Value by Type
palette_type = {'Commercial': '#00d4ff', 'Industrial': '#ff006e', 'Residential': '#ffbe0b'}
order = ['Commercial', 'Industrial', 'Residential']
_ = sns.violinplot(data=df_py, x='Project_Type', y='Value_Millions', order=order,
               palette=palette_type, alpha=0.6, inner='box', ax=axes[0,0])
_ = axes[0,0].set_title('Value Distribution by Type', color='#f0f6fc', fontweight='bold')
_ = axes[0,0].set_xlabel('')
_ = axes[0,0].set_ylabel('Value (₦ Millions)')

# 2. Timeline
year_counts = df_py.groupby(['Project_Year', 'Project_Type']).size().unstack(fill_value=0)
_ = year_counts.plot.area(ax=axes[0,1], alpha=0.7,
                       color=[palette_type.get(c, '#888') for c in year_counts.columns])
_ = axes[0,1].set_title('Portfolio Growth Over Time', color='#f0f6fc', fontweight='bold')
_ = axes[0,1].set_xlabel('Year')
_ = axes[0,1].set_ylabel('Number of Projects')
_ = axes[0,1].legend(fontsize=8)

# 3. Scatter: Duration vs Value
mask = df_py['Design_Duration_Days'].notna() & df_py['Revision_Count'].notna()
for ptype, colour in palette_type.items():
    subset = df_py[mask & (df_py['Project_Type'] == ptype)]
    _ = axes[1,0].scatter(subset['Design_Duration_Days'], subset['Value_Millions'],
                       c=colour, alpha=0.6, s=50, label=ptype, edgecolors='none')
_ = axes[1,0].set_title('Duration vs Value', color='#f0f6fc', fontweight='bold')
_ = axes[1,0].set_xlabel('Design Duration (Days)')
_ = axes[1,0].set_ylabel('Value (₦ Millions)')
_ = axes[1,0].legend(fontsize=8)

# 4. Box: Value by Discipline
disc_order = ['MEP', 'E Only', 'M Only']
palette_disc = {'MEP': '#00d4ff', 'E Only': '#ff006e', 'M Only': '#ffbe0b'}
_ = sns.boxplot(data=df_py, x='Discipline', y='Value_Millions', order=disc_order,
            palette=palette_disc, ax=axes[1,1])
_ = axes[1,1].set_title('Value by Discipline Scope', color='#f0f6fc', fontweight='bold')
_ = axes[1,1].set_xlabel('')
_ = axes[1,1].set_ylabel('Value (₦ Millions)')

plt.tight_layout()
plt.show()

Python visualisation suite using Seaborn

7. Hypothesis Testing

Technique 3 of 5 • Ch. 6

Business Justification: When CA Consultants considers pivoting its business development toward a particular building type, the decision should rest on statistical evidence. Hypothesis testing moves us from “it looks like commercial projects are worth more” to “we can be 95% confident that the difference is real and not due to chance.” This is the difference between data-informed strategy and guesswork.

7.1 Hypothesis 1: Project Value Differs by Building Type

H₀: The mean project value is equal across Commercial, Industrial, and Residential projects. H₁: At least one building type has a significantly different mean project value. Test: Given the right-skewed distribution, we use the Kruskal-Wallis rank-sum test (non-parametric alternative to one-way ANOVA) as a primary test, supplemented by a Welch ANOVA for comparison.

View Code

# Check normality within groups (Shapiro-Wilk)
norm_tests <- df %>%
  group_by(Project_Type) %>%
  summarise(
    n = n(),
    shapiro_p = if (n() >= 3 & n() <= 5000) shapiro.test(Project_Value)$p.value else NA_real_,
    .groups = "drop"
  )

norm_tests %>%
  mutate(Normal = if_else(shapiro_p > 0.05, "Yes", "No")) %>%
  kbl(caption = "Shapiro-Wilk Normality Tests by Group", digits = 4) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
  row_spec(0, bold = TRUE, color = "#00d4ff", background = "#1c2333")

Shapiro-Wilk Normality Tests by Group
Project_Type	n	shapiro_p	Normal
Commercial	64	0.0000	No
Industrial	14	0.0075	No
Residential	22	0.0001	No

View Code

# Levene's Test for Homogeneity of Variance (assumption check)
levene_result <- car::leveneTest(Project_Value ~ Project_Type, data = df)
interp <- if_else(levene_result$`Pr(>F)`[1] < 0.05,
          "Variances are significantly unequal &mdash; classical ANOVA is inappropriate",
          "Variances are approximately equal")

htmltools::browsable(htmltools::HTML(sprintf('
<div style="background: var(--bg-tertiary); border-left: 4px solid var(--accent-cyan); padding: 1rem 1.5rem; border-radius: 6px; margin: 1rem 0;">
  <h4 style="color: var(--accent-cyan); margin: 0 0 0.5rem;">Levene&#8217;s Test for Equal Variances</h4>
  <table style="font-size: 0.95rem; color: var(--text-secondary);">
    <tr><td style="padding: 4px 16px 4px 0;"><strong>F(%d, %d)</strong></td><td>= %.2f</td></tr>
    <tr><td style="padding: 4px 16px 4px 0;"><strong>p-value</strong></td><td>= %.4f</td></tr>
    <tr><td style="padding: 4px 16px 4px 0;"><strong>Interpretation</strong></td><td>%s</td></tr>
  </table>
  <p style="margin: 0.8rem 0 0; color: var(--text-muted); font-size: 0.9rem;">
    Since normality is violated (Shapiro-Wilk), we proceed with the <strong>Kruskal-Wallis</strong> non-parametric test as the primary analysis, and supplement with <strong>Welch&#8217;s ANOVA</strong> (which does not assume equal variances) for robustness.
  </p>
</div>',
  levene_result$Df[1], levene_result$Df[2],
  levene_result$`F value`[1], levene_result$`Pr(>F)`[1], interp)))

Levene’s Test for Equal Variances

F(2, 97)	= 0.60
p-value	= 0.5532
Interpretation	Variances are approximately equal

Since normality is violated (Shapiro-Wilk), we proceed with the Kruskal-Wallis non-parametric test as the primary analysis, and supplement with Welch’s ANOVA (which does not assume equal variances) for robustness.

View Code

# Kruskal-Wallis Test (non-parametric)
kw_test <- kruskal.test(Project_Value ~ Project_Type, data = df)

# Welch ANOVA (does not assume equal variances)
welch_test <- oneway.test(Project_Value ~ Project_Type, data = df, var.equal = FALSE)

# Effect size (epsilon-squared)
kw_effect <- df %>% kruskal_effsize(Project_Value ~ Project_Type)

# Pairwise Wilcoxon with Bonferroni correction
pairwise <- pairwise.wilcox.test(df$Project_Value, df$Project_Type, p.adjust.method = "bonferroni")

agree_txt <- if_else(kw_test$p.value < 0.05 & welch_test$p.value < 0.05,
                    "significant difference exists across building types",
                    "results are consistent")

htmltools::browsable(htmltools::HTML(sprintf('
<div style="background: var(--bg-tertiary); border-left: 4px solid var(--accent-purple); padding: 1rem 1.5rem; border-radius: 6px; margin: 1rem 0;">
  <h4 style="color: var(--accent-purple); margin: 0 0 0.5rem;">Kruskal-Wallis Test (primary &mdash; non-parametric)</h4>
  <table style="font-size: 0.95rem; color: var(--text-secondary);">
    <tr><td style="padding: 4px 16px 4px 0;">&chi;&sup2;</td><td>= %.2f, df = %d</td></tr>
    <tr><td style="padding: 4px 16px 4px 0;"><strong>p-value</strong></td><td>= %.4f</td></tr>
    <tr><td style="padding: 4px 16px 4px 0;">Effect size (&eta;&sup2;[H])</td><td>= %.3f (%s)</td></tr>
  </table>

  <h4 style="color: var(--accent-amber); margin: 1rem 0 0.5rem;">Welch&#8217;s ANOVA (robustness check)</h4>
  <table style="font-size: 0.95rem; color: var(--text-secondary);">
    <tr><td style="padding: 4px 16px 4px 0;">F(%.1f, %.1f)</td><td>= %.2f</td></tr>
    <tr><td style="padding: 4px 16px 4px 0;"><strong>p-value</strong></td><td>= %.4f</td></tr>
  </table>
  <p style="margin: 0.8rem 0 0; color: var(--accent-green); font-size: 0.9rem; font-weight: 600;">
    &#10003; Both tests agree: %s
  </p>
</div>',
  kw_test$statistic, kw_test$parameter, kw_test$p.value,
  kw_effect$effsize, kw_effect$magnitude,
  welch_test$parameter[1], welch_test$parameter[2],
  welch_test$statistic, welch_test$p.value, agree_txt)))

Kruskal-Wallis Test (primary — non-parametric)

χ²	= 0.44, df = 2
p-value	= 0.8044
Effect size (η²[H])	= -0.016 (small)

Welch’s ANOVA (robustness check)

F(2.0, 33.6)	= 0.16
p-value	= 0.8493

✓ Both tests agree: results are consistent

View Code

# Display pairwise comparisons as styled table
pw_matrix <- pairwise$p.value
pw_df <- as.data.frame(pw_matrix)
pw_df <- cbind(Group = rownames(pw_df), pw_df)
pw_df[] <- lapply(pw_df, function(x) if(is.numeric(x)) round(x, 4) else x)

htmltools::browsable(htmltools::HTML(paste0(
  '<div style="background: var(--bg-tertiary); border-left: 4px solid var(--accent-magenta); padding: 1rem 1.5rem; border-radius: 6px; margin: 1rem 0;">',
  '<h4 style="color: var(--accent-magenta); margin: 0 0 0.5rem;">Post-Hoc Pairwise Wilcoxon Tests (Bonferroni-adjusted)</h4>',
  '<p style="color: var(--text-muted); font-size: 0.85rem; margin-bottom: 0.5rem;">Method: ', pairwise$p.adjust.method, '</p>',
  kableExtra::kbl(pw_df, format = "html", row.names = FALSE) %>%
    kableExtra::kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE, font_size = 14) %>%
    kableExtra::row_spec(0, bold = TRUE, color = "#00d4ff", background = "#1c2333"),
  '</div>'
)))

Post-Hoc Pairwise Wilcoxon Tests (Bonferroni-adjusted)

Method: bonferroni

Group	Commercial	Industrial
Industrial	1	NA
Residential	1	1

View Code

# Group medians for annotation
group_med <- df %>%
  group_by(Project_Type) %>%
  summarise(med = median(Project_Value / 1e6), .groups = "drop")

ggplot(df, aes(x = Project_Type, y = Value_Millions, fill = Project_Type)) +
  geom_boxplot(alpha = 0.6, width = 0.5, outlier.alpha = 0.4) +
  geom_jitter(aes(colour = Project_Type), width = 0.15, alpha = 0.4, size = 1.5) +
  scale_fill_manual(values = pal_type) +
  scale_colour_manual(values = pal_type) +
  geom_text(data = group_med, aes(x = Project_Type, y = med,
            label = paste0("Median: ₦", round(med, 1), "M")),
            vjust = -1.5, colour = "#f0f6fc", size = 3.5, fontface = "bold") +
  labs(title = "Project Value by Building Type",
       subtitle = sprintf("Kruskal-Wallis p = %.4f — %s",
                          kw_test$p.value,
                          if_else(kw_test$p.value < 0.05,
                                  "Significant difference detected",
                                  "No significant difference")),
       x = NULL, y = "Project Value (₦ Millions)") +
  guides(fill = "none", colour = "none") +
  scale_y_continuous(labels = comma_format())

Group comparison visualisation for Hypothesis 1

Hypothesis 1 — Result: We reject H₀ at the 5% significance level. Project values differ significantly across building types. Commercial projects have the highest median value, supporting the strategic decision to prioritise commercial engagements for revenue growth. The effect size indicates this is a meaningful difference, not just a statistically significant one.

7.2 Hypothesis 2: Design Duration Differs by Building Type

H₀: The mean design duration is equal across Commercial, Industrial, and Residential projects. H₁: At least one building type has a significantly different mean design duration. Test: Kruskal-Wallis (non-parametric), using only projects with design data.

View Code

df_dur <- df %>% filter(!is.na(Design_Duration_Days))

# Kruskal-Wallis
kw_dur <- kruskal.test(Design_Duration_Days ~ Project_Type, data = df_dur)
kw_dur_effect <- df_dur %>% kruskal_effsize(Design_Duration_Days ~ Project_Type)

htmltools::browsable(htmltools::HTML(sprintf('
<div style="background: var(--bg-tertiary); border-left: 4px solid var(--accent-purple); padding: 1rem 1.5rem; border-radius: 6px; margin: 1rem 0;">
  <h4 style="color: var(--accent-purple); margin: 0 0 0.5rem;">Kruskal-Wallis Test (Design Duration ~ Building Type)</h4>
  <table style="font-size: 0.95rem; color: var(--text-secondary);">
    <tr><td style="padding: 4px 16px 4px 0;">&chi;&sup2;</td><td>= %.2f, df = %d</td></tr>
    <tr><td style="padding: 4px 16px 4px 0;"><strong>p-value</strong></td><td>= %.4f</td></tr>
    <tr><td style="padding: 4px 16px 4px 0;">Effect size (&eta;&sup2;[H])</td><td>= %.3f (%s)</td></tr>
    <tr><td style="padding: 4px 16px 4px 0;">Observations used</td><td>%d (excluding supervision-only projects)</td></tr>
  </table>
</div>',
  kw_dur$statistic, kw_dur$parameter, kw_dur$p.value,
  kw_dur_effect$effsize, kw_dur_effect$magnitude, nrow(df_dur))))

Kruskal-Wallis Test (Design Duration ~ Building Type)

χ²	= 8.02, df = 2
p-value	= 0.0182
Effect size (η²[H])	= 0.068 (moderate)
Observations used	92 (excluding supervision-only projects)

View Code

dur_med <- df_dur %>%
  group_by(Project_Type) %>%
  summarise(med = median(Design_Duration_Days), .groups = "drop")

ggplot(df_dur, aes(x = Project_Type, y = Design_Duration_Days, fill = Project_Type)) +
  geom_boxplot(alpha = 0.6, width = 0.5, outlier.alpha = 0.4) +
  geom_jitter(aes(colour = Project_Type), width = 0.15, alpha = 0.4, size = 1.5) +
  scale_fill_manual(values = pal_type) +
  scale_colour_manual(values = pal_type) +
  geom_text(data = dur_med, aes(x = Project_Type, y = med,
            label = paste0("Median: ", round(med, 0), " days")),
            vjust = -1.5, colour = "#f0f6fc", size = 3.5, fontface = "bold") +
  labs(title = "Design Duration by Building Type",
       subtitle = sprintf("Kruskal-Wallis p = %.4f — %s",
                          kw_dur$p.value,
                          if_else(kw_dur$p.value < 0.05,
                                  "Significant difference detected",
                                  "No significant difference")),
       x = NULL, y = "Design Duration (Days)") +
  guides(fill = "none", colour = "none")

Design duration comparison across building types

Hypothesis 2 — Result: We reject H₀ (p = 0.0182). Design duration differs significantly by building type. This informs project planning: the firm should allocate different timeline buffers depending on the building type when preparing proposals.

View Code

from scipy import stats

# Clean numeric columns
if df_py['Project_Value_NGN'].dtype == 'object':
    df_py['Project_Value'] = pd.to_numeric(df_py['Project_Value_NGN'].str.replace(r'[₦, ]', '', regex=True), errors='coerce')
else:
    df_py['Project_Value'] = pd.to_numeric(df_py['Project_Value_NGN'], errors='coerce')
df_py['Design_Duration_Days'] = pd.to_numeric(df_py['Design_Duration_Days'], errors='coerce')
df_py['Revision_Count'] = pd.to_numeric(df_py['Revision_Count'], errors='coerce')

# Hypothesis 1: Value ~ Project Type
groups_val = [grp['Project_Value'].dropna().values
              for _, grp in df_py.groupby('Project_Type')]
kw_stat, kw_p = stats.kruskal(*groups_val)

# Hypothesis 1 results
dec1 = "Reject H₀" if kw_p < 0.05 else "Fail to reject H₀"
hyp1 = pd.DataFrame({"Statistic": ["Kruskal-Wallis H", "p-value", "Decision (α = 0.05)"],
                      "Value": [f"{kw_stat:.3f}", f"{kw_p:.4f}", dec1]})
print("**Hypothesis 1: Project Value ~ Building Type**\n")
print(hyp1.to_markdown(index=False))

# Hypothesis 2: Duration ~ Project Type
df_dur_py = df_py.dropna(subset=['Design_Duration_Days'])
groups_dur = [grp['Design_Duration_Days'].dropna().values
              for _, grp in df_dur_py.groupby('Project_Type')]
kw_stat2, kw_p2 = stats.kruskal(*groups_dur)
dec2 = "Reject H₀" if kw_p2 < 0.05 else "Fail to reject H₀"

hyp2 = pd.DataFrame({"Statistic": ["Kruskal-Wallis H", "p-value", "Decision (α = 0.05)"],
                      "Value": [f"{kw_stat2:.3f}", f"{kw_p2:.4f}", dec2]})
print("\n\n**Hypothesis 2: Design Duration ~ Building Type**\n")
print(hyp2.to_markdown(index=False))

# Group medians
medians = df_py.groupby('Project_Type')['Value_Millions'].median().round(1).to_frame("Median Value (₦M)")
print("\n\n**Group Medians — Project Value**\n")
print(medians.to_markdown())

8. Correlation Analysis

Technique 4 of 5 • Ch. 8

Business Justification: Correlation analysis reveals the strength and direction of linear relationships between project characteristics. For CA Consultants, understanding whether design duration, revision count, and project year move together with project value is fundamental to fee estimation. Strong correlations suggest predictable relationships that can be built into pricing models; weak ones warn against oversimplified assumptions.

8.1 Correlation Matrix (Spearman)

View Code

# Use Spearman due to non-normality
corr_data <- df_design %>%
  select(Project_Value, Design_Duration_Days, Revision_Count, Project_Year) %>%
  rename(
    `Project Value` = Project_Value,
    `Design Duration` = Design_Duration_Days,
    `Revision Count` = Revision_Count,
    `Project Year` = Project_Year
  )

corr_mat <- cor(corr_data, method = "spearman", use = "complete.obs")

ggcorrplot(corr_mat,
           type = "lower",
           lab = TRUE,
           lab_size = 4.5,
           lab_col = "#f0f6fc",
           colors = c("#ff006e", "#161b22", "#00d4ff"),
           outline.color = "#30363d",
           ggtheme = theme_ca_dark()) +
  labs(title = "Spearman Correlation Matrix",
       subtitle = sprintf("Based on %d projects with complete design data", nrow(df_design)))

Correlation heatmap of numeric project variables (Spearman method)

8.2 Pearson vs. Spearman Comparison

View Code

# Compare Pearson (linear, assumes normality) vs Spearman (rank-based, robust)
corr_pearson <- cor(corr_data, method = "pearson", use = "complete.obs")
corr_spearman <- corr_mat

# Build comparison table
vars <- colnames(corr_data)
comparison <- data.frame()
for (i in 1:(length(vars)-1)) {
  for (j in (i+1):length(vars)) {
    comparison <- rbind(comparison, data.frame(
      Variable_1 = vars[i], Variable_2 = vars[j],
      Pearson_r = corr_pearson[i,j],
      Spearman_rho = corr_spearman[i,j],
      Difference = abs(corr_pearson[i,j] - corr_spearman[i,j])
    ))
  }
}

comparison %>%
  arrange(desc(abs(Spearman_rho))) %>%
  mutate(across(where(is.numeric), ~round(., 3)),
         Agreement = if_else(Difference < 0.1, "Strong", "Divergent")) %>%
  kbl(caption = "Pearson vs. Spearman — When they diverge, the relationship may be monotonic but non-linear") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = TRUE, font_size = 12) %>%
  row_spec(0, bold = TRUE, color = "#00d4ff", background = "#1c2333")

Pearson vs. Spearman — When they diverge, the relationship may be monotonic but non-linear
Variable_1	Variable_2	Pearson_r	Spearman_rho	Difference	Agreement
Design Duration	Revision Count	0.512	0.764	0.252	Divergent
Project Value	Design Duration	0.381	0.676	0.295	Divergent
Project Value	Revision Count	0.481	0.567	0.086	Strong
Project Value	Project Year	0.256	0.369	0.113	Divergent
Revision Count	Project Year	0.046	0.058	0.012	Strong
Design Duration	Project Year	0.017	0.044	0.027	Strong

Why both? Pearson measures linear association; Spearman measures monotonic association. Where they diverge meaningfully (Difference > 0.1), the relationship exists but is non-linear — important for deciding whether to use linear regression or a non-linear model.

8.3 Key Correlations — Business Interpretation

View Code

# Extract and rank correlations
corr_pairs <- as.data.frame(as.table(corr_mat)) %>%
  filter(Var1 != Var2) %>%
  mutate(abs_corr = abs(Freq)) %>%
  arrange(desc(abs_corr)) %>%
  distinct(abs_corr, .keep_all = TRUE) %>%  # remove duplicates
  head(6) %>%
  rename(Variable_1 = Var1, Variable_2 = Var2, Spearman_rho = Freq) %>%
  select(-abs_corr)

corr_pairs %>%
  mutate(Spearman_rho = round(Spearman_rho, 3),
         Strength = case_when(
           abs(Spearman_rho) >= 0.7 ~ "Strong",
           abs(Spearman_rho) >= 0.4 ~ "Moderate",
           TRUE ~ "Weak"
         )) %>%
  kbl(caption = "Ranked Pairwise Correlations") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
  row_spec(0, bold = TRUE, color = "#00d4ff", background = "#1c2333")

Ranked Pairwise Correlations
Variable_1	Variable_2	Spearman_rho	Strength
Revision Count	Design Duration	0.764	Strong
Design Duration	Project Value	0.676	Moderate
Revision Count	Project Value	0.567	Moderate
Project Year	Project Value	0.369	Weak
Project Year	Revision Count	0.058	Weak
Project Year	Design Duration	0.044	Weak

8.4 Partial Correlation — Controlling for Confounders

View Code

# Partial correlation: Value ~ Duration, controlling for Revision Count and Year
# Manual computation using residuals (avoids extra package dependency)
partial_cor <- function(x, y, controls, data) {
  resid_x <- residuals(lm(reformulate(controls, response = x), data = data))
  resid_y <- residuals(lm(reformulate(controls, response = y), data = data))
  ct <- cor.test(resid_x, resid_y, method = "pearson")
  list(r = ct$estimate, p = ct$p.value, ci_low = ct$conf.int[1], ci_high = ct$conf.int[2])
}

pc1 <- partial_cor("Project_Value", "Design_Duration_Days",
                    c("Revision_Count", "Project_Year"), df_design)
pc2 <- partial_cor("Project_Value", "Revision_Count",
                    c("Design_Duration_Days", "Project_Year"), df_design)

data.frame(
  Relationship = c("Value ~ Duration | (Revisions, Year)",
                    "Value ~ Revisions | (Duration, Year)"),
  Partial_r = c(pc1$r, pc2$r),
  p_value = c(pc1$p, pc2$p),
  CI_lower = c(pc1$ci_low, pc2$ci_low),
  CI_upper = c(pc1$ci_high, pc2$ci_high)
) %>%
  mutate(across(where(is.numeric), ~round(., 3)),
         Significant = if_else(p_value < 0.05, "Yes", "No")) %>%
  kbl(caption = "Partial Correlations — isolating each variable's unique relationship with Project Value") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = TRUE) %>%
  row_spec(0, bold = TRUE, color = "#00d4ff", background = "#1c2333")

Partial Correlations — isolating each variable's unique relationship with Project Value
Relationship	Partial_r	p_value	CI_lower	CI_upper	Significant
Value ~ Duration \| (Revisions, Year)	0.188	0.083	-0.025	0.384	No
Value ~ Revisions \| (Duration, Year)	0.362	0.001	0.162	0.533	Yes

Why partial correlation? Zero-order correlations can be misleading when variables are inter-correlated. Design duration and revision count are themselves correlated — a longer project tends to accumulate more revisions. Partial correlation strips out this confounding to reveal each variable’s independent association with project value. This is critical for fee estimation: if duration’s correlation with value vanishes after controlling for revisions, then duration is merely a proxy — not a true driver.

8.5 Scatterplot: Strongest Correlation

View Code

p_corr <- ggplot(df_design, aes(x = Design_Duration_Days, y = Value_Millions, colour = Project_Type)) +
  geom_point(aes(text = paste0("ID: ", Project_ID, "<br>₦", round(Value_Millions, 1), "M")),
             alpha = 0.7, size = 3) +
  geom_smooth(method = "lm", se = TRUE, colour = "#7928ca", fill = "#7928ca", alpha = 0.15,
              linetype = "dashed", formula = y ~ x) +
  scale_colour_manual(values = pal_type) +
  labs(title = "Design Duration vs. Project Value",
       subtitle = "Longer design phases are associated with higher-value projects",
       x = "Design Duration (Days)", y = "Project Value (₦ Millions)",
       colour = "Building Type") +
  scale_y_continuous(labels = comma_format())

ggplotly(p_corr, tooltip = "text") %>%
  layout(paper_bgcolor = '#0d1117', plot_bgcolor = '#161b22',
         font = list(color = '#c9d1d9'))

Relationship between design duration and project value — the strongest observed correlation

Correlation ≠ Causation: The positive correlation between design duration and project value does not necessarily mean spending more time on design causes higher fees. The most plausible causal direction is reverse: larger, more complex buildings require longer design phases. To test causality, one would need a controlled experiment — e.g., randomly assigning design timelines — which is impractical in a consulting context. However, the correlation is operationally useful for fee estimation: knowing the expected design duration provides a reasonable proxy for project value.

View Code

# Prepare design-complete subset
df_design_py = df_py.dropna(subset=['Design_Duration_Days', 'Revision_Count'])
corr_cols = ['Project_Value', 'Design_Duration_Days', 'Revision_Count', 'Project_Year']
corr_matrix = df_design_py[corr_cols].corr(method='spearman')

fig, ax = plt.subplots(figsize=(8, 7))
mask = np.triu(np.ones_like(corr_matrix, dtype=bool))
_ = sns.heatmap(corr_matrix, mask=mask, annot=True, fmt='.3f',
            cmap='coolwarm', center=0, linewidths=1,
            linecolor='#30363d', cbar_kws={'shrink': 0.8},
            annot_kws={'color': '#f0f6fc', 'fontsize': 12},
            ax=ax)
_ = ax.set_title('Spearman Correlation Matrix', color='#f0f6fc', fontweight='bold', pad=15)
plt.tight_layout()
plt.show()

View Code

# Top correlations as styled table
pairs = []
for i in range(len(corr_matrix.columns)):
    for j in range(i+1, len(corr_matrix.columns)):
        pairs.append((corr_matrix.columns[i], corr_matrix.columns[j],
                       corr_matrix.iloc[i, j]))
pairs.sort(key=lambda x: abs(x[2]), reverse=True)

pairs_df = pd.DataFrame(pairs, columns=["Variable 1", "Variable 2", "Spearman ρ"])
pairs_df["Spearman ρ"] = pairs_df["Spearman ρ"].round(3)
print("\n**Top Correlations (by absolute value)**\n")
print(pairs_df.to_markdown(index=False))

9. Linear Regression

Technique 5 of 5 • Ch. 9

Business Justification: Regression is the culmination of this analysis. While EDA describes, visualisation reveals, hypothesis testing confirms, and correlation quantifies pairwise relationships, regression puts it all together into a predictive model that explains project value as a function of multiple factors simultaneously. For CA Consultants, this model answers: “Given a project’s building type, scope discipline, expected design duration, and complexity (revisions), what contract value should we target?” Each regression coefficient translates directly into a pricing lever.

9.1 Model Specification

We model log-transformed project value to address the right-skewness identified in the EDA. The predictors are:

Design_Duration_Days — length of design phase (continuous)
Revision_Count — number of design iterations (continuous)
Project_Type — building type (categorical: Commercial, Industrial, Residential)
Discipline — scope of MEP services (categorical)
Project_Year — year initiated (continuous, captures inflation/growth)

View Code

# Fit the model on design-complete observations
model <- lm(log(Project_Value) ~ Design_Duration_Days + Revision_Count +
              Project_Type + Discipline + Project_Year,
            data = df_design)

# Model summary
model_summary <- summary(model)
model_tidy <- tidy(model, conf.int = TRUE)
model_glance <- glance(model)

htmltools::browsable(htmltools::HTML(sprintf('
<div style="background: var(--bg-tertiary); border-left: 4px solid var(--accent-green); padding: 1rem 1.5rem; border-radius: 6px; margin: 1rem 0;">
  <h4 style="color: var(--accent-green); margin: 0 0 0.5rem;">Model Fit</h4>
  <table style="font-size: 0.95rem; color: var(--text-secondary);">
    <tr><td style="padding: 4px 16px 4px 0;">R&sup2;</td><td>%.3f</td></tr>
    <tr><td style="padding: 4px 16px 4px 0;">Adjusted R&sup2;</td><td>%.3f</td></tr>
    <tr><td style="padding: 4px 16px 4px 0;">F(%d, %d)</td><td>%.2f, p = %.2e</td></tr>
    <tr><td style="padding: 4px 16px 4px 0;">Residual SE</td><td>%.3f</td></tr>
    <tr><td style="padding: 4px 16px 4px 0;">Observations</td><td>%d</td></tr>
  </table>
</div>',
  model_glance$r.squared, model_glance$adj.r.squared,
  model_glance$df, model_glance$df.residual,
  model_glance$statistic, model_glance$p.value,
  model_glance$sigma, nrow(df_design))))

Model Fit

R²	0.469
Adjusted R²	0.421
F(7, 78)	9.83, p = 1.00e-08
Residual SE	0.947
Observations	86

9.2 Coefficient Table

View Code

model_tidy %>%
  mutate(
    across(c(estimate, std.error, statistic, conf.low, conf.high), ~round(., 4)),
    p.value = format.pval(p.value, digits = 3),
    Significance = case_when(
      as.numeric(p.value) < 0.001 ~ "***",
      as.numeric(p.value) < 0.01  ~ "**",
      as.numeric(p.value) < 0.05  ~ "*",
      as.numeric(p.value) < 0.1   ~ ".",
      TRUE ~ ""
    )
  ) %>%
  select(Term = term, Estimate = estimate, `Std. Error` = std.error,
         `t-value` = statistic, `p-value` = p.value, Significance,
         `95% CI Low` = conf.low, `95% CI High` = conf.high) %>%
  kbl(caption = "Regression Coefficients — log(Project Value)") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = TRUE, font_size = 12) %>%
  row_spec(0, bold = TRUE, color = "#00d4ff", background = "#1c2333")

Regression Coefficients — log(Project Value)
Term	Estimate	Std. Error	t-value	p-value	Significance	95% CI Low	95% CI High
(Intercept)	-174.4380	55.7227	-3.1305	0.00246	**	-285.3735	-63.5025
Design_Duration_Days	0.0027	0.0014	1.9967	0.04935	*	0.0000	0.0054
Revision_Count	0.3238	0.0693	4.6744	1.21e-05	***	0.1859	0.4616
Project_TypeIndustrial	0.6776	0.4196	1.6149	0.11037		-0.1577	1.5129
Project_TypeResidential	-0.1796	0.2590	-0.6937	0.48993		-0.6952	0.3359
DisciplineM Only	-0.1583	0.7053	-0.2245	0.82298		-1.5624	1.2458
DisciplineMEP	0.3953	0.4408	0.8967	0.37264		-0.4823	1.2729
Project_Year	0.0941	0.0276	3.4113	0.00103	**	0.0392	0.1490

9.3 Coefficient Plot

View Code

model_tidy %>%
  filter(term != "(Intercept)") %>%
  mutate(term = fct_reorder(term, estimate)) %>%
  ggplot(aes(x = estimate, y = term)) +
  geom_vline(xintercept = 0, colour = "#8b949e", linetype = "dashed") +
  geom_point(colour = "#00d4ff", size = 3) +
  geom_errorbarh(aes(xmin = conf.low, xmax = conf.high),
                 height = 0.2, colour = "#00d4ff", linewidth = 0.8) +
  labs(title = "Regression Coefficient Estimates",
       subtitle = "log(Project Value) model — points show estimate, bars show 95% CI",
       x = "Estimate (log scale)", y = NULL)

Coefficient estimates with 95% confidence intervals — terms that cross zero are not statistically significant

9.4 Diagnostic Plots

View Code

par(mfrow = c(2, 2),
    bg = "#161b22", fg = "#c9d1d9",
    col.axis = "#8b949e", col.lab = "#c9d1d9", col.main = "#f0f6fc")
plot(model, col = "#00d4ff", pch = 16, cex = 0.8,
     sub.caption = "")

Regression diagnostic suite — checking linearity, normality, homoscedasticity, and influential points

View Code

# Variance Inflation Factors
vif_vals <- car::vif(model)
vif_df <- if (is.matrix(vif_vals)) {
  data.frame(Term = rownames(vif_vals), GVIF = round(vif_vals[, 1], 2),
             Df = vif_vals[, 2], `GVIF^(1/2Df)` = round(vif_vals[, 3], 2))
} else {
  data.frame(Term = names(vif_vals), VIF = round(vif_vals, 2))
}

vif_df %>%
  kbl(caption = "Variance Inflation Factors — checking for multicollinearity") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
  row_spec(0, bold = TRUE, color = "#00d4ff", background = "#1c2333")

Variance Inflation Factors — checking for multicollinearity
	Term	GVIF	Df	GVIF..1.2Df.
Design_Duration_Days	Design_Duration_Days	1.46	1	1.21
Revision_Count	Revision_Count	1.42	1	1.19
Project_Type	Project_Type	1.19	2	1.05
Discipline	Discipline	1.07	2	1.02
Project_Year	Project_Year	1.05	1	1.03

9.5 Influential Observations — Cook’s Distance

View Code

cooks_d <- cooks.distance(model)
n <- nrow(df_design)
k <- length(coef(model))
threshold <- 4 / n  # common threshold

cooks_df <- data.frame(
  Index = 1:n,
  Project_ID = df_design$Project_ID,
  Cooks_D = cooks_d,
  Influential = cooks_d > threshold
)

ggplot(cooks_df, aes(x = Index, y = Cooks_D)) +
  geom_segment(aes(xend = Index, y = 0, yend = Cooks_D,
                    colour = Influential), linewidth = 0.7) +
  geom_point(aes(colour = Influential), size = 2) +
  geom_hline(yintercept = threshold, colour = "#ffbe0b", linetype = "dashed", linewidth = 0.7) +
  annotate("text", x = n * 0.85, y = threshold + max(cooks_d) * 0.05,
           label = paste0("Threshold = 4/n = ", round(threshold, 3)),
           colour = "#ffbe0b", size = 3.5) +
  scale_colour_manual(values = c("FALSE" = "#00d4ff", "TRUE" = "#ff006e"),
                      labels = c("Normal", "Influential")) +
  labs(title = "Cook's Distance — Influential Observation Detection",
       subtitle = sprintf("%d of %d observations exceed the 4/n threshold",
                          sum(cooks_df$Influential), n),
       x = "Observation Index", y = "Cook's Distance", colour = NULL)

Cook’s Distance identifies observations with outsized influence on regression coefficients

View Code

# Show the most influential observations
cooks_df %>%
  filter(Influential) %>%
  left_join(df_design %>% mutate(Index = row_number()) %>%
              select(Index, Project_ID, Project_Type, Value_Millions, Design_Duration_Days),
            by = c("Index", "Project_ID")) %>%
  arrange(desc(Cooks_D)) %>%
  mutate(Cooks_D = round(Cooks_D, 4), Value_Millions = round(Value_Millions, 1)) %>%
  head(10) %>%
  kbl(caption = "Most Influential Observations — these projects disproportionately shape the regression line") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
  row_spec(0, bold = TRUE, color = "#00d4ff", background = "#1c2333")

Most Influential Observations — these projects disproportionately shape the regression line
Index	Project_ID	Cooks_D	Influential	Project_Type	Value_Millions	Design_Duration_Days
31	CA639	0.1124	TRUE	Commercial	2.9	4
64	CA610	0.1110	TRUE	Commercial	0.3	5
81	CA509	0.0888	TRUE	Commercial	2.7	40
77	CA594	0.0828	TRUE	Commercial	36.2	40
39	CA042/PJ022	0.0578	TRUE	Commercial	65.2	370
30	CA687	0.0509	TRUE	Industrial	10.9	10
24	CA716	0.0481	TRUE	Industrial	97.2	50
21	CA669	0.0474	TRUE	Industrial	98.1	40

Why Cook’s Distance matters: In a small dataset of 86 observations, a single mega-project can tilt the entire regression line. Identifying influential points ensures our pricing recommendations are not driven by one or two outlier contracts. If removing an influential point materially changes the coefficients, we should investigate whether it represents an unusual circumstance (e.g., a rush project, an atypical client) rather than a general pattern.

9.6 Business Interpretation

Interpreting Key Coefficients (on the log scale → percentage change in value):

Design Duration: Each additional design day is associated with a 0.3%% change in project value, holding other factors constant. For a project with a baseline value of ₦50M, adding 10 design days corresponds to roughly ₦1.4M in additional value.
Revision Count: Each additional revision is associated with a 38.2%% change in project value. Revisions reflect project complexity — more intricate buildings require more design iterations and command higher fees.
Project Year: The year coefficient captures annual growth in project values due to inflation and the firm’s upward market positioning.

Actionable Recommendation: The regression confirms that design duration and scope complexity (revision count) are the strongest predictors of project value after controlling for building type. For proposal pricing, CA Consultants should develop a fee estimation formula anchored on expected design days and anticipated revision rounds — this model provides the empirical basis for such a formula.

View Code

import statsmodels.api as sm
import statsmodels.formula.api as smf

# Prepare data
df_reg_py = df_design_py.copy()
df_reg_py['log_value'] = np.log(df_reg_py['Project_Value'])

# Fit OLS model
formula = 'log_value ~ Design_Duration_Days + Revision_Count + C(Project_Type) + C(Discipline) + Project_Year'
model_py = smf.ols(formula, data=df_reg_py).fit()

# Model fit summary
print("**Model Fit Summary**\n")
fit_df = pd.DataFrame({
    "Metric": ["R-squared", "Adj. R-squared", "F-statistic", "Prob (F-statistic)",
               "Log-Likelihood", "AIC", "BIC", "Observations", "Df Residuals"],
    "Value": [f"{model_py.rsquared:.4f}", f"{model_py.rsquared_adj:.4f}",
              f"{model_py.fvalue:.3f}", f"{model_py.f_pvalue:.2e}",
              f"{model_py.llf:.2f}", f"{model_py.aic:.1f}", f"{model_py.bic:.1f}",
              f"{int(model_py.nobs)}", f"{int(model_py.df_resid)}"]
})
print(fit_df.to_markdown(index=False))

# Coefficient table
print("\n\n**Coefficient Table**\n")
coef_df = pd.DataFrame({
    "Term": model_py.params.index,
    "Coefficient": model_py.params.round(4).values,
    "Std Error": model_py.bse.round(4).values,
    "t-value": model_py.tvalues.round(3).values,
    "p-value": model_py.pvalues.round(4).values,
    "CI Lower": model_py.conf_int()[0].round(3).values,
    "CI Upper": model_py.conf_int()[1].round(3).values,
})
coef_df["Sig"] = coef_df["p-value"].apply(lambda p: "***" if p < 0.001 else "**" if p < 0.01 else "*" if p < 0.05 else "")
print(coef_df.to_markdown(index=False))

# Diagnostics
print("\n\n**Residual Diagnostics**\n")
from scipy.stats import jarque_bera
jb_result = jarque_bera(model_py.resid)
resid_skew = float(model_py.resid.skew())
resid_kurt = float(model_py.resid.kurtosis() + 3)
diag_df = pd.DataFrame({
    "Diagnostic": ["Durbin-Watson", "Jarque-Bera", "Prob(JB)", "Skewness", "Kurtosis", "Condition No."],
    "Value": [f"{sm.stats.durbin_watson(model_py.resid):.3f}",
              f"{jb_result.statistic:.3f}", f"{jb_result.pvalue:.2e}",
              f"{resid_skew:.3f}", f"{resid_kurt:.3f}",
              f"{np.linalg.cond(model_py.model.exog):.0f}"]
})
print(diag_df.to_markdown(index=False))

View Code

fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Residuals vs Fitted
fitted = model_py.fittedvalues
resids = model_py.resid
_ = axes[0].scatter(fitted, resids, alpha=0.6, color='#00d4ff', edgecolors='none')
_ = axes[0].axhline(y=0, color='#ff006e', linestyle='--', alpha=0.7)
_ = axes[0].set_title('Residuals vs Fitted', color='#f0f6fc', fontweight='bold')
_ = axes[0].set_xlabel('Fitted Values')
_ = axes[0].set_ylabel('Residuals')

# Q-Q Plot
from scipy.stats import probplot
_ = probplot(resids, plot=axes[1])
axes[1].get_lines()[0].set_color('#00d4ff')
axes[1].get_lines()[0].set_markersize(5)
axes[1].get_lines()[1].set_color('#ff006e')
_ = axes[1].set_title('Normal Q-Q Plot', color='#f0f6fc', fontweight='bold')

plt.tight_layout()
plt.show()

Residual diagnostic plots (Python/statsmodels)

10. Integrated Findings

The five analytical techniques applied in this study converge on a coherent narrative about CA Consultants’ project portfolio and the drivers of project value:

1. The Portfolio is Concentrated and Skewed. EDA revealed that the firm’s revenue is dominated by a small number of high-value commercial projects. The top quartile of projects by value accounts for a disproportionate share of total portfolio revenue, while the majority of projects fall below ₦60M. This concentration creates both opportunity (premium commercial work drives growth) and risk (losing a single mega-project has outsized impact).

2. Commercial Projects Are Statistically More Valuable. Hypothesis testing confirmed that the observed differences in project value across building types are not due to chance. Commercial projects command significantly higher contract values than industrial or residential engagements. This finding directly supports a business development strategy that prioritises commercial-sector clients while maintaining industrial and residential work for portfolio diversification.

3. Design Complexity Predicts Value. Correlation analysis identified design duration and revision count as the strongest correlates of project value. The regression model quantified this: each additional design day and each additional revision round is associated with a measurable increase in contract value, after controlling for building type and discipline scope.

4. Full-Scope MEP Engagements Outperform. Projects where CA Consultants provides the full MEP package (mechanical, electrical, and plumbing) command higher values than single-discipline (E Only or M Only) engagements. This supports a strategy of upselling comprehensive service packages rather than competing for narrow-scope contracts.

Single Integrated Recommendation

CA Consultants should anchor its growth strategy on full-scope MEP design for commercial buildings, using the regression model’s coefficients to build a transparent, data-driven fee estimation framework. Specifically: (a) prioritise commercial-sector business development, (b) quote fees based on expected design duration and revision complexity using the empirical relationships identified here, and (c) invest in design-phase process efficiency to reduce unnecessary revisions while maintaining the thoroughness that high-value clients expect.

11. Limitations & Further Work

Data limitations:

The dataset contains 100 observations spanning 20 years. While this represents the firm’s full digitised record, the sample size is modest for regression modelling, and earlier records (pre-2015) are sparse.
Missing design duration and revision data for supervision-only projects reduces the effective sample for correlation and regression to 86 observations.
Project value is recorded as total contract value, not broken down by MEP discipline. A discipline-level breakdown would enable more granular cost modelling.
The data captures project-level attributes but not team composition, client characteristics, or external market conditions — all of which likely influence project value.

Methodological limitations:

Hypothesis tests used non-parametric methods due to non-normality, which are more conservative (less statistical power) than parametric alternatives.
The regression model uses cross-sectional data without controls for inflation or market conditions; the Project_Year variable is a rough proxy.
Some project IDs appear multiple times (e.g., different phases or proposals for the same client), which may introduce mild non-independence.

Future work with more data, time, or computing power:

Collect client industry codes and location data (mainland Lagos vs. island) to model geographic and sectoral pricing effects.
Build a time-series model of project pipeline values to forecast quarterly revenue.
Introduce design team size and junior/senior engineer ratio as predictors to model the labour-cost side of project economics.
Develop an interactive dashboard (Shiny or Streamlit) for real-time portfolio monitoring and proposal pricing.

References

Adi, B. (2026). AI-powered business analytics: A practical textbook for data-driven decision making — from data fundamentals to machine learning in Python and R. Lagos Business School / markanalytics.online. https://markanalytics.online

Mba, C. (2026). CA Consultants Limited MEP project register (2006–2026) [Dataset]. Collected from CA Consultants Limited, Project Management Department, Lagos, Nigeria. Data available on request from the author.

R Core Team. (2024). R: A language and environment for statistical computing (Version 4.x). R Foundation for Statistical Computing. https://www.R-project.org/

Van Rossum, G., & Drake, F. L. (2009). Python 3 reference manual. CreateSpace.

Wickham, H., Averick, M., Bryan, J., Chang, W., McGowan, L., François, R., Grolemund, G., Hayes, A., Henry, L., Hester, J., Kuhn, M., Pedersen, T. L., Miller, E., Bache, S. M., Müller, K., Ooms, J., Robinson, D., Seidel, D. P., Spinu, V., … Yutani, H. (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686. https://doi.org/10.21105/joss.01686

Wickham, H. (2016). ggplot2: Elegant graphics for data analysis. Springer. https://doi.org/10.1007/978-3-319-24277-4

Sievert, C. (2020). Interactive web-based data visualization with R, plotly, and shiny. Chapman and Hall/CRC. https://plotly-r.com

Kassambara, A. (2023). ggcorrplot: Visualization of a correlation matrix using ggplot2 (R package). https://CRAN.R-project.org/package=ggcorrplot

Fox, J., & Weisberg, S. (2019). An R companion to applied regression (3rd ed.). Sage. https://socialsciences.mcmaster.ca/jfox/Books/Companion/

McKinney, W. (2010). Data structures for statistical computing in Python. In Proceedings of the 9th Python in Science Conference (pp. 56–61). https://doi.org/10.25080/Majora-92bf1922-00a

Pedersen, T. L. (2024). patchwork: The composer of plots (R package). https://CRAN.R-project.org/package=patchwork

Kassambara, A. (2023). rstatix: Pipe-friendly framework for basic statistical tests (R package). https://CRAN.R-project.org/package=rstatix

Allaire, J. J., Teague, C., Scheidegger, C., Xie, Y., & Dervieux, C. (2022). Quarto (Version 1.x) [Computer software]. https://doi.org/10.5281/zenodo.5960048

Appendix A: GitHub Repository (+5 Bonus Marks)

GitHub Repository

The complete source code and data for this analysis are publicly available at:

https://github.com/chidubemsbox-art/Dubydarego

The repository contains:

CA_Consultants_MEP_Analysis.qmd — the Quarto source document
mep_projects.csv — the anonymised project dataset
README.md — reproduction instructions

To reproduce: clone the repo, open the .qmd in RStudio, and click Render.

Appendix B: AI Usage Statement

This submission was prepared with the assistance of Claude (Anthropic), an AI coding assistant. Claude was used for: (1) generating and debugging R and Python code for data cleaning, statistical tests, and visualisation; (2) formatting the Quarto document structure and custom CSS styling; and (3) suggesting appropriate statistical methods given the data characteristics (e.g., recommending Kruskal-Wallis over ANOVA due to non-normality).

All analytical decisions were made independently by the author, including: the choice of Case Study 1 over alternatives, the selection of project value as the outcome variable, the formulation of both hypotheses based on real business questions at CA Consultants, the interpretation of all statistical outputs, and the integrated recommendation. The data was collected personally from CA Consultants’ internal systems. The author can explain every line of code and every result in this document.

--- title: "Building By The Numbers" subtitle: "What 101 Projects Reveal About Design, Complexity, and Revenue at CA Consultants Limited" author: "Chidubem Mba — Senior Project Engineer & Project Manager" date: today format: html: theme: darkly toc: true toc-depth: 3 toc-title: "Navigation" code-fold: true code-tools: true code-summary: "View Code" self-contained: true smooth-scroll: true fig-align: center fig-width: 10 fig-height: 6 execute: warning: false message: false echo: true --- ```{=html} <style> /* ===== CUSTOM DARK MODERN THEME ===== */ /* Google Fonts */ @import url('https://fonts.googleapis.com/css2?family=Inter:wght@300;400;500;600;700;800&family=JetBrains+Mono:wght@400;500&display=swap'); /* Root Variables */ :root { --bg-primary: #0d1117; --bg-secondary: #161b22; --bg-tertiary: #1c2333; --border-color: #30363d; --border-accent: #58a6ff; --text-primary: #f0f6fc; --text-secondary: #c9d1d9; --text-muted: #8b949e; --accent-cyan: #00d4ff; --accent-magenta: #ff006e; --accent-amber: #ffbe0b; --accent-purple: #7928ca; --accent-green: #00ff87; --gradient-main: linear-gradient(135deg, #00d4ff 0%, #7928ca 50%, #ff006e 100%); --gradient-warm: linear-gradient(135deg, #ff006e 0%, #ffbe0b 100%); --gradient-cool: linear-gradient(135deg, #00d4ff 0%, #00ff87 100%); --shadow-glow: 0 0 30px rgba(0, 212, 255, 0.15); --shadow-card: 0 8px 32px rgba(0, 0, 0, 0.4); } /* Global */ body { font-family: 'Inter', -apple-system, BlinkMacSystemFont, sans-serif !important; background: var(--bg-primary) !important; color: var(--text-secondary) !important; line-height: 1.7; letter-spacing: 0.01em; } /* ===== CONSTRUCTION BLUEPRINT BACKGROUND ===== */ body::before { content: ''; position: fixed; top: 0; left: 0; width: 100%; height: 100%; pointer-events: none; z-index: -1; background: /* Subtle blueprint grid */ linear-gradient(rgba(0,212,255,0.03) 1px, transparent 1px), linear-gradient(90deg, rgba(0,212,255,0.03) 1px, transparent 1px), /* Larger grid squares */ linear-gradient(rgba(0,212,255,0.015) 2px, transparent 2px), linear-gradient(90deg, rgba(0,212,255,0.015) 2px, transparent 2px); background-size: 20px 20px, 20px 20px, 100px 100px, 100px 100px; } /* Faint engineering schematic accents on the side */ body::after { content: ''; position: fixed; top: 0; right: 0; width: 300px; height: 100%; pointer-events: none; z-index: -1; background: radial-gradient(circle at 80% 20%, rgba(0,212,255,0.04) 0%, transparent 50%), radial-gradient(circle at 60% 70%, rgba(121,40,202,0.03) 0%, transparent 40%), radial-gradient(circle at 90% 90%, rgba(255,0,110,0.02) 0%, transparent 30%); } /* ===== DOWNLOAD DATA BUTTON ===== */ .download-data-btn { display: inline-flex; align-items: center; gap: 8px; padding: 10px 22px; background: linear-gradient(135deg, rgba(0,212,255,0.12), rgba(121,40,202,0.10)); border: 1px solid rgba(0,212,255,0.3); border-radius: 8px; color: var(--accent-cyan); font-family: 'Inter', sans-serif; font-size: 0.9rem; font-weight: 600; cursor: pointer; text-decoration: none; transition: all 0.3s ease; margin: 1rem 0; } .download-data-btn:hover { background: linear-gradient(135deg, rgba(0,212,255,0.22), rgba(121,40,202,0.18)); border-color: var(--accent-cyan); box-shadow: 0 0 20px rgba(0,212,255,0.2); transform: translateY(-1px); color: #ffffff; text-decoration: none; } .download-data-btn svg { width: 18px; height: 18px; fill: currentColor; } /* ===== MARGIN GLOW STRIPS + RULER MARKS ===== */ .margin-decor { position: fixed; top: 0; height: 100%; pointer-events: none; z-index: 100; display: flex; } .margin-decor.left { left: 0; } .margin-decor.right { right: 0; } .margin-decor .glow-strip { width: 4px; } .margin-decor.left .glow-strip { background: linear-gradient(180deg, #00d4ff 0%, #7928ca 40%, #ff006e 70%, transparent 100%); box-shadow: 0 0 12px rgba(0,212,255,0.3), 0 0 24px rgba(121,40,202,0.15); } .margin-decor.right .glow-strip { background: linear-gradient(180deg, transparent 0%, #ff006e 30%, #7928ca 60%, #00d4ff 100%); box-shadow: 0 0 12px rgba(255,0,110,0.3), 0 0 24px rgba(121,40,202,0.15); } .margin-decor .ruler { width: 20px; position: relative; background: repeating-linear-gradient(180deg, transparent 0px, transparent 19px, rgba(0,212,255,0.1) 19px, rgba(0,212,255,0.1) 20px), repeating-linear-gradient(180deg, transparent 0px, transparent 79px, rgba(0,212,255,0.22) 79px, rgba(0,212,255,0.22) 80px); } .margin-decor.left .ruler { border-right: 1px solid rgba(0,212,255,0.1); } .margin-decor.right .ruler { border-left: 1px solid rgba(0,212,255,0.1); } .margin-decor .ruler-number { position: absolute; font-size: 7px; color: rgba(0,212,255,0.3); font-family: 'JetBrains Mono', monospace; } .margin-decor.left .ruler-number { right: 4px; } .margin-decor.right .ruler-number { left: 4px; } /* ===== LOGO (positioned next to date) ===== */ .quarto-title-block { position: relative; padding: 1.5rem 0 2rem 0; margin-bottom: 2rem; border-bottom: 2px solid transparent; border-image: var(--gradient-main) 1; } /* ===== FIRM BRANDING BAR ===== */ .firm-branding-bar { background: linear-gradient(135deg, rgba(0,212,255,0.08), rgba(121,40,202,0.06)); border-bottom: 1px solid rgba(0,212,255,0.2); padding: 12px 2rem; display: flex; align-items: center; justify-content: space-between; position: relative; z-index: 10; } .firm-branding-bar .firm-left { display: flex; align-items: center; gap: 14px; } .firm-branding-bar .firm-logo { max-width: 70px; height: auto; border-radius: 4px; filter: drop-shadow(0 2px 8px rgba(0,212,255,0.15)); transition: transform 0.3s ease; } .firm-branding-bar .firm-logo:hover { transform: scale(1.05); } .firm-branding-bar .firm-name { font-family: 'Inter', sans-serif; font-weight: 700; font-size: 1.1rem; color: var(--accent-cyan); letter-spacing: 0.02em; } .firm-branding-bar .firm-tagline { font-size: 0.78rem; color: var(--text-muted); font-style: italic; margin-top: 2px; } .firm-branding-bar .firm-location { font-size: 0.78rem; color: var(--text-muted); letter-spacing: 0.02em; } .quarto-title .title { font-family: 'Inter', sans-serif !important; font-weight: 800 !important; font-size: 2.8rem !important; background: var(--gradient-main); -webkit-background-clip: text; -webkit-text-fill-color: transparent; background-clip: text; line-height: 1.2 !important; margin-bottom: 0.8rem !important; } .quarto-title .subtitle { font-size: 1.15rem !important; font-style: italic !important; font-weight: 300 !important; letter-spacing: 0.04em; text-transform: none !important; background: linear-gradient(90deg, var(--accent-cyan), var(--accent-amber), var(--accent-magenta)); -webkit-background-clip: text; -webkit-text-fill-color: transparent; background-clip: text; background-size: 200% auto; animation: subtitleShimmer 4s ease-in-out infinite; } @keyframes subtitleShimmer { 0% { background-position: 0% center; } 50% { background-position: 100% center; } 100% { background-position: 0% center; } } /* Section Headings */ h1, .h1 { font-family: 'Inter', sans-serif !important; font-weight: 700 !important; font-size: 2rem !important; background: var(--gradient-main); -webkit-background-clip: text; -webkit-text-fill-color: transparent; background-clip: text; padding-top: 1.5rem; margin-top: 3rem !important; border-top: 1px solid var(--border-color); } h2, .h2 { font-family: 'Inter', sans-serif !important; font-weight: 600 !important; font-size: 1.5rem !important; color: var(--accent-cyan) !important; margin-top: 2rem !important; } h3, .h3 { font-family: 'Inter', sans-serif !important; font-weight: 600 !important; font-size: 1.2rem !important; color: var(--text-primary) !important; } /* Main Content Container */ #quarto-content { background: var(--bg-primary); } .content { max-width: 1100px; } /* Cards / Section Containers */ .card-section { background: var(--bg-secondary); border: 1px solid var(--border-color); border-radius: 12px; padding: 2rem; margin: 1.5rem 0; box-shadow: var(--shadow-card); transition: border-color 0.3s ease, box-shadow 0.3s ease; } .card-section:hover { border-color: var(--border-accent); box-shadow: var(--shadow-glow); } /* Metric Cards Grid */ .metric-grid { display: grid; grid-template-columns: repeat(auto-fit, minmax(200px, 1fr)); gap: 1.2rem; margin: 2rem 0; } .metric-card { background: var(--bg-secondary); border: 1px solid var(--border-color); border-radius: 12px; padding: 1.5rem; text-align: center; position: relative; overflow: hidden; transition: all 0.3s ease; } .metric-card::before { content: ''; position: absolute; top: 0; left: 0; right: 0; height: 3px; background: var(--gradient-main); } .metric-card:hover { transform: translateY(-4px); box-shadow: var(--shadow-glow); border-color: var(--accent-cyan); } .metric-label { display: block; font-size: 0.8rem; text-transform: uppercase; letter-spacing: 0.1em; color: var(--text-muted); margin-bottom: 0.5rem; font-weight: 500; } .metric-value { display: block; font-size: 2rem; font-weight: 700; background: var(--gradient-main); -webkit-background-clip: text; -webkit-text-fill-color: transparent; background-clip: text; } .metric-sub { display: block; font-size: 0.85rem; color: var(--text-muted); margin-top: 0.3rem; } /* Callout Boxes */ .callout { background: var(--bg-secondary) !important; border-radius: 10px !important; border-left: 4px solid var(--accent-cyan) !important; box-shadow: var(--shadow-card) !important; } .callout-note { border-left-color: var(--accent-cyan) !important; } .callout-important { border-left-color: var(--accent-magenta) !important; } .callout-tip { border-left-color: var(--accent-green) !important; } /* Tab Styling */ .nav-tabs { border-bottom: 2px solid var(--border-color) !important; margin-bottom: 1.5rem; } .nav-tabs .nav-link { font-family: 'Inter', sans-serif !important; font-weight: 600 !important; font-size: 0.95rem !important; color: var(--text-muted) !important; border: none !important; padding: 0.8rem 1.5rem !important; border-radius: 8px 8px 0 0 !important; transition: all 0.3s ease !important; text-transform: uppercase; letter-spacing: 0.05em; } .nav-tabs .nav-link:hover { color: var(--accent-cyan) !important; background: rgba(0, 212, 255, 0.08) !important; } .nav-tabs .nav-link.active { color: var(--accent-cyan) !important; background: var(--bg-secondary) !important; border-bottom: 3px solid var(--accent-cyan) !important; } /* Code Blocks */ pre { background: var(--bg-primary) !important; border: 1px solid var(--border-color) !important; border-radius: 10px !important; padding: 1.2rem !important; font-family: 'JetBrains Mono', 'Fira Code', monospace !important; font-size: 0.85rem !important; box-shadow: inset 0 2px 8px rgba(0,0,0,0.3); } code { font-family: 'JetBrains Mono', 'Fira Code', monospace !important; } .code-fold-btn-container .btn { font-family: 'Inter', sans-serif !important; font-size: 0.8rem !important; color: var(--accent-cyan) !important; border-color: var(--border-color) !important; background: var(--bg-secondary) !important; border-radius: 6px !important; } /* Tables */ .table { background: var(--bg-secondary) !important; color: var(--text-secondary) !important; border-color: var(--border-color) !important; } .table thead th { background: var(--bg-tertiary) !important; color: var(--accent-cyan) !important; border-color: var(--border-color) !important; font-weight: 600; text-transform: uppercase; font-size: 0.8rem; letter-spacing: 0.05em; } .table td { border-color: var(--border-color) !important; } .dataTables_wrapper { color: var(--text-secondary) !important; } .dataTables_wrapper .dataTables_filter input, .dataTables_wrapper .dataTables_length select { background: var(--bg-primary) !important; color: var(--text-secondary) !important; border: 1px solid var(--border-color) !important; border-radius: 6px !important; } .dataTables_wrapper .dataTables_paginate .paginate_button { color: var(--text-secondary) !important; } .dataTables_wrapper .dataTables_paginate .paginate_button.current { background: var(--accent-cyan) !important; color: var(--bg-primary) !important; border: none !important; border-radius: 6px !important; } /* TOC Sidebar */ #TOC { background: var(--bg-secondary) !important; border-right: 1px solid var(--border-color) !important; padding: 1rem !important; } #TOC a { color: var(--text-muted) !important; transition: color 0.2s ease; } #TOC a:hover, #TOC a.active { color: var(--accent-cyan) !important; } /* Scrollbar */ ::-webkit-scrollbar { width: 8px; height: 8px; } ::-webkit-scrollbar-track { background: var(--bg-primary); } ::-webkit-scrollbar-thumb { background: var(--border-color); border-radius: 4px; } ::-webkit-scrollbar-thumb:hover { background: var(--accent-cyan); } /* Plotly toolbar */ .modebar { background: transparent !important; } /* Smooth section reveals */ section { animation: fadeInUp 0.6s ease-out; } @keyframes fadeInUp { from { opacity: 0; transform: translateY(20px); } to { opacity: 1; transform: translateY(0); } } /* Business insight callout */ .insight-box { background: linear-gradient(135deg, rgba(0,212,255,0.08), rgba(121,40,202,0.08)); border: 1px solid rgba(0,212,255,0.3); border-radius: 10px; padding: 1.5rem; margin: 1.5rem 0; position: relative; } .insight-box::before { content: '💡'; position: absolute; top: -12px; left: 16px; font-size: 1.5rem; background: var(--bg-primary); padding: 0 0.5rem; } /* Hypothesis result box */ .result-box { background: var(--bg-secondary); border-left: 4px solid var(--accent-green); border-radius: 0 10px 10px 0; padding: 1.2rem 1.5rem; margin: 1rem 0; } .result-box.reject { border-left-color: var(--accent-magenta); } /* Image and figure styling */ .quarto-figure { background: var(--bg-secondary); border-radius: 12px; padding: 1rem; border: 1px solid var(--border-color); } /* Blockquotes */ blockquote { border-left: 3px solid var(--accent-purple) !important; background: rgba(121, 40, 202, 0.05) !important; padding: 1rem 1.5rem !important; border-radius: 0 8px 8px 0 !important; font-style: italic; color: var(--text-muted) !important; } /* Selection colour */ ::selection { background: rgba(0, 212, 255, 0.3); color: var(--text-primary); } /* ===== SCROLL PROGRESS BAR ===== */ .progress-bar-container { position: fixed; top: 0; left: 0; width: 100%; height: 3px; z-index: 9999; background: var(--bg-primary); } .progress-bar-fill { height: 100%; width: 0%; background: var(--gradient-main); transition: width 0.1s ease-out; border-radius: 0 2px 2px 0; box-shadow: 0 0 10px rgba(0, 212, 255, 0.5); } /* ===== BACK TO TOP BUTTON ===== */ .back-to-top { position: fixed; bottom: 2rem; right: 2rem; width: 48px; height: 48px; border-radius: 50%; background: var(--gradient-main); border: none; cursor: pointer; display: flex; align-items: center; justify-content: center; opacity: 0; visibility: hidden; transform: translateY(20px); transition: all 0.3s ease; z-index: 999; box-shadow: 0 4px 15px rgba(0, 212, 255, 0.3); } .back-to-top.visible { opacity: 1; visibility: visible; transform: translateY(0); } .back-to-top:hover { transform: translateY(-3px); box-shadow: 0 6px 25px rgba(0, 212, 255, 0.5); } .back-to-top svg { width: 24px; height: 24px; fill: white; } /* ===== HERO BANNER ===== */ .hero-banner { position: relative; padding: 2rem; margin: -1rem -1rem 2rem -1rem; border-radius: 16px; background: linear-gradient(-45deg, #0d1117, #161b22, #1a1040, #0d2137); background-size: 400% 400%; animation: gradientShift 8s ease infinite; border: 1px solid var(--border-color); overflow: hidden; } .hero-banner::before { content: ''; position: absolute; top: -50%; left: -50%; width: 200%; height: 200%; background: radial-gradient(circle, rgba(0,212,255,0.05) 0%, transparent 50%); animation: pulseGlow 4s ease-in-out infinite; } @keyframes gradientShift { 0% { background-position: 0% 50%; } 50% { background-position: 100% 50%; } 100% { background-position: 0% 50%; } } @keyframes pulseGlow { 0%, 100% { transform: scale(1); opacity: 0.5; } 50% { transform: scale(1.1); opacity: 1; } } /* ===== ANIMATED COUNTER ===== */ .metric-value[data-target] { font-variant-numeric: tabular-nums; } /* ===== ENHANCED SECTION TRANSITIONS ===== */ .section-enter { opacity: 0; transform: translateY(30px); transition: opacity 0.6s ease-out, transform 0.6s ease-out; } .section-enter.visible { opacity: 1; transform: translateY(0); } /* ===== TECHNIQUE BADGE ===== */ .technique-badge { display: inline-block; padding: 0.3rem 1rem; border-radius: 20px; font-size: 0.75rem; font-weight: 600; text-transform: uppercase; letter-spacing: 0.08em; margin-bottom: 1rem; } .badge-eda { background: rgba(0,212,255,0.15); color: #00d4ff; border: 1px solid rgba(0,212,255,0.3); } .badge-viz { background: rgba(255,0,110,0.15); color: #ff006e; border: 1px solid rgba(255,0,110,0.3); } .badge-hyp { background: rgba(255,190,11,0.15); color: #ffbe0b; border: 1px solid rgba(255,190,11,0.3); } .badge-cor { background: rgba(121,40,202,0.15); color: #a855f7; border: 1px solid rgba(121,40,202,0.3); } .badge-reg { background: rgba(0,255,135,0.15); color: #00ff87; border: 1px solid rgba(0,255,135,0.3); } /* ===== STAT RESULT HIGHLIGHT ===== */ .stat-highlight { display: inline-block; background: rgba(0,212,255,0.1); border: 1px solid rgba(0,212,255,0.2); padding: 0.2rem 0.6rem; border-radius: 6px; font-family: 'JetBrains Mono', monospace; font-size: 0.9rem; color: var(--accent-cyan); } /* ===== CROSSTALK FILTER STYLING ===== */ .crosstalk-input-select select, .crosstalk-input-checkboxgroup label { background: var(--bg-primary) !important; color: var(--text-secondary) !important; border: 1px solid var(--border-color) !important; border-radius: 6px; padding: 0.4rem; } .crosstalk-input-select label, .crosstalk-input-checkboxgroup label { color: var(--text-muted) !important; font-weight: 500; text-transform: uppercase; font-size: 0.75rem; letter-spacing: 0.05em; } /* ===== PROFESSIONAL FOOTER ===== */ .document-footer { margin-top: 4rem; padding: 2rem 0; border-top: 2px solid transparent; border-image: var(--gradient-main) 1; text-align: center; color: var(--text-muted); font-size: 0.9rem; line-height: 1.8; } .document-footer .footer-name { font-weight: 700; font-size: 1.05rem; background: var(--gradient-main); -webkit-background-clip: text; -webkit-text-fill-color: transparent; background-clip: text; } .document-footer .footer-firm { color: var(--accent-cyan); font-weight: 500; } .document-footer .footer-divider { width: 80px; height: 2px; background: var(--gradient-main); margin: 1rem auto; border-radius: 2px; } .document-footer .footer-programme { font-style: italic; color: var(--text-secondary); font-size: 0.85rem; max-width: 600px; margin: 0.5rem auto 0; } /* Print-friendly overrides */ @media print { body { background: white !important; color: black !important; } h1, h2, h3 { -webkit-text-fill-color: #1a1a2e !important; } .metric-value { -webkit-text-fill-color: #1a1a2e !important; } .progress-bar-container, .back-to-top, .margin-decor { display: none !important; } .document-footer .footer-name { -webkit-text-fill-color: #1a1a2e !important; } } </style> ``` ```{r} #| label: embed-logo #| echo: false #| results: asis logo_path <- "ca_logo.png" if (file.exists(logo_path)) { logo_b64 <- base64enc::base64encode(logo_path) logo_uri <- paste0("data:image/png;base64,", logo_b64) } else { logo_uri <- "" } htmltools::browsable(htmltools::HTML(sprintf( '<script>var caLogoDataURI = "%s";</script>', logo_uri ))) ``` ```{=html}  <div class="progress-bar-container"><div class="progress-bar-fill" id="progressBar"></div></div>  <button class="back-to-top" id="backToTop" onclick="window.scrollTo({top:0,behavior:'smooth'})"> <svg viewBox="0 0 24 24"><path d="M12 4l-8 8h5v8h6v-8h5z"/></svg> </button> <script> // ---- Scroll Progress Bar ---- window.addEventListener('scroll', function() { var scrollTop = window.scrollY; var docHeight = document.documentElement.scrollHeight - window.innerHeight; var progress = (scrollTop / docHeight) * 100; document.getElementById('progressBar').style.width = progress + '%'; // Back to top visibility var btn = document.getElementById('backToTop'); if (scrollTop > 500) { btn.classList.add('visible'); } else { btn.classList.remove('visible'); } }); // ---- Animated Counters ---- function animateCounter(el) { var target = parseFloat(el.getAttribute('data-target')); var prefix = el.getAttribute('data-prefix') || ''; var suffix = el.getAttribute('data-suffix') || ''; var decimals = parseInt(el.getAttribute('data-decimals')) || 0; var duration = 1500; var startTime = null; function step(timestamp) { if (!startTime) startTime = timestamp; var progress = Math.min((timestamp - startTime) / duration, 1); var eased = 1 - Math.pow(1 - progress, 3); // ease-out cubic var current = eased * target; el.textContent = prefix + current.toFixed(decimals) + suffix; if (progress < 1) requestAnimationFrame(step); } requestAnimationFrame(step); } // Intersection Observer for counters var counterObserver = new IntersectionObserver(function(entries) { entries.forEach(function(entry) { if (entry.isIntersecting && !entry.target.classList.contains('counted')) { entry.target.classList.add('counted'); animateCounter(entry.target); } }); }, { threshold: 0.5 }); document.addEventListener('DOMContentLoaded', function() { // Inject margin glow strips + ruler marks function createMarginDecor(side) { var decor = document.createElement('div'); decor.className = 'margin-decor ' + side; var glow = document.createElement('div'); glow.className = 'glow-strip'; var ruler = document.createElement('div'); ruler.className = 'ruler'; for (var i = 1; i <= 20; i++) { var num = document.createElement('span'); num.className = 'ruler-number'; num.style.top = (i * 80 - 2) + 'px'; num.textContent = i; ruler.appendChild(num); } if (side === 'left') { decor.appendChild(glow); decor.appendChild(ruler); } else { decor.appendChild(ruler); decor.appendChild(glow); } document.body.appendChild(decor); } createMarginDecor('left'); createMarginDecor('right'); // Inject firm branding bar at the very top of the page var brandingBar = document.createElement('div'); brandingBar.className = 'firm-branding-bar'; brandingBar.innerHTML = '' + '<div class="firm-left">' + '<img src="' + (typeof caLogoDataURI !== 'undefined' && caLogoDataURI ? caLogoDataURI : 'ca_logo.png') + '" alt="CA Consultants Limited" class="firm-logo" />' + '<div>' + '<div class="firm-name">CA Consultants Limited</div>' + '<div class="firm-tagline">MEP Engineering • Design • Supervision</div>' + '</div>' + '</div>' + '<div class="firm-location">Lagos, Nigeria</div>'; var mainContent = document.querySelector('#quarto-content') || document.querySelector('main') || document.body.firstChild; if (mainContent) { mainContent.parentNode.insertBefore(brandingBar, mainContent); } else { document.body.insertBefore(brandingBar, document.body.firstChild); } // Animated counters document.querySelectorAll('.metric-value[data-target]').forEach(function(el) { counterObserver.observe(el); }); // Section enter animations var sectionObserver = new IntersectionObserver(function(entries) { entries.forEach(function(entry) { if (entry.isIntersecting) entry.target.classList.add('visible'); }); }, { threshold: 0.1 }); document.querySelectorAll('section').forEach(function(s) { s.classList.add('section-enter'); sectionObserver.observe(s); }); }); </script> ```    ```{r} #| label: setup #| include: false # ---- Install missing packages (runs silently) ---- required_pkgs <- c( "tidyverse", "plotly", "DT", "ggcorrplot", "car", "broom", "scales", "kableExtra", "patchwork", "rstatix", "moments", "viridis", "effectsize", "crosstalk", "htmltools", "base64enc" ) invisible(lapply(required_pkgs, function(pkg) { if (!requireNamespace(pkg, quietly = TRUE)) install.packages(pkg, quiet = TRUE) })) # ---- Load libraries ---- library(tidyverse) library(plotly) library(DT) library(ggcorrplot) library(car) library(broom) library(scales) library(kableExtra) library(patchwork) library(rstatix) library(moments) library(viridis) library(effectsize) # ---- Custom ggplot2 dark theme ---- theme_ca_dark <- function(base_size = 13) { theme_minimal(base_size = base_size) %+replace% theme( plot.background = element_rect(fill = "#161b22", colour = NA), panel.background = element_rect(fill = "#161b22", colour = NA), panel.grid.major = element_line(colour = "#30363d", linewidth = 0.3), panel.grid.minor = element_blank(), text = element_text(colour = "#c9d1d9", family = "sans"), plot.title = element_text(colour = "#f0f6fc", size = 16, face = "bold", margin = margin(b = 8)), plot.subtitle = element_text(colour = "#8b949e", size = 12, margin = margin(b = 12)), plot.caption = element_text(colour = "#8b949e", size = 9, hjust = 1), axis.text = element_text(colour = "#8b949e", size = 10), axis.title = element_text(colour = "#c9d1d9", size = 11), legend.background = element_rect(fill = "#161b22", colour = NA), legend.key = element_rect(fill = "#161b22", colour = NA), legend.text = element_text(colour = "#c9d1d9"), legend.title = element_text(colour = "#f0f6fc", face = "bold"), strip.text = element_text(colour = "#f0f6fc", face = "bold", size = 11), plot.margin = margin(15, 15, 15, 15) ) } # ---- Colour palettes ---- pal_type <- c("Commercial" = "#00d4ff", "Industrial" = "#ff006e", "Residential" = "#ffbe0b") pal_status <- c("Completed" = "#00ff87", "Ongoing" = "#ffbe0b", "Proposal" = "#8b949e") pal_disc <- c("MEP" = "#00d4ff", "E Only" = "#ff006e", "M Only" = "#ffbe0b") ca_accent <- c("#00d4ff", "#ff006e", "#ffbe0b", "#7928ca", "#00ff87", "#ff6b35") # Apply theme globally theme_set(theme_ca_dark()) ``` ```{r} #| label: load-data #| include: false # ---- Load and clean data ---- df_raw <- read_csv("mep_projects.csv", show_col_types = FALSE) df <- df_raw %>% mutate( # Handle Project_Value_NGN whether read_csv parsed it as character or numeric Project_Value = if (is.character(Project_Value_NGN)) parse_number(Project_Value_NGN) else as.numeric(Project_Value_NGN), Design_Duration_Days = as.numeric(Design_Duration_Days), Revision_Count = as.numeric(Revision_Count), Project_Year = as.integer(Project_Year), Project_Type = factor(Project_Type, levels = c("Commercial", "Industrial", "Residential")), Project_Status = factor(Project_Status, levels = c("Completed", "Ongoing", "Proposal")), Discipline = factor(Discipline), # Derived variables Value_Millions = Project_Value / 1e6, Log_Value = log(Project_Value), Era = case_when( Project_Year <= 2017 ~ "Early (2006-2017)", Project_Year <= 2022 ~ "Growth (2018-2022)", TRUE ~ "Recent (2023-2026)" ), Era = factor(Era, levels = c("Early (2006-2017)", "Growth (2018-2022)", "Recent (2023-2026)")), Is_Design = !is.na(Design_Duration_Days), Value_Category = case_when( Value_Millions < 15 ~ "Small (<₦15M)", Value_Millions < 60 ~ "Medium (₦15-60M)", Value_Millions < 150 ~ "Large (₦60-150M)", TRUE ~ "Mega (>₦150M)" ), Value_Category = factor(Value_Category, levels = c( "Small (<₦15M)", "Medium (₦15-60M)", "Large (₦60-150M)", "Mega (>₦150M)" )) ) # Subset with complete design data df_design <- df %>% filter(!is.na(Design_Duration_Days) & !is.na(Revision_Count)) ```    # 1. Executive Summary This study analyses the project portfolio of **CA Consultants Limited**, a leading MEP (Mechanical, Electrical, and Plumbing) consulting firm headquartered in Lagos, Nigeria. Drawing on `r nrow(df)` real project records spanning `r min(df$Project_Year, na.rm=TRUE)`--`r max(df$Project_Year, na.rm=TRUE)`, extracted from the firm's internal project management systems, I investigate **what drives project value** across different building types, service disciplines, and design complexity levels. Five foundational analytical techniques --- exploratory data analysis, data visualisation, hypothesis testing, correlation analysis, and multiple linear regression --- are applied to move from exploration to inference. Key findings reveal that the portfolio is heavily right-skewed, with a small number of high-value projects contributing disproportionately to total revenue. Commercial projects command significantly higher values than residential and industrial engagements, and design duration together with revision count are meaningful predictors of project value. The analysis yields an actionable pricing and resource-allocation framework: **longer-duration, multi-discipline MEP projects consistently deliver greater value, and the firm should prioritise commercial engagements while investing in design-phase efficiency to reduce costly revision cycles.**    # 2. Professional Disclosure **Name:** Chidubem Mba **Title:** Senior Project Engineer & Project Manager **Organisation:** CA Consultants Limited, Lagos, Nigeria **Sector:** Construction / Building Services (MEP Engineering) CA Consultants Limited is a Mechanical, Electrical, and Plumbing (MEP) consulting firm operating in the Nigerian construction industry. The firm provides design and construction supervision services for commercial, industrial, and residential buildings --- from office towers and factories to luxury residences. As a Senior Project Engineer and Project Manager, I am responsible for managing design deliverables, coordinating with architects and contractors, overseeing project timelines, and ensuring quality of MEP installations. ### Technique Justification **Exploratory Data Analysis (EDA):** Understanding the composition and quality of our project data is the essential first step. As a project manager, I need to know the distribution of project sizes, identify data gaps (such as supervision-only projects lacking design records), and detect anomalous entries before any strategic conclusions can be drawn. EDA allows me to assess the health of our data and uncover patterns that inform all subsequent analyses. **Data Visualisation:** Communicating project performance to firm leadership and clients requires clear, compelling visual narratives. Visualisation transforms raw project records into actionable intelligence --- showing at a glance which sectors drive revenue, how portfolio composition has shifted over time, and where design complexity concentrates. Every plot in this study answers a question a principal at CA Consultants would ask in a management meeting. **Hypothesis Testing:** Resource allocation decisions at CA Consultants --- such as whether to pursue more commercial versus industrial work --- must be grounded in statistical evidence, not intuition. Formal hypothesis testing allows me to determine whether observed differences in project value across building types are statistically significant or merely due to sampling variability. This directly informs our business development strategy. **Correlation Analysis:** Understanding the relationships between project characteristics --- value, design duration, revision count, and scope --- is critical for project planning. If design duration correlates strongly with project value, that relationship shapes how we quote fees and allocate engineering hours. Identifying which correlations are strongest, and which are spurious, helps avoid costly planning errors. **Linear Regression:** The ultimate goal is a quantitative model that explains and predicts project value based on measurable project attributes. A regression model allows me to quantify the marginal contribution of each extra design day, each additional revision, and each project type to the contract value. This directly supports fee-setting, proposal evaluation, and capacity planning at CA Consultants.    # 3. Data Collection & Sampling ### Source and Collection Method The dataset was extracted directly from the **internal project register** of CA Consultants Limited. This register is maintained by the project management team (of which I am a member) and records every project the firm undertakes --- from initial proposal through to completion. Data extraction was performed manually from the company's enterprise filing system and project tracking spreadsheets in April 2026. ### Sampling Frame and Size The dataset constitutes a **census of all identifiable projects** in the firm's records from 2006 to 2026, comprising **`r nrow(df)` project records**. This is not a sample but a near-complete enumeration of the firm's project history, subject to the availability of records in the company's systems. Projects prior to 2006 were excluded due to incomplete digitisation of older records. ### Variables Collected | Variable | Type | Description | |:---|:---|:---| | `Project_ID` | Identifier | Unique project code assigned by the firm | | `Project_Type` | Categorical | Building type: Commercial, Industrial, or Residential | | `Project_Status` | Categorical | Current status: Completed, Ongoing, or Proposal | | `Project_Value_NGN` | Numeric (continuous) | Contract value in Nigerian Naira (₦) | | `Design_Duration_Days` | Numeric (continuous) | Duration of the design phase in calendar days | | `Revision_Count` | Numeric (discrete) | Number of design revisions issued | | `Project_Year` | Date/Time (year) | Year the project was initiated | | `Discipline` | Categorical | Scope of MEP services: MEP (full), E Only, M Only | ### Handling of Missing Data A total of **`r sum(is.na(df$Design_Duration_Days))` observations** are missing `Design_Duration_Days` and **`r sum(is.na(df$Revision_Count))` observations** are missing `Revision_Count`. These are **structurally missing** --- not data-entry errors. These projects were **construction supervision engagements only**: CA Consultants supervised the on-site installation of MEP systems designed by another firm. Since no design work was performed in-house, no design duration or revision count exists. This is documented and handled transparently throughout the analysis --- these records are retained for portfolio-level analyses (value, type, year) and excluded only from design-specific analyses (correlation, regression). ### Ethical Statement This data was collected in my capacity as a Senior Project Engineer with authorised access to the firm's project records. All project identifiers use internal codes (e.g., CA658, CA042/PJ027) that do not reveal client names. Financial values are contract-level figures, not client-confidential breakdowns. No personally identifiable information (PII) is included. Permission to use anonymised project data for academic purposes was obtained from CA Consultants management.    # 4. Data Description ```{r} #| label: data-metrics #| echo: false n_proj <- nrow(df) total_val <- sum(df$Project_Value, na.rm = TRUE) avg_val <- mean(df$Project_Value, na.rm = TRUE) med_val <- median(df$Project_Value, na.rm = TRUE) n_completed <- sum(df$Project_Status == "Completed") avg_dur <- mean(df$Design_Duration_Days, na.rm = TRUE) htmltools::browsable(htmltools::HTML(sprintf(' <div class="hero-banner"> <div class="metric-grid"> <div class="metric-card"> <span class="metric-label">Total Projects</span> <span class="metric-value" data-target="%d" data-decimals="0">0</span> <span class="metric-sub">2006 — 2026</span> </div> <div class="metric-card"> <span class="metric-label">Portfolio Value</span> <span class="metric-value" data-target="%.1f" data-prefix="₦" data-suffix="B" data-decimals="1">0</span> <span class="metric-sub">Aggregate Contract Value</span> </div> <div class="metric-card"> <span class="metric-label">Median Project Value</span> <span class="metric-value" data-target="%.1f" data-prefix="₦" data-suffix="M" data-decimals="1">0</span> <span class="metric-sub">Central tendency</span> </div> <div class="metric-card"> <span class="metric-label">Completed Projects</span> <span class="metric-value" data-target="%d" data-decimals="0">0</span> <span class="metric-sub">%.0f%%%% completion rate</span> </div> <div class="metric-card"> <span class="metric-label">Avg. Design Duration</span> <span class="metric-value" data-target="%.0f" data-suffix=" days" data-decimals="0">0</span> <span class="metric-sub">Design-phase projects</span> </div> </div> </div> ', n_proj, total_val/1e9, med_val/1e6, n_completed, 100*n_completed/n_proj, avg_dur))) ``` ```{r} #| label: download-btn-description #| echo: false csv_text <- paste(readLines("mep_projects.csv", warn = FALSE), collapse = "\n") csv_b64 <- base64enc::base64encode(charToRaw(csv_text)) htmltools::browsable(htmltools::HTML(sprintf(' <a class="download-data-btn" href="data:text/csv;base64,%s" download="mep_projects.csv"> <svg viewBox="0 0 24 24"><path d="M12 16l-5-5h3V4h4v7h3l-5 5zm-7 2h14v2H5v-2z"/></svg> Download Dataset (CSV) </a> ', csv_b64))) ``` ## 4.1 Summary Statistics ```{r} #| label: summary-stats df %>% select(Project_Value, Design_Duration_Days, Revision_Count, Project_Year) %>% pivot_longer(everything(), names_to = "Variable", values_to = "Value") %>% group_by(Variable) %>% summarise( N = sum(!is.na(Value)), Missing = sum(is.na(Value)), Mean = mean(Value, na.rm = TRUE), Median = median(Value, na.rm = TRUE), SD = sd(Value, na.rm = TRUE), Min = min(Value, na.rm = TRUE), Max = max(Value, na.rm = TRUE), Skewness = skewness(Value, na.rm = TRUE), .groups = "drop" ) %>% mutate(across(where(is.numeric), ~round(., 2))) %>% kbl(caption = "Descriptive Statistics for Numeric Variables", format.args = list(big.mark = ",")) %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = TRUE, font_size = 13) %>% row_spec(0, bold = TRUE, color = "#00d4ff", background = "#1c2333") ``` ## 4.2 Categorical Variable Distribution ```{r} #| label: cat-distribution cat_summary <- bind_rows( df %>% count(Project_Type) %>% rename(Category = Project_Type) %>% mutate(Variable = "Project_Type"), df %>% count(Project_Status) %>% rename(Category = Project_Status) %>% mutate(Variable = "Project_Status"), df %>% count(Discipline) %>% rename(Category = Discipline) %>% mutate(Variable = "Discipline") ) %>% group_by(Variable) %>% mutate(Pct = round(100 * n / sum(n), 1)) %>% ungroup() %>% select(Variable, Category, Count = n, `%` = Pct) cat_summary %>% kbl(caption = "Frequency Distribution of Categorical Variables") %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = TRUE, font_size = 13) %>% row_spec(0, bold = TRUE, color = "#00d4ff", background = "#1c2333") %>% pack_rows(index = table(cat_summary$Variable)) ``` ## 4.3 Interactive Data Explorer Use the dropdown filters below to dynamically explore the dataset by building type, project status, or discipline scope. ```{r} #| label: download-btn-explorer #| echo: false csv_text2 <- paste(readLines("mep_projects.csv", warn = FALSE), collapse = "\n") csv_b64_2 <- base64enc::base64encode(charToRaw(csv_text2)) htmltools::browsable(htmltools::HTML(sprintf(' <a class="download-data-btn" href="data:text/csv;base64,%s" download="mep_projects.csv"> <svg viewBox="0 0 24 24"><path d="M12 16l-5-5h3V4h4v7h3l-5 5zm-7 2h14v2H5v-2z"/></svg> Download Dataset (CSV) </a> ', csv_b64_2))) ``` ```{r} #| label: data-table library(crosstalk) # Prepare shareable data for crosstalk df_shared <- df %>% select(Project_ID, Project_Type, Project_Status, Value_Millions, Design_Duration_Days, Revision_Count, Project_Year, Discipline) %>% mutate(Value_Millions = round(Value_Millions, 2)) %>% rename(`Value (₦M)` = Value_Millions, `Duration (Days)` = Design_Duration_Days, Revisions = Revision_Count, Year = Project_Year, Type = Project_Type, Status = Project_Status, ID = Project_ID) shared_df <- SharedData$new(df_shared) # Crosstalk filter widgets bscols(widths = c(4, 4, 4), filter_select("type_filter", "Building Type", shared_df, ~Type), filter_select("status_filter", "Project Status", shared_df, ~Status), filter_select("disc_filter", "Discipline", shared_df, ~Discipline) ) ``` ```{r} #| label: data-table-linked datatable(shared_df, caption = "Full Project Dataset — CA Consultants Limited (use filters above)", options = list(pageLength = 10, scrollX = TRUE, dom = 'frtip', initComplete = JS( "function(settings, json) {", " $(this.api().table().header()).css({'background-color': '#1c2333', 'color': '#00d4ff'});", "}" )), rownames = FALSE ) ```    # 5. Exploratory Data Analysis (EDA) ```{=html} <span class="technique-badge badge-eda">Technique 1 of 5 • Ch. 4</span> ``` > *"Far better an approximate answer to the right question... than an exact answer to the wrong question." --- John Tukey* **Business Justification:** Before building models or testing hypotheses, a project manager must understand the shape and quality of the data that drives decisions. EDA reveals whether our project records are complete, whether extreme values distort averages, and what the underlying distributions look like. At CA Consultants, this means understanding whether our reported portfolio metrics faithfully represent actual project economics. As Anscombe's Quartet (1973) memorably demonstrates, summary statistics alone can conceal wildly different data structures --- the only remedy is to *look* at the data before modelling it. ::: {.panel-tabset} ## R ### 5.1 Missing Value Analysis ```{r} #| label: eda-missing missing_summary <- df %>% summarise(across(everything(), ~sum(is.na(.)))) %>% pivot_longer(everything(), names_to = "Variable", values_to = "Missing") %>% mutate( Total = nrow(df), `% Missing` = round(100 * Missing / Total, 1), Status = if_else(Missing == 0, "Complete", "Has Missing") ) %>% filter(Missing > 0) %>% arrange(desc(Missing)) missing_summary %>% kbl(caption = "Variables with Missing Values") %>% kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE, font_size = 13) %>% row_spec(0, bold = TRUE, color = "#00d4ff", background = "#1c2333") ``` ```{=html} <div class="insight-box"> <strong>Data Quality Finding 1 — Structurally Missing Values:</strong> Design_Duration_Days and Revision_Count are missing for projects where CA Consultants provided <strong>construction supervision only</strong> — no design work was performed in-house, so no design duration or revision history exists. These are not data-entry errors but reflect a genuine operational distinction. These records are retained for portfolio-level analysis and excluded from design-specific modelling. </div> ``` ### 5.2 Outlier Detection ```{r} #| label: eda-outliers # IQR-based outlier detection on Project Value Q1 <- quantile(df$Project_Value, 0.25, na.rm = TRUE) Q3 <- quantile(df$Project_Value, 0.75, na.rm = TRUE) IQR_val <- Q3 - Q1 lower_fence <- Q1 - 1.5 * IQR_val upper_fence <- Q3 + 1.5 * IQR_val outliers <- df %>% filter(Project_Value > upper_fence | Project_Value < lower_fence) htmltools::browsable(htmltools::HTML(sprintf(' <div style="background: var(--bg-tertiary); border-left: 4px solid var(--accent-amber); padding: 1rem 1.5rem; border-radius: 6px; margin: 1rem 0;"> <h4 style="color: var(--accent-amber); margin: 0 0 0.5rem;">IQR Method for Project Value</h4> <table style="font-size: 0.95rem; color: var(--text-secondary);"> <tr><td style="padding: 4px 16px 4px 0;">Q1</td><td>₦%s</td></tr> <tr><td style="padding: 4px 16px 4px 0;">Q3</td><td>₦%s</td></tr> <tr><td style="padding: 4px 16px 4px 0;">IQR</td><td>₦%s</td></tr> <tr><td style="padding: 4px 16px 4px 0;">Upper fence</td><td>₦%s</td></tr> <tr><td style="padding: 4px 16px 4px 0;"><strong>Outliers detected</strong></td><td><strong>%d</strong> (%.1f%%%% of data) — all on the high end</td></tr> </table> </div>', comma(Q1), comma(Q3), comma(IQR_val), comma(upper_fence), nrow(outliers), 100*nrow(outliers)/nrow(df)))) ``` ```{r} #| label: eda-outlier-plot #| fig-cap: "Project Value Distribution with Outlier Boundaries" ggplot(df, aes(x = Project_Value / 1e6)) + geom_histogram(aes(y = after_stat(density)), bins = 30, fill = "#00d4ff", alpha = 0.6, colour = "#161b22") + geom_density(colour = "#ff006e", linewidth = 1.2) + geom_vline(xintercept = upper_fence / 1e6, colour = "#ffbe0b", linetype = "dashed", linewidth = 0.8) + annotate("text", x = upper_fence / 1e6 + 15, y = Inf, label = "Outlier Fence", colour = "#ffbe0b", vjust = 2, hjust = 0, size = 3.5) + labs(title = "Distribution of Project Values", subtitle = "Right-skewed with high-value outliers typical of construction portfolios", x = "Project Value (₦ Millions)", y = "Density") + scale_x_continuous(labels = comma_format()) ``` ```{=html} <div class="insight-box"> <strong>Data Quality Finding 2 — Right Skewness:</strong> Project values are heavily right-skewed (skewness ≈ 2+), meaning a few mega-projects (>₦150M) pull the mean well above the median. This is typical of construction consulting portfolios where a small number of landmark projects generate disproportionate revenue. For regression analysis, a <strong>log-transformation</strong> will be applied to stabilise variance and improve model fit. </div> ``` ### 5.3 Distribution by Category ```{r} #| label: eda-distributions #| fig-height: 8 p1 <- ggplot(df, aes(x = Project_Type, fill = Project_Type)) + geom_bar(width = 0.6, alpha = 0.85) + scale_fill_manual(values = pal_type) + labs(title = "Projects by Building Type", x = NULL, y = "Count") + guides(fill = "none") p2 <- ggplot(df, aes(x = Project_Status, fill = Project_Status)) + geom_bar(width = 0.6, alpha = 0.85) + scale_fill_manual(values = pal_status) + labs(title = "Projects by Status", x = NULL, y = "Count") + guides(fill = "none") p3 <- ggplot(df, aes(x = Discipline, fill = Discipline)) + geom_bar(width = 0.6, alpha = 0.85) + scale_fill_manual(values = pal_disc) + labs(title = "Projects by Discipline Scope", x = NULL, y = "Count") + guides(fill = "none") (p1 | p2 | p3) + plot_annotation( title = "Portfolio Composition at a Glance", subtitle = "Commercial projects and full-scope MEP engagements dominate the portfolio", theme = theme_ca_dark() ) ``` ## Python ```{python} #| label: eda-python #| output: asis import subprocess, sys _ = subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", "tabulate"]) import pandas as pd import numpy as np import matplotlib.pyplot as plt import matplotlib matplotlib.rcParams.update({ 'figure.facecolor': '#161b22', 'axes.facecolor': '#161b22', 'axes.edgecolor': '#30363d', 'axes.labelcolor': '#c9d1d9', 'text.color': '#c9d1d9', 'xtick.color': '#8b949e', 'ytick.color': '#8b949e', 'grid.color': '#30363d', 'font.family': 'sans-serif', 'font.size': 11 }) # Load and clean data df_py = pd.read_csv("mep_projects.csv") if df_py['Project_Value_NGN'].dtype == 'object': df_py['Project_Value'] = pd.to_numeric(df_py['Project_Value_NGN'].str.replace(r'[₦, ]', '', regex=True), errors='coerce') else: df_py['Project_Value'] = pd.to_numeric(df_py['Project_Value_NGN'], errors='coerce') df_py['Value_Millions'] = df_py['Project_Value'] / 1e6 # Summary statistics desc = df_py[['Project_Value', 'Design_Duration_Days', 'Revision_Count', 'Project_Year']].describe().round(2) print("\n**Numeric Variable Summary**\n") print(desc.to_markdown()) # Missing values missing = df_py.isnull().sum().to_frame("Missing Count") missing["Present"] = len(df_py) - missing["Missing Count"] missing["% Missing"] = (100 * missing["Missing Count"] / len(df_py)).round(1) print("\n\n**Missing Values**\n") print(missing.to_markdown()) # Categorical distributions print("\n\n**Categorical Distributions**\n") for col in ['Project_Type', 'Project_Status', 'Discipline']: counts = df_py[col].value_counts().to_frame("Count") counts["Percentage"] = (100 * counts["Count"] / len(df_py)).round(1).astype(str) + "%" print(f"\n*{col}*\n") print(counts.to_markdown()) ``` ```{python} #| label: eda-python-plot #| fig-cap: "Distribution of Project Values (Python/Matplotlib)" #| results: hide fig, axes = plt.subplots(1, 3, figsize=(14, 5)) # Histogram of values _ = axes[0].hist(df_py['Value_Millions'].dropna(), bins=25, color='#00d4ff', alpha=0.7, edgecolor='#161b22') _ = axes[0].set_title('Project Value Distribution', color='#f0f6fc', fontweight='bold') _ = axes[0].set_xlabel('Value (₦ Millions)') _ = axes[0].set_ylabel('Frequency') # Bar chart: Project Type type_counts = df_py['Project_Type'].value_counts() colors_type = ['#00d4ff', '#ff006e', '#ffbe0b'] _ = axes[1].barh(type_counts.index, type_counts.values, color=colors_type[:len(type_counts)], alpha=0.85) _ = axes[1].set_title('Projects by Type', color='#f0f6fc', fontweight='bold') _ = axes[1].set_xlabel('Count') # Bar chart: Status status_counts = df_py['Project_Status'].value_counts() colors_status = ['#00ff87', '#ffbe0b', '#8b949e'] _ = axes[2].barh(status_counts.index, status_counts.values, color=colors_status[:len(status_counts)], alpha=0.85) _ = axes[2].set_title('Projects by Status', color='#f0f6fc', fontweight='bold') _ = axes[2].set_xlabel('Count') plt.tight_layout() plt.show() ``` :::    # 6. Data Visualisation ```{=html} <span class="technique-badge badge-viz">Technique 2 of 5 • Ch. 5</span> ``` **Business Justification:** Visualisation transforms CA Consultants' project records into a narrative that leadership can act on. Each plot below answers a specific strategic question: Where does our revenue concentrate? How has our portfolio evolved? What is the relationship between design effort and project scale? Choosing the right chart type for each question --- and explaining why --- is itself an analytical skill. ::: {.panel-tabset} ## R ### 6.1 Portfolio Value by Building Type ```{r} #| label: viz-violin #| fig-cap: "Distribution of Project Values by Building Type — violin + jitter reveals the full shape, not just the average" p <- ggplot(df, aes(x = Project_Type, y = Value_Millions, fill = Project_Type)) + geom_violin(alpha = 0.3, colour = NA, scale = "width") + geom_boxplot(width = 0.15, alpha = 0.6, outlier.shape = NA, colour = "#c9d1d9") + geom_jitter(aes(colour = Project_Type), width = 0.12, alpha = 0.6, size = 2) + scale_fill_manual(values = pal_type) + scale_colour_manual(values = pal_type) + labs(title = "Project Value Distribution by Building Type", subtitle = "Commercial projects show the widest range; Residential clusters at lower values", x = NULL, y = "Project Value (₦ Millions)") + guides(fill = "none", colour = "none") + scale_y_continuous(labels = comma_format()) ggplotly(p, tooltip = c("y")) %>% layout(paper_bgcolor = '#0d1117', plot_bgcolor = '#161b22', font = list(color = '#c9d1d9')) ``` > **Why a violin + boxplot?** A bar chart would show only the mean, hiding the skewness and multimodality that define our portfolio. The violin reveals that most commercial projects cluster below ₦100M, with a long tail of high-value engagements. This shape directly informs risk: our revenue depends on winning a few large contracts. ### 6.2 Portfolio Evolution Over Time ```{r} #| label: viz-timeline #| fig-cap: "Annual project count and composition showing the firm's growth trajectory" year_type <- df %>% count(Project_Year, Project_Type) %>% complete(Project_Year, Project_Type, fill = list(n = 0)) ggplot(year_type, aes(x = Project_Year, y = n, fill = Project_Type)) + geom_area(alpha = 0.7, position = "stack") + scale_fill_manual(values = pal_type) + labs(title = "CA Consultants Project Volume Over Time", subtitle = "Accelerating growth since 2020 driven by commercial engagements", x = "Year", y = "Number of Projects", fill = "Building Type") + scale_x_continuous(breaks = seq(2006, 2026, 2)) ``` ### 6.3 Design Complexity: Duration vs. Revisions ```{r} #| label: viz-scatter #| fig-cap: "Each bubble is a project — size encodes value, colour encodes building type" p_scatter <- df %>% filter(!is.na(Design_Duration_Days) & !is.na(Revision_Count)) %>% ggplot(aes(x = Design_Duration_Days, y = Revision_Count, size = Value_Millions, colour = Project_Type, text = paste0("ID: ", Project_ID, "<br>Value: ₦", round(Value_Millions, 1), "M"))) + geom_point(alpha = 0.7) + scale_colour_manual(values = pal_type) + scale_size_continuous(range = c(2, 15), labels = comma_format(suffix = "M")) + labs(title = "Design Complexity Landscape", subtitle = "Longer design phases and more revisions correlate with higher-value projects", x = "Design Duration (Days)", y = "Number of Revisions", colour = "Building Type", size = "Value (₦M)") ggplotly(p_scatter, tooltip = "text") %>% layout(paper_bgcolor = '#0d1117', plot_bgcolor = '#161b22', font = list(color = '#c9d1d9')) ``` > **Why a bubble chart?** Three dimensions of project complexity --- duration, revision count, and value --- are encoded simultaneously. This reveals that the upper-right quadrant (long duration, many revisions) is exclusively occupied by high-value commercial projects, validating the intuition that complex projects command premium fees. ### 6.4 Revenue Concentration by Project Size Category ```{r} #| label: viz-treemap #| fig-cap: "Comparing the number of projects versus the revenue they generate" size_summary <- df %>% group_by(Value_Category) %>% summarise( Count = n(), Total_Value = sum(Project_Value, na.rm = TRUE) / 1e9, .groups = "drop" ) %>% mutate( Pct_Count = 100 * Count / sum(Count), Pct_Value = 100 * Total_Value / sum(Total_Value) ) p_conc <- size_summary %>% pivot_longer(cols = c(Pct_Count, Pct_Value), names_to = "Metric", values_to = "Pct") %>% mutate(Metric = if_else(Metric == "Pct_Count", "% of Projects", "% of Revenue")) %>% ggplot(aes(x = Value_Category, y = Pct, fill = Metric)) + geom_col(position = "dodge", alpha = 0.85, width = 0.6) + scale_fill_manual(values = c("% of Projects" = "#00d4ff", "% of Revenue" = "#ff006e")) + labs(title = "The Revenue Concentration Paradox", subtitle = "Mega-projects are rare but generate a disproportionate share of revenue", x = NULL, y = "Percentage (%)", fill = NULL) + geom_text(aes(label = paste0(round(Pct, 1), "%")), position = position_dodge(width = 0.6), vjust = -0.5, colour = "#c9d1d9", size = 3.2) p_conc ``` ### 6.5 Discipline Scope and Project Value ```{r} #| label: viz-discipline #| fig-cap: "Full MEP scope commands higher values than single-discipline engagements" ggplot(df, aes(x = reorder(Discipline, Value_Millions, FUN = median), y = Value_Millions, fill = Discipline)) + geom_boxplot(alpha = 0.7, width = 0.5, outlier.colour = "#ff006e", outlier.alpha = 0.5) + scale_fill_manual(values = pal_disc) + coord_flip() + labs(title = "Project Value by Discipline Scope", subtitle = "Full-scope MEP projects command significantly higher contract values", x = NULL, y = "Project Value (₦ Millions)") + guides(fill = "none") + scale_y_continuous(labels = comma_format()) ``` ## Python ```{python} #| label: viz-python #| fig-cap: "Python visualisation suite using Seaborn" #| results: hide import seaborn as sns fig, axes = plt.subplots(2, 2, figsize=(14, 11)) # 1. Violin: Value by Type palette_type = {'Commercial': '#00d4ff', 'Industrial': '#ff006e', 'Residential': '#ffbe0b'} order = ['Commercial', 'Industrial', 'Residential'] _ = sns.violinplot(data=df_py, x='Project_Type', y='Value_Millions', order=order, palette=palette_type, alpha=0.6, inner='box', ax=axes[0,0]) _ = axes[0,0].set_title('Value Distribution by Type', color='#f0f6fc', fontweight='bold') _ = axes[0,0].set_xlabel('') _ = axes[0,0].set_ylabel('Value (₦ Millions)') # 2. Timeline year_counts = df_py.groupby(['Project_Year', 'Project_Type']).size().unstack(fill_value=0) _ = year_counts.plot.area(ax=axes[0,1], alpha=0.7, color=[palette_type.get(c, '#888') for c in year_counts.columns]) _ = axes[0,1].set_title('Portfolio Growth Over Time', color='#f0f6fc', fontweight='bold') _ = axes[0,1].set_xlabel('Year') _ = axes[0,1].set_ylabel('Number of Projects') _ = axes[0,1].legend(fontsize=8) # 3. Scatter: Duration vs Value mask = df_py['Design_Duration_Days'].notna() & df_py['Revision_Count'].notna() for ptype, colour in palette_type.items(): subset = df_py[mask & (df_py['Project_Type'] == ptype)] _ = axes[1,0].scatter(subset['Design_Duration_Days'], subset['Value_Millions'], c=colour, alpha=0.6, s=50, label=ptype, edgecolors='none') _ = axes[1,0].set_title('Duration vs Value', color='#f0f6fc', fontweight='bold') _ = axes[1,0].set_xlabel('Design Duration (Days)') _ = axes[1,0].set_ylabel('Value (₦ Millions)') _ = axes[1,0].legend(fontsize=8) # 4. Box: Value by Discipline disc_order = ['MEP', 'E Only', 'M Only'] palette_disc = {'MEP': '#00d4ff', 'E Only': '#ff006e', 'M Only': '#ffbe0b'} _ = sns.boxplot(data=df_py, x='Discipline', y='Value_Millions', order=disc_order, palette=palette_disc, ax=axes[1,1]) _ = axes[1,1].set_title('Value by Discipline Scope', color='#f0f6fc', fontweight='bold') _ = axes[1,1].set_xlabel('') _ = axes[1,1].set_ylabel('Value (₦ Millions)') plt.tight_layout() plt.show() ``` :::    # 7. Hypothesis Testing ```{=html} <span class="technique-badge badge-hyp">Technique 3 of 5 • Ch. 6</span> ``` **Business Justification:** When CA Consultants considers pivoting its business development toward a particular building type, the decision should rest on statistical evidence. Hypothesis testing moves us from "it looks like commercial projects are worth more" to "we can be 95% confident that the difference is real and not due to chance." This is the difference between data-informed strategy and guesswork. ::: {.panel-tabset} ## R ### 7.1 Hypothesis 1: Project Value Differs by Building Type **H₀:** The mean project value is equal across Commercial, Industrial, and Residential projects. **H₁:** At least one building type has a significantly different mean project value. **Test:** Given the right-skewed distribution, we use the **Kruskal-Wallis rank-sum test** (non-parametric alternative to one-way ANOVA) as a primary test, supplemented by a Welch ANOVA for comparison. ```{r} #| label: hyp1-test # Check normality within groups (Shapiro-Wilk) norm_tests <- df %>% group_by(Project_Type) %>% summarise( n = n(), shapiro_p = if (n() >= 3 & n() <= 5000) shapiro.test(Project_Value)$p.value else NA_real_, .groups = "drop" ) norm_tests %>% mutate(Normal = if_else(shapiro_p > 0.05, "Yes", "No")) %>% kbl(caption = "Shapiro-Wilk Normality Tests by Group", digits = 4) %>% kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>% row_spec(0, bold = TRUE, color = "#00d4ff", background = "#1c2333") ``` ```{r} #| label: hyp1-levene #| echo: true # Levene's Test for Homogeneity of Variance (assumption check) levene_result <- car::leveneTest(Project_Value ~ Project_Type, data = df) interp <- if_else(levene_result$`Pr(>F)`[1] < 0.05, "Variances are significantly unequal — classical ANOVA is inappropriate", "Variances are approximately equal") htmltools::browsable(htmltools::HTML(sprintf(' <div style="background: var(--bg-tertiary); border-left: 4px solid var(--accent-cyan); padding: 1rem 1.5rem; border-radius: 6px; margin: 1rem 0;"> <h4 style="color: var(--accent-cyan); margin: 0 0 0.5rem;">Levene’s Test for Equal Variances</h4> <table style="font-size: 0.95rem; color: var(--text-secondary);"> <tr><td style="padding: 4px 16px 4px 0;"><strong>F(%d, %d)</strong></td><td>= %.2f</td></tr> <tr><td style="padding: 4px 16px 4px 0;"><strong>p-value</strong></td><td>= %.4f</td></tr> <tr><td style="padding: 4px 16px 4px 0;"><strong>Interpretation</strong></td><td>%s</td></tr> </table> <p style="margin: 0.8rem 0 0; color: var(--text-muted); font-size: 0.9rem;"> Since normality is violated (Shapiro-Wilk), we proceed with the <strong>Kruskal-Wallis</strong> non-parametric test as the primary analysis, and supplement with <strong>Welch’s ANOVA</strong> (which does not assume equal variances) for robustness. </p> </div>', levene_result$Df[1], levene_result$Df[2], levene_result$`F value`[1], levene_result$`Pr(>F)`[1], interp))) ``` ```{r} #| label: hyp1-kruskal # Kruskal-Wallis Test (non-parametric) kw_test <- kruskal.test(Project_Value ~ Project_Type, data = df) # Welch ANOVA (does not assume equal variances) welch_test <- oneway.test(Project_Value ~ Project_Type, data = df, var.equal = FALSE) # Effect size (epsilon-squared) kw_effect <- df %>% kruskal_effsize(Project_Value ~ Project_Type) # Pairwise Wilcoxon with Bonferroni correction pairwise <- pairwise.wilcox.test(df$Project_Value, df$Project_Type, p.adjust.method = "bonferroni") agree_txt <- if_else(kw_test$p.value < 0.05 & welch_test$p.value < 0.05, "significant difference exists across building types", "results are consistent") htmltools::browsable(htmltools::HTML(sprintf(' <div style="background: var(--bg-tertiary); border-left: 4px solid var(--accent-purple); padding: 1rem 1.5rem; border-radius: 6px; margin: 1rem 0;"> <h4 style="color: var(--accent-purple); margin: 0 0 0.5rem;">Kruskal-Wallis Test (primary — non-parametric)</h4> <table style="font-size: 0.95rem; color: var(--text-secondary);"> <tr><td style="padding: 4px 16px 4px 0;">χ²</td><td>= %.2f, df = %d</td></tr> <tr><td style="padding: 4px 16px 4px 0;"><strong>p-value</strong></td><td>= %.4f</td></tr> <tr><td style="padding: 4px 16px 4px 0;">Effect size (η²[H])</td><td>= %.3f (%s)</td></tr> </table> <h4 style="color: var(--accent-amber); margin: 1rem 0 0.5rem;">Welch’s ANOVA (robustness check)</h4> <table style="font-size: 0.95rem; color: var(--text-secondary);"> <tr><td style="padding: 4px 16px 4px 0;">F(%.1f, %.1f)</td><td>= %.2f</td></tr> <tr><td style="padding: 4px 16px 4px 0;"><strong>p-value</strong></td><td>= %.4f</td></tr> </table> <p style="margin: 0.8rem 0 0; color: var(--accent-green); font-size: 0.9rem; font-weight: 600;"> ✓ Both tests agree: %s </p> </div>', kw_test$statistic, kw_test$parameter, kw_test$p.value, kw_effect$effsize, kw_effect$magnitude, welch_test$parameter[1], welch_test$parameter[2], welch_test$statistic, welch_test$p.value, agree_txt))) ``` ```{r} #| label: hyp1-pairwise # Display pairwise comparisons as styled table pw_matrix <- pairwise$p.value pw_df <- as.data.frame(pw_matrix) pw_df <- cbind(Group = rownames(pw_df), pw_df) pw_df[] <- lapply(pw_df, function(x) if(is.numeric(x)) round(x, 4) else x) htmltools::browsable(htmltools::HTML(paste0( '<div style="background: var(--bg-tertiary); border-left: 4px solid var(--accent-magenta); padding: 1rem 1.5rem; border-radius: 6px; margin: 1rem 0;">', '<h4 style="color: var(--accent-magenta); margin: 0 0 0.5rem;">Post-Hoc Pairwise Wilcoxon Tests (Bonferroni-adjusted)</h4>', '<p style="color: var(--text-muted); font-size: 0.85rem; margin-bottom: 0.5rem;">Method: ', pairwise$p.adjust.method, '</p>', kableExtra::kbl(pw_df, format = "html", row.names = FALSE) %>% kableExtra::kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE, font_size = 14) %>% kableExtra::row_spec(0, bold = TRUE, color = "#00d4ff", background = "#1c2333"), '</div>' ))) ``` ```{r} #| label: hyp1-visual #| fig-cap: "Group comparison visualisation for Hypothesis 1" # Group medians for annotation group_med <- df %>% group_by(Project_Type) %>% summarise(med = median(Project_Value / 1e6), .groups = "drop") ggplot(df, aes(x = Project_Type, y = Value_Millions, fill = Project_Type)) + geom_boxplot(alpha = 0.6, width = 0.5, outlier.alpha = 0.4) + geom_jitter(aes(colour = Project_Type), width = 0.15, alpha = 0.4, size = 1.5) + scale_fill_manual(values = pal_type) + scale_colour_manual(values = pal_type) + geom_text(data = group_med, aes(x = Project_Type, y = med, label = paste0("Median: ₦", round(med, 1), "M")), vjust = -1.5, colour = "#f0f6fc", size = 3.5, fontface = "bold") + labs(title = "Project Value by Building Type", subtitle = sprintf("Kruskal-Wallis p = %.4f — %s", kw_test$p.value, if_else(kw_test$p.value < 0.05, "Significant difference detected", "No significant difference")), x = NULL, y = "Project Value (₦ Millions)") + guides(fill = "none", colour = "none") + scale_y_continuous(labels = comma_format()) ``` ```{=html} <div class="result-box reject"> <strong>Hypothesis 1 — Result:</strong> We <strong>reject H₀</strong> at the 5% significance level. Project values differ significantly across building types. Commercial projects have the highest median value, supporting the strategic decision to prioritise commercial engagements for revenue growth. The effect size indicates this is a meaningful difference, not just a statistically significant one. </div> ``` ### 7.2 Hypothesis 2: Design Duration Differs by Building Type **H₀:** The mean design duration is equal across Commercial, Industrial, and Residential projects. **H₁:** At least one building type has a significantly different mean design duration. **Test:** Kruskal-Wallis (non-parametric), using only projects with design data. ```{r} #| label: hyp2-test df_dur <- df %>% filter(!is.na(Design_Duration_Days)) # Kruskal-Wallis kw_dur <- kruskal.test(Design_Duration_Days ~ Project_Type, data = df_dur) kw_dur_effect <- df_dur %>% kruskal_effsize(Design_Duration_Days ~ Project_Type) htmltools::browsable(htmltools::HTML(sprintf(' <div style="background: var(--bg-tertiary); border-left: 4px solid var(--accent-purple); padding: 1rem 1.5rem; border-radius: 6px; margin: 1rem 0;"> <h4 style="color: var(--accent-purple); margin: 0 0 0.5rem;">Kruskal-Wallis Test (Design Duration ~ Building Type)</h4> <table style="font-size: 0.95rem; color: var(--text-secondary);"> <tr><td style="padding: 4px 16px 4px 0;">χ²</td><td>= %.2f, df = %d</td></tr> <tr><td style="padding: 4px 16px 4px 0;"><strong>p-value</strong></td><td>= %.4f</td></tr> <tr><td style="padding: 4px 16px 4px 0;">Effect size (η²[H])</td><td>= %.3f (%s)</td></tr> <tr><td style="padding: 4px 16px 4px 0;">Observations used</td><td>%d (excluding supervision-only projects)</td></tr> </table> </div>', kw_dur$statistic, kw_dur$parameter, kw_dur$p.value, kw_dur_effect$effsize, kw_dur_effect$magnitude, nrow(df_dur)))) ``` ```{r} #| label: hyp2-visual #| fig-cap: "Design duration comparison across building types" dur_med <- df_dur %>% group_by(Project_Type) %>% summarise(med = median(Design_Duration_Days), .groups = "drop") ggplot(df_dur, aes(x = Project_Type, y = Design_Duration_Days, fill = Project_Type)) + geom_boxplot(alpha = 0.6, width = 0.5, outlier.alpha = 0.4) + geom_jitter(aes(colour = Project_Type), width = 0.15, alpha = 0.4, size = 1.5) + scale_fill_manual(values = pal_type) + scale_colour_manual(values = pal_type) + geom_text(data = dur_med, aes(x = Project_Type, y = med, label = paste0("Median: ", round(med, 0), " days")), vjust = -1.5, colour = "#f0f6fc", size = 3.5, fontface = "bold") + labs(title = "Design Duration by Building Type", subtitle = sprintf("Kruskal-Wallis p = %.4f — %s", kw_dur$p.value, if_else(kw_dur$p.value < 0.05, "Significant difference detected", "No significant difference")), x = NULL, y = "Design Duration (Days)") + guides(fill = "none", colour = "none") ``` ```{r} #| label: hyp2-result #| echo: false if (kw_dur$p.value < 0.05) { htmltools::browsable(htmltools::HTML(sprintf( '<div class="result-box reject"><strong>Hypothesis 2 — Result:</strong> We <strong>reject H₀</strong> (p = %.4f). Design duration differs significantly by building type. This informs project planning: the firm should allocate different timeline buffers depending on the building type when preparing proposals.</div>', kw_dur$p.value))) } else { htmltools::browsable(htmltools::HTML(sprintf( '<div class="result-box"><strong>Hypothesis 2 — Result:</strong> We <strong>fail to reject H₀</strong> (p = %.4f). There is no statistically significant difference in design duration across building types at the 5%%%% level. While median durations may appear different, the variability within each group is large enough that we cannot conclude the differences are systematic. For project planning, this suggests that building type alone is not a reliable predictor of design timeline.</div>', kw_dur$p.value))) } ``` ## Python ```{python} #| label: hyp-python #| output: asis from scipy import stats # Clean numeric columns if df_py['Project_Value_NGN'].dtype == 'object': df_py['Project_Value'] = pd.to_numeric(df_py['Project_Value_NGN'].str.replace(r'[₦, ]', '', regex=True), errors='coerce') else: df_py['Project_Value'] = pd.to_numeric(df_py['Project_Value_NGN'], errors='coerce') df_py['Design_Duration_Days'] = pd.to_numeric(df_py['Design_Duration_Days'], errors='coerce') df_py['Revision_Count'] = pd.to_numeric(df_py['Revision_Count'], errors='coerce') # Hypothesis 1: Value ~ Project Type groups_val = [grp['Project_Value'].dropna().values for _, grp in df_py.groupby('Project_Type')] kw_stat, kw_p = stats.kruskal(*groups_val) # Hypothesis 1 results dec1 = "Reject H₀" if kw_p < 0.05 else "Fail to reject H₀" hyp1 = pd.DataFrame({"Statistic": ["Kruskal-Wallis H", "p-value", "Decision (α = 0.05)"], "Value": [f"{kw_stat:.3f}", f"{kw_p:.4f}", dec1]}) print("**Hypothesis 1: Project Value ~ Building Type**\n") print(hyp1.to_markdown(index=False)) # Hypothesis 2: Duration ~ Project Type df_dur_py = df_py.dropna(subset=['Design_Duration_Days']) groups_dur = [grp['Design_Duration_Days'].dropna().values for _, grp in df_dur_py.groupby('Project_Type')] kw_stat2, kw_p2 = stats.kruskal(*groups_dur) dec2 = "Reject H₀" if kw_p2 < 0.05 else "Fail to reject H₀" hyp2 = pd.DataFrame({"Statistic": ["Kruskal-Wallis H", "p-value", "Decision (α = 0.05)"], "Value": [f"{kw_stat2:.3f}", f"{kw_p2:.4f}", dec2]}) print("\n\n**Hypothesis 2: Design Duration ~ Building Type**\n") print(hyp2.to_markdown(index=False)) # Group medians medians = df_py.groupby('Project_Type')['Value_Millions'].median().round(1).to_frame("Median Value (₦M)") print("\n\n**Group Medians — Project Value**\n") print(medians.to_markdown()) ``` :::    # 8. Correlation Analysis ```{=html} <span class="technique-badge badge-cor">Technique 4 of 5 • Ch. 8</span> ``` **Business Justification:** Correlation analysis reveals the strength and direction of linear relationships between project characteristics. For CA Consultants, understanding whether design duration, revision count, and project year move together with project value is fundamental to fee estimation. Strong correlations suggest predictable relationships that can be built into pricing models; weak ones warn against oversimplified assumptions. ::: {.panel-tabset} ## R ### 8.1 Correlation Matrix (Spearman) ```{r} #| label: corr-matrix #| fig-cap: "Correlation heatmap of numeric project variables (Spearman method)" # Use Spearman due to non-normality corr_data <- df_design %>% select(Project_Value, Design_Duration_Days, Revision_Count, Project_Year) %>% rename( `Project Value` = Project_Value, `Design Duration` = Design_Duration_Days, `Revision Count` = Revision_Count, `Project Year` = Project_Year ) corr_mat <- cor(corr_data, method = "spearman", use = "complete.obs") ggcorrplot(corr_mat, type = "lower", lab = TRUE, lab_size = 4.5, lab_col = "#f0f6fc", colors = c("#ff006e", "#161b22", "#00d4ff"), outline.color = "#30363d", ggtheme = theme_ca_dark()) + labs(title = "Spearman Correlation Matrix", subtitle = sprintf("Based on %d projects with complete design data", nrow(df_design))) ``` ### 8.2 Pearson vs. Spearman Comparison ```{r} #| label: corr-comparison # Compare Pearson (linear, assumes normality) vs Spearman (rank-based, robust) corr_pearson <- cor(corr_data, method = "pearson", use = "complete.obs") corr_spearman <- corr_mat # Build comparison table vars <- colnames(corr_data) comparison <- data.frame() for (i in 1:(length(vars)-1)) { for (j in (i+1):length(vars)) { comparison <- rbind(comparison, data.frame( Variable_1 = vars[i], Variable_2 = vars[j], Pearson_r = corr_pearson[i,j], Spearman_rho = corr_spearman[i,j], Difference = abs(corr_pearson[i,j] - corr_spearman[i,j]) )) } } comparison %>% arrange(desc(abs(Spearman_rho))) %>% mutate(across(where(is.numeric), ~round(., 3)), Agreement = if_else(Difference < 0.1, "Strong", "Divergent")) %>% kbl(caption = "Pearson vs. Spearman — When they diverge, the relationship may be monotonic but non-linear") %>% kable_styling(bootstrap_options = c("striped", "hover"), full_width = TRUE, font_size = 12) %>% row_spec(0, bold = TRUE, color = "#00d4ff", background = "#1c2333") ``` > **Why both?** Pearson measures *linear* association; Spearman measures *monotonic* association. Where they diverge meaningfully (Difference > 0.1), the relationship exists but is non-linear — important for deciding whether to use linear regression or a non-linear model. ### 8.3 Key Correlations — Business Interpretation ```{r} #| label: corr-detail # Extract and rank correlations corr_pairs <- as.data.frame(as.table(corr_mat)) %>% filter(Var1 != Var2) %>% mutate(abs_corr = abs(Freq)) %>% arrange(desc(abs_corr)) %>% distinct(abs_corr, .keep_all = TRUE) %>% # remove duplicates head(6) %>% rename(Variable_1 = Var1, Variable_2 = Var2, Spearman_rho = Freq) %>% select(-abs_corr) corr_pairs %>% mutate(Spearman_rho = round(Spearman_rho, 3), Strength = case_when( abs(Spearman_rho) >= 0.7 ~ "Strong", abs(Spearman_rho) >= 0.4 ~ "Moderate", TRUE ~ "Weak" )) %>% kbl(caption = "Ranked Pairwise Correlations") %>% kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>% row_spec(0, bold = TRUE, color = "#00d4ff", background = "#1c2333") ``` ### 8.4 Partial Correlation — Controlling for Confounders ```{r} #| label: corr-partial # Partial correlation: Value ~ Duration, controlling for Revision Count and Year # Manual computation using residuals (avoids extra package dependency) partial_cor <- function(x, y, controls, data) { resid_x <- residuals(lm(reformulate(controls, response = x), data = data)) resid_y <- residuals(lm(reformulate(controls, response = y), data = data)) ct <- cor.test(resid_x, resid_y, method = "pearson") list(r = ct$estimate, p = ct$p.value, ci_low = ct$conf.int[1], ci_high = ct$conf.int[2]) } pc1 <- partial_cor("Project_Value", "Design_Duration_Days", c("Revision_Count", "Project_Year"), df_design) pc2 <- partial_cor("Project_Value", "Revision_Count", c("Design_Duration_Days", "Project_Year"), df_design) data.frame( Relationship = c("Value ~ Duration | (Revisions, Year)", "Value ~ Revisions | (Duration, Year)"), Partial_r = c(pc1$r, pc2$r), p_value = c(pc1$p, pc2$p), CI_lower = c(pc1$ci_low, pc2$ci_low), CI_upper = c(pc1$ci_high, pc2$ci_high) ) %>% mutate(across(where(is.numeric), ~round(., 3)), Significant = if_else(p_value < 0.05, "Yes", "No")) %>% kbl(caption = "Partial Correlations — isolating each variable's unique relationship with Project Value") %>% kable_styling(bootstrap_options = c("striped", "hover"), full_width = TRUE) %>% row_spec(0, bold = TRUE, color = "#00d4ff", background = "#1c2333") ``` > **Why partial correlation?** Zero-order correlations can be misleading when variables are inter-correlated. Design duration and revision count are themselves correlated — a longer project tends to accumulate more revisions. Partial correlation strips out this confounding to reveal each variable's *independent* association with project value. This is critical for fee estimation: if duration's correlation with value vanishes after controlling for revisions, then duration is merely a proxy — not a true driver. ### 8.5 Scatterplot: Strongest Correlation ```{r} #| label: corr-scatter #| fig-cap: "Relationship between design duration and project value — the strongest observed correlation" p_corr <- ggplot(df_design, aes(x = Design_Duration_Days, y = Value_Millions, colour = Project_Type)) + geom_point(aes(text = paste0("ID: ", Project_ID, "<br>₦", round(Value_Millions, 1), "M")), alpha = 0.7, size = 3) + geom_smooth(method = "lm", se = TRUE, colour = "#7928ca", fill = "#7928ca", alpha = 0.15, linetype = "dashed", formula = y ~ x) + scale_colour_manual(values = pal_type) + labs(title = "Design Duration vs. Project Value", subtitle = "Longer design phases are associated with higher-value projects", x = "Design Duration (Days)", y = "Project Value (₦ Millions)", colour = "Building Type") + scale_y_continuous(labels = comma_format()) ggplotly(p_corr, tooltip = "text") %>% layout(paper_bgcolor = '#0d1117', plot_bgcolor = '#161b22', font = list(color = '#c9d1d9')) ``` ```{=html} <div class="insight-box"> <strong>Correlation ≠ Causation:</strong> The positive correlation between design duration and project value does not necessarily mean spending more time on design <em>causes</em> higher fees. The most plausible causal direction is <strong>reverse</strong>: larger, more complex buildings require longer design phases. To test causality, one would need a controlled experiment — e.g., randomly assigning design timelines — which is impractical in a consulting context. However, the correlation is operationally useful for <strong>fee estimation</strong>: knowing the expected design duration provides a reasonable proxy for project value. </div> ``` ## Python ```{python} #| label: corr-python #| fig-cap: "Correlation heatmap (Python/Seaborn)" #| results: hide # Prepare design-complete subset df_design_py = df_py.dropna(subset=['Design_Duration_Days', 'Revision_Count']) corr_cols = ['Project_Value', 'Design_Duration_Days', 'Revision_Count', 'Project_Year'] corr_matrix = df_design_py[corr_cols].corr(method='spearman') fig, ax = plt.subplots(figsize=(8, 7)) mask = np.triu(np.ones_like(corr_matrix, dtype=bool)) _ = sns.heatmap(corr_matrix, mask=mask, annot=True, fmt='.3f', cmap='coolwarm', center=0, linewidths=1, linecolor='#30363d', cbar_kws={'shrink': 0.8}, annot_kws={'color': '#f0f6fc', 'fontsize': 12}, ax=ax) _ = ax.set_title('Spearman Correlation Matrix', color='#f0f6fc', fontweight='bold', pad=15) plt.tight_layout() plt.show() ``` ```{python} #| label: corr-python-table #| output: asis # Top correlations as styled table pairs = [] for i in range(len(corr_matrix.columns)): for j in range(i+1, len(corr_matrix.columns)): pairs.append((corr_matrix.columns[i], corr_matrix.columns[j], corr_matrix.iloc[i, j])) pairs.sort(key=lambda x: abs(x[2]), reverse=True) pairs_df = pd.DataFrame(pairs, columns=["Variable 1", "Variable 2", "Spearman ρ"]) pairs_df["Spearman ρ"] = pairs_df["Spearman ρ"].round(3) print("\n**Top Correlations (by absolute value)**\n") print(pairs_df.to_markdown(index=False)) ``` :::    # 9. Linear Regression ```{=html} <span class="technique-badge badge-reg">Technique 5 of 5 • Ch. 9</span> ``` **Business Justification:** Regression is the culmination of this analysis. While EDA describes, visualisation reveals, hypothesis testing confirms, and correlation quantifies pairwise relationships, regression puts it all together into a **predictive model** that explains project value as a function of multiple factors simultaneously. For CA Consultants, this model answers: "Given a project's building type, scope discipline, expected design duration, and complexity (revisions), what contract value should we target?" Each regression coefficient translates directly into a pricing lever. ::: {.panel-tabset} ## R ### 9.1 Model Specification We model **log-transformed project value** to address the right-skewness identified in the EDA. The predictors are: - `Design_Duration_Days` — length of design phase (continuous) - `Revision_Count` — number of design iterations (continuous) - `Project_Type` — building type (categorical: Commercial, Industrial, Residential) - `Discipline` — scope of MEP services (categorical) - `Project_Year` — year initiated (continuous, captures inflation/growth) ```{r} #| label: reg-model # Fit the model on design-complete observations model <- lm(log(Project_Value) ~ Design_Duration_Days + Revision_Count + Project_Type + Discipline + Project_Year, data = df_design) # Model summary model_summary <- summary(model) model_tidy <- tidy(model, conf.int = TRUE) model_glance <- glance(model) htmltools::browsable(htmltools::HTML(sprintf(' <div style="background: var(--bg-tertiary); border-left: 4px solid var(--accent-green); padding: 1rem 1.5rem; border-radius: 6px; margin: 1rem 0;"> <h4 style="color: var(--accent-green); margin: 0 0 0.5rem;">Model Fit</h4> <table style="font-size: 0.95rem; color: var(--text-secondary);"> <tr><td style="padding: 4px 16px 4px 0;">R²</td><td>%.3f</td></tr> <tr><td style="padding: 4px 16px 4px 0;">Adjusted R²</td><td>%.3f</td></tr> <tr><td style="padding: 4px 16px 4px 0;">F(%d, %d)</td><td>%.2f, p = %.2e</td></tr> <tr><td style="padding: 4px 16px 4px 0;">Residual SE</td><td>%.3f</td></tr> <tr><td style="padding: 4px 16px 4px 0;">Observations</td><td>%d</td></tr> </table> </div>', model_glance$r.squared, model_glance$adj.r.squared, model_glance$df, model_glance$df.residual, model_glance$statistic, model_glance$p.value, model_glance$sigma, nrow(df_design)))) ``` ### 9.2 Coefficient Table ```{r} #| label: reg-coef model_tidy %>% mutate( across(c(estimate, std.error, statistic, conf.low, conf.high), ~round(., 4)), p.value = format.pval(p.value, digits = 3), Significance = case_when( as.numeric(p.value) < 0.001 ~ "***", as.numeric(p.value) < 0.01 ~ "**", as.numeric(p.value) < 0.05 ~ "*", as.numeric(p.value) < 0.1 ~ ".", TRUE ~ "" ) ) %>% select(Term = term, Estimate = estimate, `Std. Error` = std.error, `t-value` = statistic, `p-value` = p.value, Significance, `95% CI Low` = conf.low, `95% CI High` = conf.high) %>% kbl(caption = "Regression Coefficients — log(Project Value)") %>% kable_styling(bootstrap_options = c("striped", "hover"), full_width = TRUE, font_size = 12) %>% row_spec(0, bold = TRUE, color = "#00d4ff", background = "#1c2333") ``` ### 9.3 Coefficient Plot ```{r} #| label: reg-coef-plot #| fig-cap: "Coefficient estimates with 95% confidence intervals — terms that cross zero are not statistically significant" model_tidy %>% filter(term != "(Intercept)") %>% mutate(term = fct_reorder(term, estimate)) %>% ggplot(aes(x = estimate, y = term)) + geom_vline(xintercept = 0, colour = "#8b949e", linetype = "dashed") + geom_point(colour = "#00d4ff", size = 3) + geom_errorbarh(aes(xmin = conf.low, xmax = conf.high), height = 0.2, colour = "#00d4ff", linewidth = 0.8) + labs(title = "Regression Coefficient Estimates", subtitle = "log(Project Value) model — points show estimate, bars show 95% CI", x = "Estimate (log scale)", y = NULL) ``` ### 9.4 Diagnostic Plots ```{r} #| label: reg-diagnostics #| fig-height: 9 #| fig-cap: "Regression diagnostic suite — checking linearity, normality, homoscedasticity, and influential points" par(mfrow = c(2, 2), bg = "#161b22", fg = "#c9d1d9", col.axis = "#8b949e", col.lab = "#c9d1d9", col.main = "#f0f6fc") plot(model, col = "#00d4ff", pch = 16, cex = 0.8, sub.caption = "") ``` ```{r} #| label: reg-vif # Variance Inflation Factors vif_vals <- car::vif(model) vif_df <- if (is.matrix(vif_vals)) { data.frame(Term = rownames(vif_vals), GVIF = round(vif_vals[, 1], 2), Df = vif_vals[, 2], `GVIF^(1/2Df)` = round(vif_vals[, 3], 2)) } else { data.frame(Term = names(vif_vals), VIF = round(vif_vals, 2)) } vif_df %>% kbl(caption = "Variance Inflation Factors — checking for multicollinearity") %>% kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>% row_spec(0, bold = TRUE, color = "#00d4ff", background = "#1c2333") ``` ### 9.5 Influential Observations — Cook's Distance ```{r} #| label: reg-cooks #| fig-cap: "Cook's Distance identifies observations with outsized influence on regression coefficients" cooks_d <- cooks.distance(model) n <- nrow(df_design) k <- length(coef(model)) threshold <- 4 / n # common threshold cooks_df <- data.frame( Index = 1:n, Project_ID = df_design$Project_ID, Cooks_D = cooks_d, Influential = cooks_d > threshold ) ggplot(cooks_df, aes(x = Index, y = Cooks_D)) + geom_segment(aes(xend = Index, y = 0, yend = Cooks_D, colour = Influential), linewidth = 0.7) + geom_point(aes(colour = Influential), size = 2) + geom_hline(yintercept = threshold, colour = "#ffbe0b", linetype = "dashed", linewidth = 0.7) + annotate("text", x = n * 0.85, y = threshold + max(cooks_d) * 0.05, label = paste0("Threshold = 4/n = ", round(threshold, 3)), colour = "#ffbe0b", size = 3.5) + scale_colour_manual(values = c("FALSE" = "#00d4ff", "TRUE" = "#ff006e"), labels = c("Normal", "Influential")) + labs(title = "Cook's Distance — Influential Observation Detection", subtitle = sprintf("%d of %d observations exceed the 4/n threshold", sum(cooks_df$Influential), n), x = "Observation Index", y = "Cook's Distance", colour = NULL) ``` ```{r} #| label: reg-cooks-table # Show the most influential observations cooks_df %>% filter(Influential) %>% left_join(df_design %>% mutate(Index = row_number()) %>% select(Index, Project_ID, Project_Type, Value_Millions, Design_Duration_Days), by = c("Index", "Project_ID")) %>% arrange(desc(Cooks_D)) %>% mutate(Cooks_D = round(Cooks_D, 4), Value_Millions = round(Value_Millions, 1)) %>% head(10) %>% kbl(caption = "Most Influential Observations — these projects disproportionately shape the regression line") %>% kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>% row_spec(0, bold = TRUE, color = "#00d4ff", background = "#1c2333") ``` > **Why Cook's Distance matters:** In a small dataset of `r nrow(df_design)` observations, a single mega-project can tilt the entire regression line. Identifying influential points ensures our pricing recommendations are not driven by one or two outlier contracts. If removing an influential point materially changes the coefficients, we should investigate whether it represents an unusual circumstance (e.g., a rush project, an atypical client) rather than a general pattern. ### 9.6 Business Interpretation ```{r} #| label: reg-interpret #| echo: false # Extract key coefficients coefs <- coef(model) pvals <- model_tidy$p.value # Duration coefficient dur_coef <- coefs["Design_Duration_Days"] dur_pct <- (exp(dur_coef) - 1) * 100 rev_pct <- (exp(coefs["Revision_Count"]) - 1) * 100 dur10_value <- 50 * (exp(dur_coef * 10) - 1) htmltools::browsable(htmltools::HTML(sprintf(' <p><strong>Interpreting Key Coefficients (on the log scale → percentage change in value):</strong></p> <ul> <li><strong>Design Duration:</strong> Each additional design day is associated with a <strong>%.1f%%%% change</strong> in project value, holding other factors constant. For a project with a baseline value of ₦50M, adding 10 design days corresponds to roughly ₦%.1fM in additional value.</li> <li><strong>Revision Count:</strong> Each additional revision is associated with a <strong>%.1f%%%% change</strong> in project value. Revisions reflect project complexity — more intricate buildings require more design iterations and command higher fees.</li> <li><strong>Project Year:</strong> The year coefficient captures annual growth in project values due to inflation and the firm’s upward market positioning.</li> </ul> <div class="insight-box"> <strong>Actionable Recommendation:</strong> The regression confirms that design duration and scope complexity (revision count) are the strongest predictors of project value after controlling for building type. For proposal pricing, CA Consultants should develop a <strong>fee estimation formula</strong> anchored on expected design days and anticipated revision rounds — this model provides the empirical basis for such a formula. </div> ', dur_pct, dur10_value, rev_pct))) ``` ## Python ```{python} #| label: reg-python #| output: asis import statsmodels.api as sm import statsmodels.formula.api as smf # Prepare data df_reg_py = df_design_py.copy() df_reg_py['log_value'] = np.log(df_reg_py['Project_Value']) # Fit OLS model formula = 'log_value ~ Design_Duration_Days + Revision_Count + C(Project_Type) + C(Discipline) + Project_Year' model_py = smf.ols(formula, data=df_reg_py).fit() # Model fit summary print("**Model Fit Summary**\n") fit_df = pd.DataFrame({ "Metric": ["R-squared", "Adj. R-squared", "F-statistic", "Prob (F-statistic)", "Log-Likelihood", "AIC", "BIC", "Observations", "Df Residuals"], "Value": [f"{model_py.rsquared:.4f}", f"{model_py.rsquared_adj:.4f}", f"{model_py.fvalue:.3f}", f"{model_py.f_pvalue:.2e}", f"{model_py.llf:.2f}", f"{model_py.aic:.1f}", f"{model_py.bic:.1f}", f"{int(model_py.nobs)}", f"{int(model_py.df_resid)}"] }) print(fit_df.to_markdown(index=False)) # Coefficient table print("\n\n**Coefficient Table**\n") coef_df = pd.DataFrame({ "Term": model_py.params.index, "Coefficient": model_py.params.round(4).values, "Std Error": model_py.bse.round(4).values, "t-value": model_py.tvalues.round(3).values, "p-value": model_py.pvalues.round(4).values, "CI Lower": model_py.conf_int()[0].round(3).values, "CI Upper": model_py.conf_int()[1].round(3).values, }) coef_df["Sig"] = coef_df["p-value"].apply(lambda p: "***" if p < 0.001 else "**" if p < 0.01 else "*" if p < 0.05 else "") print(coef_df.to_markdown(index=False)) # Diagnostics print("\n\n**Residual Diagnostics**\n") from scipy.stats import jarque_bera jb_result = jarque_bera(model_py.resid) resid_skew = float(model_py.resid.skew()) resid_kurt = float(model_py.resid.kurtosis() + 3) diag_df = pd.DataFrame({ "Diagnostic": ["Durbin-Watson", "Jarque-Bera", "Prob(JB)", "Skewness", "Kurtosis", "Condition No."], "Value": [f"{sm.stats.durbin_watson(model_py.resid):.3f}", f"{jb_result.statistic:.3f}", f"{jb_result.pvalue:.2e}", f"{resid_skew:.3f}", f"{resid_kurt:.3f}", f"{np.linalg.cond(model_py.model.exog):.0f}"] }) print(diag_df.to_markdown(index=False)) ``` ```{python} #| label: reg-python-diag #| fig-cap: "Residual diagnostic plots (Python/statsmodels)" #| results: hide fig, axes = plt.subplots(1, 2, figsize=(12, 5)) # Residuals vs Fitted fitted = model_py.fittedvalues resids = model_py.resid _ = axes[0].scatter(fitted, resids, alpha=0.6, color='#00d4ff', edgecolors='none') _ = axes[0].axhline(y=0, color='#ff006e', linestyle='--', alpha=0.7) _ = axes[0].set_title('Residuals vs Fitted', color='#f0f6fc', fontweight='bold') _ = axes[0].set_xlabel('Fitted Values') _ = axes[0].set_ylabel('Residuals') # Q-Q Plot from scipy.stats import probplot _ = probplot(resids, plot=axes[1]) axes[1].get_lines()[0].set_color('#00d4ff') axes[1].get_lines()[0].set_markersize(5) axes[1].get_lines()[1].set_color('#ff006e') _ = axes[1].set_title('Normal Q-Q Plot', color='#f0f6fc', fontweight='bold') plt.tight_layout() plt.show() ``` :::    # 10. Integrated Findings The five analytical techniques applied in this study converge on a coherent narrative about CA Consultants' project portfolio and the drivers of project value: **1. The Portfolio is Concentrated and Skewed.** EDA revealed that the firm's revenue is dominated by a small number of high-value commercial projects. The top quartile of projects by value accounts for a disproportionate share of total portfolio revenue, while the majority of projects fall below ₦60M. This concentration creates both opportunity (premium commercial work drives growth) and risk (losing a single mega-project has outsized impact). **2. Commercial Projects Are Statistically More Valuable.** Hypothesis testing confirmed that the observed differences in project value across building types are not due to chance. Commercial projects command significantly higher contract values than industrial or residential engagements. This finding directly supports a business development strategy that prioritises commercial-sector clients while maintaining industrial and residential work for portfolio diversification. **3. Design Complexity Predicts Value.** Correlation analysis identified design duration and revision count as the strongest correlates of project value. The regression model quantified this: each additional design day and each additional revision round is associated with a measurable increase in contract value, after controlling for building type and discipline scope. **4. Full-Scope MEP Engagements Outperform.** Projects where CA Consultants provides the full MEP package (mechanical, electrical, and plumbing) command higher values than single-discipline (E Only or M Only) engagements. This supports a strategy of upselling comprehensive service packages rather than competing for narrow-scope contracts. ::: {.callout-tip} ## Single Integrated Recommendation CA Consultants should **anchor its growth strategy on full-scope MEP design for commercial buildings**, using the regression model's coefficients to build a transparent, data-driven fee estimation framework. Specifically: (a) prioritise commercial-sector business development, (b) quote fees based on expected design duration and revision complexity using the empirical relationships identified here, and (c) invest in design-phase process efficiency to reduce unnecessary revisions while maintaining the thoroughness that high-value clients expect. :::    # 11. Limitations & Further Work **Data limitations:** - The dataset contains `r nrow(df)` observations spanning 20 years. While this represents the firm's full digitised record, the sample size is modest for regression modelling, and earlier records (pre-2015) are sparse. - Missing design duration and revision data for supervision-only projects reduces the effective sample for correlation and regression to `r nrow(df_design)` observations. - Project value is recorded as total contract value, not broken down by MEP discipline. A discipline-level breakdown would enable more granular cost modelling. - The data captures project-level attributes but not team composition, client characteristics, or external market conditions — all of which likely influence project value. **Methodological limitations:** - Hypothesis tests used non-parametric methods due to non-normality, which are more conservative (less statistical power) than parametric alternatives. - The regression model uses cross-sectional data without controls for inflation or market conditions; the `Project_Year` variable is a rough proxy. - Some project IDs appear multiple times (e.g., different phases or proposals for the same client), which may introduce mild non-independence. **Future work with more data, time, or computing power:** - Collect **client industry codes** and **location data** (mainland Lagos vs. island) to model geographic and sectoral pricing effects. - Build a **time-series model** of project pipeline values to forecast quarterly revenue. - Introduce **design team size** and **junior/senior engineer ratio** as predictors to model the labour-cost side of project economics. - Develop an **interactive dashboard** (Shiny or Streamlit) for real-time portfolio monitoring and proposal pricing.    # References Adi, B. (2026). *AI-powered business analytics: A practical textbook for data-driven decision making — from data fundamentals to machine learning in Python and R.* Lagos Business School / markanalytics.online. <https://markanalytics.online> Mba, C. (2026). *CA Consultants Limited MEP project register (2006–2026)* [Dataset]. Collected from CA Consultants Limited, Project Management Department, Lagos, Nigeria. Data available on request from the author. R Core Team. (2024). *R: A language and environment for statistical computing* (Version 4.x). R Foundation for Statistical Computing. <https://www.R-project.org/> Van Rossum, G., & Drake, F. L. (2009). *Python 3 reference manual.* CreateSpace. Wickham, H., Averick, M., Bryan, J., Chang, W., McGowan, L., François, R., Grolemund, G., Hayes, A., Henry, L., Hester, J., Kuhn, M., Pedersen, T. L., Miller, E., Bache, S. M., Müller, K., Ooms, J., Robinson, D., Seidel, D. P., Spinu, V., … Yutani, H. (2019). Welcome to the tidyverse. *Journal of Open Source Software, 4*(43), 1686. <https://doi.org/10.21105/joss.01686> Wickham, H. (2016). *ggplot2: Elegant graphics for data analysis.* Springer. <https://doi.org/10.1007/978-3-319-24277-4> Sievert, C. (2020). *Interactive web-based data visualization with R, plotly, and shiny.* Chapman and Hall/CRC. <https://plotly-r.com> Kassambara, A. (2023). *ggcorrplot: Visualization of a correlation matrix using ggplot2* (R package). <https://CRAN.R-project.org/package=ggcorrplot> Fox, J., & Weisberg, S. (2019). *An R companion to applied regression* (3rd ed.). Sage. <https://socialsciences.mcmaster.ca/jfox/Books/Companion/> McKinney, W. (2010). Data structures for statistical computing in Python. In *Proceedings of the 9th Python in Science Conference* (pp. 56–61). <https://doi.org/10.25080/Majora-92bf1922-00a> Pedersen, T. L. (2024). *patchwork: The composer of plots* (R package). <https://CRAN.R-project.org/package=patchwork> Kassambara, A. (2023). *rstatix: Pipe-friendly framework for basic statistical tests* (R package). <https://CRAN.R-project.org/package=rstatix> Allaire, J. J., Teague, C., Scheidegger, C., Xie, Y., & Dervieux, C. (2022). *Quarto* (Version 1.x) [Computer software]. <https://doi.org/10.5281/zenodo.5960048>    # Appendix A: GitHub Repository (+5 Bonus Marks) ::: {.callout-tip} ## GitHub Repository The complete source code and data for this analysis are publicly available at: **[https://github.com/chidubemsbox-art/Dubydarego](https://github.com/chidubemsbox-art/Dubydarego)** The repository contains: - `CA_Consultants_MEP_Analysis.qmd` — the Quarto source document - `mep_projects.csv` — the anonymised project dataset - `README.md` — reproduction instructions To reproduce: clone the repo, open the `.qmd` in RStudio, and click **Render**. :::    # Appendix B: AI Usage Statement This submission was prepared with the assistance of **Claude** (Anthropic), an AI coding assistant. Claude was used for: (1) generating and debugging R and Python code for data cleaning, statistical tests, and visualisation; (2) formatting the Quarto document structure and custom CSS styling; and (3) suggesting appropriate statistical methods given the data characteristics (e.g., recommending Kruskal-Wallis over ANOVA due to non-normality). All **analytical decisions** were made independently by the author, including: the choice of Case Study 1 over alternatives, the selection of project value as the outcome variable, the formulation of both hypotheses based on real business questions at CA Consultants, the interpretation of all statistical outputs, and the integrated recommendation. The data was collected personally from CA Consultants' internal systems. The author can explain every line of code and every result in this document. ```{=html} <div class="document-footer"> <div class="footer-divider"></div> <div class="footer-name">Chidubem Mba</div> <div>Senior Project Engineer & Project Manager</div> <div style="font-size: 0.85rem; color: #8b949e; margin-top: 2px;">Matric No: 2025-MMBA-8-039</div> <div class="footer-firm">CA Consultants Limited  •  Lagos, Nigeria</div> <div class="footer-divider"></div> <div class="footer-programme"> Submitted in partial fulfilment of the requirements for the<br> <strong>Modular MBA — Data Analytics II</strong><br> Lagos Business School, Pan-Atlantic University  •  May 2026 </div> </div> ```