---
title: "Building By The Numbers"
subtitle: "What 101 Projects Reveal About Design, Complexity, and Revenue at CA Consultants Limited"
author: "Chidubem Mba — Senior Project Engineer & Project Manager"
date: today
format:
html:
theme: darkly
toc: true
toc-depth: 3
toc-title: "Navigation"
code-fold: true
code-tools: true
code-summary: "View Code"
self-contained: true
smooth-scroll: true
fig-align: center
fig-width: 10
fig-height: 6
execute:
warning: false
message: false
echo: true
---
```{=html}
<style>
/* ===== CUSTOM DARK MODERN THEME ===== */
/* Google Fonts */
@import url('https://fonts.googleapis.com/css2?family=Inter:wght@300;400;500;600;700;800&family=JetBrains+Mono:wght@400;500&display=swap');
/* Root Variables */
:root {
--bg-primary: #0d1117;
--bg-secondary: #161b22;
--bg-tertiary: #1c2333;
--border-color: #30363d;
--border-accent: #58a6ff;
--text-primary: #f0f6fc;
--text-secondary: #c9d1d9;
--text-muted: #8b949e;
--accent-cyan: #00d4ff;
--accent-magenta: #ff006e;
--accent-amber: #ffbe0b;
--accent-purple: #7928ca;
--accent-green: #00ff87;
--gradient-main: linear-gradient(135deg, #00d4ff 0%, #7928ca 50%, #ff006e 100%);
--gradient-warm: linear-gradient(135deg, #ff006e 0%, #ffbe0b 100%);
--gradient-cool: linear-gradient(135deg, #00d4ff 0%, #00ff87 100%);
--shadow-glow: 0 0 30px rgba(0, 212, 255, 0.15);
--shadow-card: 0 8px 32px rgba(0, 0, 0, 0.4);
}
/* Global */
body {
font-family: 'Inter', -apple-system, BlinkMacSystemFont, sans-serif !important;
background: var(--bg-primary) !important;
color: var(--text-secondary) !important;
line-height: 1.7;
letter-spacing: 0.01em;
}
/* ===== CONSTRUCTION BLUEPRINT BACKGROUND ===== */
body::before {
content: '';
position: fixed;
top: 0;
left: 0;
width: 100%;
height: 100%;
pointer-events: none;
z-index: -1;
background:
/* Subtle blueprint grid */
linear-gradient(rgba(0,212,255,0.03) 1px, transparent 1px),
linear-gradient(90deg, rgba(0,212,255,0.03) 1px, transparent 1px),
/* Larger grid squares */
linear-gradient(rgba(0,212,255,0.015) 2px, transparent 2px),
linear-gradient(90deg, rgba(0,212,255,0.015) 2px, transparent 2px);
background-size:
20px 20px,
20px 20px,
100px 100px,
100px 100px;
}
/* Faint engineering schematic accents on the side */
body::after {
content: '';
position: fixed;
top: 0;
right: 0;
width: 300px;
height: 100%;
pointer-events: none;
z-index: -1;
background:
radial-gradient(circle at 80% 20%, rgba(0,212,255,0.04) 0%, transparent 50%),
radial-gradient(circle at 60% 70%, rgba(121,40,202,0.03) 0%, transparent 40%),
radial-gradient(circle at 90% 90%, rgba(255,0,110,0.02) 0%, transparent 30%);
}
/* ===== DOWNLOAD DATA BUTTON ===== */
.download-data-btn {
display: inline-flex;
align-items: center;
gap: 8px;
padding: 10px 22px;
background: linear-gradient(135deg, rgba(0,212,255,0.12), rgba(121,40,202,0.10));
border: 1px solid rgba(0,212,255,0.3);
border-radius: 8px;
color: var(--accent-cyan);
font-family: 'Inter', sans-serif;
font-size: 0.9rem;
font-weight: 600;
cursor: pointer;
text-decoration: none;
transition: all 0.3s ease;
margin: 1rem 0;
}
.download-data-btn:hover {
background: linear-gradient(135deg, rgba(0,212,255,0.22), rgba(121,40,202,0.18));
border-color: var(--accent-cyan);
box-shadow: 0 0 20px rgba(0,212,255,0.2);
transform: translateY(-1px);
color: #ffffff;
text-decoration: none;
}
.download-data-btn svg {
width: 18px;
height: 18px;
fill: currentColor;
}
/* ===== MARGIN GLOW STRIPS + RULER MARKS ===== */
.margin-decor {
position: fixed;
top: 0;
height: 100%;
pointer-events: none;
z-index: 100;
display: flex;
}
.margin-decor.left {
left: 0;
}
.margin-decor.right {
right: 0;
}
.margin-decor .glow-strip {
width: 4px;
}
.margin-decor.left .glow-strip {
background: linear-gradient(180deg, #00d4ff 0%, #7928ca 40%, #ff006e 70%, transparent 100%);
box-shadow: 0 0 12px rgba(0,212,255,0.3), 0 0 24px rgba(121,40,202,0.15);
}
.margin-decor.right .glow-strip {
background: linear-gradient(180deg, transparent 0%, #ff006e 30%, #7928ca 60%, #00d4ff 100%);
box-shadow: 0 0 12px rgba(255,0,110,0.3), 0 0 24px rgba(121,40,202,0.15);
}
.margin-decor .ruler {
width: 20px;
position: relative;
background:
repeating-linear-gradient(180deg, transparent 0px, transparent 19px, rgba(0,212,255,0.1) 19px, rgba(0,212,255,0.1) 20px),
repeating-linear-gradient(180deg, transparent 0px, transparent 79px, rgba(0,212,255,0.22) 79px, rgba(0,212,255,0.22) 80px);
}
.margin-decor.left .ruler {
border-right: 1px solid rgba(0,212,255,0.1);
}
.margin-decor.right .ruler {
border-left: 1px solid rgba(0,212,255,0.1);
}
.margin-decor .ruler-number {
position: absolute;
font-size: 7px;
color: rgba(0,212,255,0.3);
font-family: 'JetBrains Mono', monospace;
}
.margin-decor.left .ruler-number {
right: 4px;
}
.margin-decor.right .ruler-number {
left: 4px;
}
/* ===== LOGO (positioned next to date) ===== */
.quarto-title-block {
position: relative;
padding: 1.5rem 0 2rem 0;
margin-bottom: 2rem;
border-bottom: 2px solid transparent;
border-image: var(--gradient-main) 1;
}
/* ===== FIRM BRANDING BAR ===== */
.firm-branding-bar {
background: linear-gradient(135deg, rgba(0,212,255,0.08), rgba(121,40,202,0.06));
border-bottom: 1px solid rgba(0,212,255,0.2);
padding: 12px 2rem;
display: flex;
align-items: center;
justify-content: space-between;
position: relative;
z-index: 10;
}
.firm-branding-bar .firm-left {
display: flex;
align-items: center;
gap: 14px;
}
.firm-branding-bar .firm-logo {
max-width: 70px;
height: auto;
border-radius: 4px;
filter: drop-shadow(0 2px 8px rgba(0,212,255,0.15));
transition: transform 0.3s ease;
}
.firm-branding-bar .firm-logo:hover {
transform: scale(1.05);
}
.firm-branding-bar .firm-name {
font-family: 'Inter', sans-serif;
font-weight: 700;
font-size: 1.1rem;
color: var(--accent-cyan);
letter-spacing: 0.02em;
}
.firm-branding-bar .firm-tagline {
font-size: 0.78rem;
color: var(--text-muted);
font-style: italic;
margin-top: 2px;
}
.firm-branding-bar .firm-location {
font-size: 0.78rem;
color: var(--text-muted);
letter-spacing: 0.02em;
}
.quarto-title .title {
font-family: 'Inter', sans-serif !important;
font-weight: 800 !important;
font-size: 2.8rem !important;
background: var(--gradient-main);
-webkit-background-clip: text;
-webkit-text-fill-color: transparent;
background-clip: text;
line-height: 1.2 !important;
margin-bottom: 0.8rem !important;
}
.quarto-title .subtitle {
font-size: 1.15rem !important;
font-style: italic !important;
font-weight: 300 !important;
letter-spacing: 0.04em;
text-transform: none !important;
background: linear-gradient(90deg, var(--accent-cyan), var(--accent-amber), var(--accent-magenta));
-webkit-background-clip: text;
-webkit-text-fill-color: transparent;
background-clip: text;
background-size: 200% auto;
animation: subtitleShimmer 4s ease-in-out infinite;
}
@keyframes subtitleShimmer {
0% { background-position: 0% center; }
50% { background-position: 100% center; }
100% { background-position: 0% center; }
}
/* Section Headings */
h1, .h1 {
font-family: 'Inter', sans-serif !important;
font-weight: 700 !important;
font-size: 2rem !important;
background: var(--gradient-main);
-webkit-background-clip: text;
-webkit-text-fill-color: transparent;
background-clip: text;
padding-top: 1.5rem;
margin-top: 3rem !important;
border-top: 1px solid var(--border-color);
}
h2, .h2 {
font-family: 'Inter', sans-serif !important;
font-weight: 600 !important;
font-size: 1.5rem !important;
color: var(--accent-cyan) !important;
margin-top: 2rem !important;
}
h3, .h3 {
font-family: 'Inter', sans-serif !important;
font-weight: 600 !important;
font-size: 1.2rem !important;
color: var(--text-primary) !important;
}
/* Main Content Container */
#quarto-content {
background: var(--bg-primary);
}
.content {
max-width: 1100px;
}
/* Cards / Section Containers */
.card-section {
background: var(--bg-secondary);
border: 1px solid var(--border-color);
border-radius: 12px;
padding: 2rem;
margin: 1.5rem 0;
box-shadow: var(--shadow-card);
transition: border-color 0.3s ease, box-shadow 0.3s ease;
}
.card-section:hover {
border-color: var(--border-accent);
box-shadow: var(--shadow-glow);
}
/* Metric Cards Grid */
.metric-grid {
display: grid;
grid-template-columns: repeat(auto-fit, minmax(200px, 1fr));
gap: 1.2rem;
margin: 2rem 0;
}
.metric-card {
background: var(--bg-secondary);
border: 1px solid var(--border-color);
border-radius: 12px;
padding: 1.5rem;
text-align: center;
position: relative;
overflow: hidden;
transition: all 0.3s ease;
}
.metric-card::before {
content: '';
position: absolute;
top: 0;
left: 0;
right: 0;
height: 3px;
background: var(--gradient-main);
}
.metric-card:hover {
transform: translateY(-4px);
box-shadow: var(--shadow-glow);
border-color: var(--accent-cyan);
}
.metric-label {
display: block;
font-size: 0.8rem;
text-transform: uppercase;
letter-spacing: 0.1em;
color: var(--text-muted);
margin-bottom: 0.5rem;
font-weight: 500;
}
.metric-value {
display: block;
font-size: 2rem;
font-weight: 700;
background: var(--gradient-main);
-webkit-background-clip: text;
-webkit-text-fill-color: transparent;
background-clip: text;
}
.metric-sub {
display: block;
font-size: 0.85rem;
color: var(--text-muted);
margin-top: 0.3rem;
}
/* Callout Boxes */
.callout {
background: var(--bg-secondary) !important;
border-radius: 10px !important;
border-left: 4px solid var(--accent-cyan) !important;
box-shadow: var(--shadow-card) !important;
}
.callout-note {
border-left-color: var(--accent-cyan) !important;
}
.callout-important {
border-left-color: var(--accent-magenta) !important;
}
.callout-tip {
border-left-color: var(--accent-green) !important;
}
/* Tab Styling */
.nav-tabs {
border-bottom: 2px solid var(--border-color) !important;
margin-bottom: 1.5rem;
}
.nav-tabs .nav-link {
font-family: 'Inter', sans-serif !important;
font-weight: 600 !important;
font-size: 0.95rem !important;
color: var(--text-muted) !important;
border: none !important;
padding: 0.8rem 1.5rem !important;
border-radius: 8px 8px 0 0 !important;
transition: all 0.3s ease !important;
text-transform: uppercase;
letter-spacing: 0.05em;
}
.nav-tabs .nav-link:hover {
color: var(--accent-cyan) !important;
background: rgba(0, 212, 255, 0.08) !important;
}
.nav-tabs .nav-link.active {
color: var(--accent-cyan) !important;
background: var(--bg-secondary) !important;
border-bottom: 3px solid var(--accent-cyan) !important;
}
/* Code Blocks */
pre {
background: var(--bg-primary) !important;
border: 1px solid var(--border-color) !important;
border-radius: 10px !important;
padding: 1.2rem !important;
font-family: 'JetBrains Mono', 'Fira Code', monospace !important;
font-size: 0.85rem !important;
box-shadow: inset 0 2px 8px rgba(0,0,0,0.3);
}
code {
font-family: 'JetBrains Mono', 'Fira Code', monospace !important;
}
.code-fold-btn-container .btn {
font-family: 'Inter', sans-serif !important;
font-size: 0.8rem !important;
color: var(--accent-cyan) !important;
border-color: var(--border-color) !important;
background: var(--bg-secondary) !important;
border-radius: 6px !important;
}
/* Tables */
.table {
background: var(--bg-secondary) !important;
color: var(--text-secondary) !important;
border-color: var(--border-color) !important;
}
.table thead th {
background: var(--bg-tertiary) !important;
color: var(--accent-cyan) !important;
border-color: var(--border-color) !important;
font-weight: 600;
text-transform: uppercase;
font-size: 0.8rem;
letter-spacing: 0.05em;
}
.table td {
border-color: var(--border-color) !important;
}
.dataTables_wrapper {
color: var(--text-secondary) !important;
}
.dataTables_wrapper .dataTables_filter input,
.dataTables_wrapper .dataTables_length select {
background: var(--bg-primary) !important;
color: var(--text-secondary) !important;
border: 1px solid var(--border-color) !important;
border-radius: 6px !important;
}
.dataTables_wrapper .dataTables_paginate .paginate_button {
color: var(--text-secondary) !important;
}
.dataTables_wrapper .dataTables_paginate .paginate_button.current {
background: var(--accent-cyan) !important;
color: var(--bg-primary) !important;
border: none !important;
border-radius: 6px !important;
}
/* TOC Sidebar */
#TOC {
background: var(--bg-secondary) !important;
border-right: 1px solid var(--border-color) !important;
padding: 1rem !important;
}
#TOC a {
color: var(--text-muted) !important;
transition: color 0.2s ease;
}
#TOC a:hover,
#TOC a.active {
color: var(--accent-cyan) !important;
}
/* Scrollbar */
::-webkit-scrollbar {
width: 8px;
height: 8px;
}
::-webkit-scrollbar-track {
background: var(--bg-primary);
}
::-webkit-scrollbar-thumb {
background: var(--border-color);
border-radius: 4px;
}
::-webkit-scrollbar-thumb:hover {
background: var(--accent-cyan);
}
/* Plotly toolbar */
.modebar {
background: transparent !important;
}
/* Smooth section reveals */
section {
animation: fadeInUp 0.6s ease-out;
}
@keyframes fadeInUp {
from {
opacity: 0;
transform: translateY(20px);
}
to {
opacity: 1;
transform: translateY(0);
}
}
/* Business insight callout */
.insight-box {
background: linear-gradient(135deg, rgba(0,212,255,0.08), rgba(121,40,202,0.08));
border: 1px solid rgba(0,212,255,0.3);
border-radius: 10px;
padding: 1.5rem;
margin: 1.5rem 0;
position: relative;
}
.insight-box::before {
content: '💡';
position: absolute;
top: -12px;
left: 16px;
font-size: 1.5rem;
background: var(--bg-primary);
padding: 0 0.5rem;
}
/* Hypothesis result box */
.result-box {
background: var(--bg-secondary);
border-left: 4px solid var(--accent-green);
border-radius: 0 10px 10px 0;
padding: 1.2rem 1.5rem;
margin: 1rem 0;
}
.result-box.reject {
border-left-color: var(--accent-magenta);
}
/* Image and figure styling */
.quarto-figure {
background: var(--bg-secondary);
border-radius: 12px;
padding: 1rem;
border: 1px solid var(--border-color);
}
/* Blockquotes */
blockquote {
border-left: 3px solid var(--accent-purple) !important;
background: rgba(121, 40, 202, 0.05) !important;
padding: 1rem 1.5rem !important;
border-radius: 0 8px 8px 0 !important;
font-style: italic;
color: var(--text-muted) !important;
}
/* Selection colour */
::selection {
background: rgba(0, 212, 255, 0.3);
color: var(--text-primary);
}
/* ===== SCROLL PROGRESS BAR ===== */
.progress-bar-container {
position: fixed;
top: 0;
left: 0;
width: 100%;
height: 3px;
z-index: 9999;
background: var(--bg-primary);
}
.progress-bar-fill {
height: 100%;
width: 0%;
background: var(--gradient-main);
transition: width 0.1s ease-out;
border-radius: 0 2px 2px 0;
box-shadow: 0 0 10px rgba(0, 212, 255, 0.5);
}
/* ===== BACK TO TOP BUTTON ===== */
.back-to-top {
position: fixed;
bottom: 2rem;
right: 2rem;
width: 48px;
height: 48px;
border-radius: 50%;
background: var(--gradient-main);
border: none;
cursor: pointer;
display: flex;
align-items: center;
justify-content: center;
opacity: 0;
visibility: hidden;
transform: translateY(20px);
transition: all 0.3s ease;
z-index: 999;
box-shadow: 0 4px 15px rgba(0, 212, 255, 0.3);
}
.back-to-top.visible {
opacity: 1;
visibility: visible;
transform: translateY(0);
}
.back-to-top:hover {
transform: translateY(-3px);
box-shadow: 0 6px 25px rgba(0, 212, 255, 0.5);
}
.back-to-top svg {
width: 24px;
height: 24px;
fill: white;
}
/* ===== HERO BANNER ===== */
.hero-banner {
position: relative;
padding: 2rem;
margin: -1rem -1rem 2rem -1rem;
border-radius: 16px;
background: linear-gradient(-45deg, #0d1117, #161b22, #1a1040, #0d2137);
background-size: 400% 400%;
animation: gradientShift 8s ease infinite;
border: 1px solid var(--border-color);
overflow: hidden;
}
.hero-banner::before {
content: '';
position: absolute;
top: -50%;
left: -50%;
width: 200%;
height: 200%;
background: radial-gradient(circle, rgba(0,212,255,0.05) 0%, transparent 50%);
animation: pulseGlow 4s ease-in-out infinite;
}
@keyframes gradientShift {
0% { background-position: 0% 50%; }
50% { background-position: 100% 50%; }
100% { background-position: 0% 50%; }
}
@keyframes pulseGlow {
0%, 100% { transform: scale(1); opacity: 0.5; }
50% { transform: scale(1.1); opacity: 1; }
}
/* ===== ANIMATED COUNTER ===== */
.metric-value[data-target] {
font-variant-numeric: tabular-nums;
}
/* ===== ENHANCED SECTION TRANSITIONS ===== */
.section-enter {
opacity: 0;
transform: translateY(30px);
transition: opacity 0.6s ease-out, transform 0.6s ease-out;
}
.section-enter.visible {
opacity: 1;
transform: translateY(0);
}
/* ===== TECHNIQUE BADGE ===== */
.technique-badge {
display: inline-block;
padding: 0.3rem 1rem;
border-radius: 20px;
font-size: 0.75rem;
font-weight: 600;
text-transform: uppercase;
letter-spacing: 0.08em;
margin-bottom: 1rem;
}
.badge-eda { background: rgba(0,212,255,0.15); color: #00d4ff; border: 1px solid rgba(0,212,255,0.3); }
.badge-viz { background: rgba(255,0,110,0.15); color: #ff006e; border: 1px solid rgba(255,0,110,0.3); }
.badge-hyp { background: rgba(255,190,11,0.15); color: #ffbe0b; border: 1px solid rgba(255,190,11,0.3); }
.badge-cor { background: rgba(121,40,202,0.15); color: #a855f7; border: 1px solid rgba(121,40,202,0.3); }
.badge-reg { background: rgba(0,255,135,0.15); color: #00ff87; border: 1px solid rgba(0,255,135,0.3); }
/* ===== STAT RESULT HIGHLIGHT ===== */
.stat-highlight {
display: inline-block;
background: rgba(0,212,255,0.1);
border: 1px solid rgba(0,212,255,0.2);
padding: 0.2rem 0.6rem;
border-radius: 6px;
font-family: 'JetBrains Mono', monospace;
font-size: 0.9rem;
color: var(--accent-cyan);
}
/* ===== CROSSTALK FILTER STYLING ===== */
.crosstalk-input-select select,
.crosstalk-input-checkboxgroup label {
background: var(--bg-primary) !important;
color: var(--text-secondary) !important;
border: 1px solid var(--border-color) !important;
border-radius: 6px;
padding: 0.4rem;
}
.crosstalk-input-select label,
.crosstalk-input-checkboxgroup label {
color: var(--text-muted) !important;
font-weight: 500;
text-transform: uppercase;
font-size: 0.75rem;
letter-spacing: 0.05em;
}
/* ===== PROFESSIONAL FOOTER ===== */
.document-footer {
margin-top: 4rem;
padding: 2rem 0;
border-top: 2px solid transparent;
border-image: var(--gradient-main) 1;
text-align: center;
color: var(--text-muted);
font-size: 0.9rem;
line-height: 1.8;
}
.document-footer .footer-name {
font-weight: 700;
font-size: 1.05rem;
background: var(--gradient-main);
-webkit-background-clip: text;
-webkit-text-fill-color: transparent;
background-clip: text;
}
.document-footer .footer-firm {
color: var(--accent-cyan);
font-weight: 500;
}
.document-footer .footer-divider {
width: 80px;
height: 2px;
background: var(--gradient-main);
margin: 1rem auto;
border-radius: 2px;
}
.document-footer .footer-programme {
font-style: italic;
color: var(--text-secondary);
font-size: 0.85rem;
max-width: 600px;
margin: 0.5rem auto 0;
}
/* Print-friendly overrides */
@media print {
body { background: white !important; color: black !important; }
h1, h2, h3 { -webkit-text-fill-color: #1a1a2e !important; }
.metric-value { -webkit-text-fill-color: #1a1a2e !important; }
.progress-bar-container, .back-to-top, .margin-decor { display: none !important; }
.document-footer .footer-name { -webkit-text-fill-color: #1a1a2e !important; }
}
</style>
```
```{r}
#| label: embed-logo
#| echo: false
#| results: asis
logo_path <- "ca_logo.png"
if (file.exists(logo_path)) {
logo_b64 <- base64enc::base64encode(logo_path)
logo_uri <- paste0("data:image/png;base64,", logo_b64)
} else {
logo_uri <- ""
}
htmltools::browsable(htmltools::HTML(sprintf(
'<script>var caLogoDataURI = "%s";</script>', logo_uri
)))
```
```{=html}
<!-- Progress Bar -->
<div class="progress-bar-container"><div class="progress-bar-fill" id="progressBar"></div></div>
<!-- Back to Top Button -->
<button class="back-to-top" id="backToTop" onclick="window.scrollTo({top:0,behavior:'smooth'})">
<svg viewBox="0 0 24 24"><path d="M12 4l-8 8h5v8h6v-8h5z"/></svg>
</button>
<script>
// ---- Scroll Progress Bar ----
window.addEventListener('scroll', function() {
var scrollTop = window.scrollY;
var docHeight = document.documentElement.scrollHeight - window.innerHeight;
var progress = (scrollTop / docHeight) * 100;
document.getElementById('progressBar').style.width = progress + '%';
// Back to top visibility
var btn = document.getElementById('backToTop');
if (scrollTop > 500) { btn.classList.add('visible'); }
else { btn.classList.remove('visible'); }
});
// ---- Animated Counters ----
function animateCounter(el) {
var target = parseFloat(el.getAttribute('data-target'));
var prefix = el.getAttribute('data-prefix') || '';
var suffix = el.getAttribute('data-suffix') || '';
var decimals = parseInt(el.getAttribute('data-decimals')) || 0;
var duration = 1500;
var startTime = null;
function step(timestamp) {
if (!startTime) startTime = timestamp;
var progress = Math.min((timestamp - startTime) / duration, 1);
var eased = 1 - Math.pow(1 - progress, 3); // ease-out cubic
var current = eased * target;
el.textContent = prefix + current.toFixed(decimals) + suffix;
if (progress < 1) requestAnimationFrame(step);
}
requestAnimationFrame(step);
}
// Intersection Observer for counters
var counterObserver = new IntersectionObserver(function(entries) {
entries.forEach(function(entry) {
if (entry.isIntersecting && !entry.target.classList.contains('counted')) {
entry.target.classList.add('counted');
animateCounter(entry.target);
}
});
}, { threshold: 0.5 });
document.addEventListener('DOMContentLoaded', function() {
// Inject margin glow strips + ruler marks
function createMarginDecor(side) {
var decor = document.createElement('div');
decor.className = 'margin-decor ' + side;
var glow = document.createElement('div');
glow.className = 'glow-strip';
var ruler = document.createElement('div');
ruler.className = 'ruler';
for (var i = 1; i <= 20; i++) {
var num = document.createElement('span');
num.className = 'ruler-number';
num.style.top = (i * 80 - 2) + 'px';
num.textContent = i;
ruler.appendChild(num);
}
if (side === 'left') {
decor.appendChild(glow);
decor.appendChild(ruler);
} else {
decor.appendChild(ruler);
decor.appendChild(glow);
}
document.body.appendChild(decor);
}
createMarginDecor('left');
createMarginDecor('right');
// Inject firm branding bar at the very top of the page
var brandingBar = document.createElement('div');
brandingBar.className = 'firm-branding-bar';
brandingBar.innerHTML = '' +
'<div class="firm-left">' +
'<img src="' + (typeof caLogoDataURI !== 'undefined' && caLogoDataURI ? caLogoDataURI : 'ca_logo.png') + '" alt="CA Consultants Limited" class="firm-logo" />' +
'<div>' +
'<div class="firm-name">CA Consultants Limited</div>' +
'<div class="firm-tagline">MEP Engineering • Design • Supervision</div>' +
'</div>' +
'</div>' +
'<div class="firm-location">Lagos, Nigeria</div>';
var mainContent = document.querySelector('#quarto-content') || document.querySelector('main') || document.body.firstChild;
if (mainContent) {
mainContent.parentNode.insertBefore(brandingBar, mainContent);
} else {
document.body.insertBefore(brandingBar, document.body.firstChild);
}
// Animated counters
document.querySelectorAll('.metric-value[data-target]').forEach(function(el) {
counterObserver.observe(el);
});
// Section enter animations
var sectionObserver = new IntersectionObserver(function(entries) {
entries.forEach(function(entry) {
if (entry.isIntersecting) entry.target.classList.add('visible');
});
}, { threshold: 0.1 });
document.querySelectorAll('section').forEach(function(s) {
s.classList.add('section-enter');
sectionObserver.observe(s);
});
});
</script>
```
<!-- ============================================================== -->
<!-- SECTION 0: SETUP -->
<!-- ============================================================== -->
```{r}
#| label: setup
#| include: false
# ---- Install missing packages (runs silently) ----
required_pkgs <- c(
"tidyverse", "plotly", "DT", "ggcorrplot", "car", "broom",
"scales", "kableExtra", "patchwork", "rstatix", "moments",
"viridis", "effectsize", "crosstalk", "htmltools", "base64enc"
)
invisible(lapply(required_pkgs, function(pkg) {
if (!requireNamespace(pkg, quietly = TRUE)) install.packages(pkg, quiet = TRUE)
}))
# ---- Load libraries ----
library(tidyverse)
library(plotly)
library(DT)
library(ggcorrplot)
library(car)
library(broom)
library(scales)
library(kableExtra)
library(patchwork)
library(rstatix)
library(moments)
library(viridis)
library(effectsize)
# ---- Custom ggplot2 dark theme ----
theme_ca_dark <- function(base_size = 13) {
theme_minimal(base_size = base_size) %+replace%
theme(
plot.background = element_rect(fill = "#161b22", colour = NA),
panel.background = element_rect(fill = "#161b22", colour = NA),
panel.grid.major = element_line(colour = "#30363d", linewidth = 0.3),
panel.grid.minor = element_blank(),
text = element_text(colour = "#c9d1d9", family = "sans"),
plot.title = element_text(colour = "#f0f6fc", size = 16, face = "bold",
margin = margin(b = 8)),
plot.subtitle = element_text(colour = "#8b949e", size = 12,
margin = margin(b = 12)),
plot.caption = element_text(colour = "#8b949e", size = 9, hjust = 1),
axis.text = element_text(colour = "#8b949e", size = 10),
axis.title = element_text(colour = "#c9d1d9", size = 11),
legend.background = element_rect(fill = "#161b22", colour = NA),
legend.key = element_rect(fill = "#161b22", colour = NA),
legend.text = element_text(colour = "#c9d1d9"),
legend.title = element_text(colour = "#f0f6fc", face = "bold"),
strip.text = element_text(colour = "#f0f6fc", face = "bold", size = 11),
plot.margin = margin(15, 15, 15, 15)
)
}
# ---- Colour palettes ----
pal_type <- c("Commercial" = "#00d4ff", "Industrial" = "#ff006e", "Residential" = "#ffbe0b")
pal_status <- c("Completed" = "#00ff87", "Ongoing" = "#ffbe0b", "Proposal" = "#8b949e")
pal_disc <- c("MEP" = "#00d4ff", "E Only" = "#ff006e", "M Only" = "#ffbe0b")
ca_accent <- c("#00d4ff", "#ff006e", "#ffbe0b", "#7928ca", "#00ff87", "#ff6b35")
# Apply theme globally
theme_set(theme_ca_dark())
```
```{r}
#| label: load-data
#| include: false
# ---- Load and clean data ----
df_raw <- read_csv("mep_projects.csv", show_col_types = FALSE)
df <- df_raw %>%
mutate(
# Handle Project_Value_NGN whether read_csv parsed it as character or numeric
Project_Value = if (is.character(Project_Value_NGN))
parse_number(Project_Value_NGN)
else
as.numeric(Project_Value_NGN),
Design_Duration_Days = as.numeric(Design_Duration_Days),
Revision_Count = as.numeric(Revision_Count),
Project_Year = as.integer(Project_Year),
Project_Type = factor(Project_Type, levels = c("Commercial", "Industrial", "Residential")),
Project_Status = factor(Project_Status, levels = c("Completed", "Ongoing", "Proposal")),
Discipline = factor(Discipline),
# Derived variables
Value_Millions = Project_Value / 1e6,
Log_Value = log(Project_Value),
Era = case_when(
Project_Year <= 2017 ~ "Early (2006-2017)",
Project_Year <= 2022 ~ "Growth (2018-2022)",
TRUE ~ "Recent (2023-2026)"
),
Era = factor(Era, levels = c("Early (2006-2017)", "Growth (2018-2022)", "Recent (2023-2026)")),
Is_Design = !is.na(Design_Duration_Days),
Value_Category = case_when(
Value_Millions < 15 ~ "Small (<₦15M)",
Value_Millions < 60 ~ "Medium (₦15-60M)",
Value_Millions < 150 ~ "Large (₦60-150M)",
TRUE ~ "Mega (>₦150M)"
),
Value_Category = factor(Value_Category, levels = c(
"Small (<₦15M)", "Medium (₦15-60M)", "Large (₦60-150M)", "Mega (>₦150M)"
))
)
# Subset with complete design data
df_design <- df %>% filter(!is.na(Design_Duration_Days) & !is.na(Revision_Count))
```
<!-- ============================================================== -->
<!-- SECTION 1: EXECUTIVE SUMMARY -->
<!-- ============================================================== -->
# 1. Executive Summary
This study analyses the project portfolio of **CA Consultants Limited**, a leading MEP (Mechanical, Electrical, and Plumbing) consulting firm headquartered in Lagos, Nigeria. Drawing on `r nrow(df)` real project records spanning `r min(df$Project_Year, na.rm=TRUE)`--`r max(df$Project_Year, na.rm=TRUE)`, extracted from the firm's internal project management systems, I investigate **what drives project value** across different building types, service disciplines, and design complexity levels. Five foundational analytical techniques --- exploratory data analysis, data visualisation, hypothesis testing, correlation analysis, and multiple linear regression --- are applied to move from exploration to inference. Key findings reveal that the portfolio is heavily right-skewed, with a small number of high-value projects contributing disproportionately to total revenue. Commercial projects command significantly higher values than residential and industrial engagements, and design duration together with revision count are meaningful predictors of project value. The analysis yields an actionable pricing and resource-allocation framework: **longer-duration, multi-discipline MEP projects consistently deliver greater value, and the firm should prioritise commercial engagements while investing in design-phase efficiency to reduce costly revision cycles.**
<!-- ============================================================== -->
<!-- SECTION 2: PROFESSIONAL DISCLOSURE -->
<!-- ============================================================== -->
# 2. Professional Disclosure
**Name:** Chidubem Mba
**Title:** Senior Project Engineer & Project Manager
**Organisation:** CA Consultants Limited, Lagos, Nigeria
**Sector:** Construction / Building Services (MEP Engineering)
CA Consultants Limited is a Mechanical, Electrical, and Plumbing (MEP) consulting firm operating in the Nigerian construction industry. The firm provides design and construction supervision services for commercial, industrial, and residential buildings --- from office towers and factories to luxury residences. As a Senior Project Engineer and Project Manager, I am responsible for managing design deliverables, coordinating with architects and contractors, overseeing project timelines, and ensuring quality of MEP installations.
### Technique Justification
**Exploratory Data Analysis (EDA):**
Understanding the composition and quality of our project data is the essential first step. As a project manager, I need to know the distribution of project sizes, identify data gaps (such as supervision-only projects lacking design records), and detect anomalous entries before any strategic conclusions can be drawn. EDA allows me to assess the health of our data and uncover patterns that inform all subsequent analyses.
**Data Visualisation:**
Communicating project performance to firm leadership and clients requires clear, compelling visual narratives. Visualisation transforms raw project records into actionable intelligence --- showing at a glance which sectors drive revenue, how portfolio composition has shifted over time, and where design complexity concentrates. Every plot in this study answers a question a principal at CA Consultants would ask in a management meeting.
**Hypothesis Testing:**
Resource allocation decisions at CA Consultants --- such as whether to pursue more commercial versus industrial work --- must be grounded in statistical evidence, not intuition. Formal hypothesis testing allows me to determine whether observed differences in project value across building types are statistically significant or merely due to sampling variability. This directly informs our business development strategy.
**Correlation Analysis:**
Understanding the relationships between project characteristics --- value, design duration, revision count, and scope --- is critical for project planning. If design duration correlates strongly with project value, that relationship shapes how we quote fees and allocate engineering hours. Identifying which correlations are strongest, and which are spurious, helps avoid costly planning errors.
**Linear Regression:**
The ultimate goal is a quantitative model that explains and predicts project value based on measurable project attributes. A regression model allows me to quantify the marginal contribution of each extra design day, each additional revision, and each project type to the contract value. This directly supports fee-setting, proposal evaluation, and capacity planning at CA Consultants.
<!-- ============================================================== -->
<!-- SECTION 3: DATA COLLECTION & SAMPLING -->
<!-- ============================================================== -->
# 3. Data Collection & Sampling
### Source and Collection Method
The dataset was extracted directly from the **internal project register** of CA Consultants Limited. This register is maintained by the project management team (of which I am a member) and records every project the firm undertakes --- from initial proposal through to completion. Data extraction was performed manually from the company's enterprise filing system and project tracking spreadsheets in April 2026.
### Sampling Frame and Size
The dataset constitutes a **census of all identifiable projects** in the firm's records from 2006 to 2026, comprising **`r nrow(df)` project records**. This is not a sample but a near-complete enumeration of the firm's project history, subject to the availability of records in the company's systems. Projects prior to 2006 were excluded due to incomplete digitisation of older records.
### Variables Collected
| Variable | Type | Description |
|:---|:---|:---|
| `Project_ID` | Identifier | Unique project code assigned by the firm |
| `Project_Type` | Categorical | Building type: Commercial, Industrial, or Residential |
| `Project_Status` | Categorical | Current status: Completed, Ongoing, or Proposal |
| `Project_Value_NGN` | Numeric (continuous) | Contract value in Nigerian Naira (₦) |
| `Design_Duration_Days` | Numeric (continuous) | Duration of the design phase in calendar days |
| `Revision_Count` | Numeric (discrete) | Number of design revisions issued |
| `Project_Year` | Date/Time (year) | Year the project was initiated |
| `Discipline` | Categorical | Scope of MEP services: MEP (full), E Only, M Only |
### Handling of Missing Data
A total of **`r sum(is.na(df$Design_Duration_Days))` observations** are missing `Design_Duration_Days` and **`r sum(is.na(df$Revision_Count))` observations** are missing `Revision_Count`. These are **structurally missing** --- not data-entry errors. These projects were **construction supervision engagements only**: CA Consultants supervised the on-site installation of MEP systems designed by another firm. Since no design work was performed in-house, no design duration or revision count exists. This is documented and handled transparently throughout the analysis --- these records are retained for portfolio-level analyses (value, type, year) and excluded only from design-specific analyses (correlation, regression).
### Ethical Statement
This data was collected in my capacity as a Senior Project Engineer with authorised access to the firm's project records. All project identifiers use internal codes (e.g., CA658, CA042/PJ027) that do not reveal client names. Financial values are contract-level figures, not client-confidential breakdowns. No personally identifiable information (PII) is included. Permission to use anonymised project data for academic purposes was obtained from CA Consultants management.
<!-- ============================================================== -->
<!-- SECTION 4: DATA DESCRIPTION -->
<!-- ============================================================== -->
# 4. Data Description
```{r}
#| label: data-metrics
#| echo: false
n_proj <- nrow(df)
total_val <- sum(df$Project_Value, na.rm = TRUE)
avg_val <- mean(df$Project_Value, na.rm = TRUE)
med_val <- median(df$Project_Value, na.rm = TRUE)
n_completed <- sum(df$Project_Status == "Completed")
avg_dur <- mean(df$Design_Duration_Days, na.rm = TRUE)
htmltools::browsable(htmltools::HTML(sprintf('
<div class="hero-banner">
<div class="metric-grid">
<div class="metric-card">
<span class="metric-label">Total Projects</span>
<span class="metric-value" data-target="%d" data-decimals="0">0</span>
<span class="metric-sub">2006 — 2026</span>
</div>
<div class="metric-card">
<span class="metric-label">Portfolio Value</span>
<span class="metric-value" data-target="%.1f" data-prefix="₦" data-suffix="B" data-decimals="1">0</span>
<span class="metric-sub">Aggregate Contract Value</span>
</div>
<div class="metric-card">
<span class="metric-label">Median Project Value</span>
<span class="metric-value" data-target="%.1f" data-prefix="₦" data-suffix="M" data-decimals="1">0</span>
<span class="metric-sub">Central tendency</span>
</div>
<div class="metric-card">
<span class="metric-label">Completed Projects</span>
<span class="metric-value" data-target="%d" data-decimals="0">0</span>
<span class="metric-sub">%.0f%%%% completion rate</span>
</div>
<div class="metric-card">
<span class="metric-label">Avg. Design Duration</span>
<span class="metric-value" data-target="%.0f" data-suffix=" days" data-decimals="0">0</span>
<span class="metric-sub">Design-phase projects</span>
</div>
</div>
</div>
', n_proj, total_val/1e9, med_val/1e6, n_completed, 100*n_completed/n_proj, avg_dur)))
```
```{r}
#| label: download-btn-description
#| echo: false
csv_text <- paste(readLines("mep_projects.csv", warn = FALSE), collapse = "\n")
csv_b64 <- base64enc::base64encode(charToRaw(csv_text))
htmltools::browsable(htmltools::HTML(sprintf('
<a class="download-data-btn"
href="data:text/csv;base64,%s"
download="mep_projects.csv">
<svg viewBox="0 0 24 24"><path d="M12 16l-5-5h3V4h4v7h3l-5 5zm-7 2h14v2H5v-2z"/></svg>
Download Dataset (CSV)
</a>
', csv_b64)))
```
## 4.1 Summary Statistics
```{r}
#| label: summary-stats
df %>%
select(Project_Value, Design_Duration_Days, Revision_Count, Project_Year) %>%
pivot_longer(everything(), names_to = "Variable", values_to = "Value") %>%
group_by(Variable) %>%
summarise(
N = sum(!is.na(Value)),
Missing = sum(is.na(Value)),
Mean = mean(Value, na.rm = TRUE),
Median = median(Value, na.rm = TRUE),
SD = sd(Value, na.rm = TRUE),
Min = min(Value, na.rm = TRUE),
Max = max(Value, na.rm = TRUE),
Skewness = skewness(Value, na.rm = TRUE),
.groups = "drop"
) %>%
mutate(across(where(is.numeric), ~round(., 2))) %>%
kbl(caption = "Descriptive Statistics for Numeric Variables",
format.args = list(big.mark = ",")) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
full_width = TRUE, font_size = 13) %>%
row_spec(0, bold = TRUE, color = "#00d4ff", background = "#1c2333")
```
## 4.2 Categorical Variable Distribution
```{r}
#| label: cat-distribution
cat_summary <- bind_rows(
df %>% count(Project_Type) %>% rename(Category = Project_Type) %>% mutate(Variable = "Project_Type"),
df %>% count(Project_Status) %>% rename(Category = Project_Status) %>% mutate(Variable = "Project_Status"),
df %>% count(Discipline) %>% rename(Category = Discipline) %>% mutate(Variable = "Discipline")
) %>%
group_by(Variable) %>%
mutate(Pct = round(100 * n / sum(n), 1)) %>%
ungroup() %>%
select(Variable, Category, Count = n, `%` = Pct)
cat_summary %>%
kbl(caption = "Frequency Distribution of Categorical Variables") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
full_width = TRUE, font_size = 13) %>%
row_spec(0, bold = TRUE, color = "#00d4ff", background = "#1c2333") %>%
pack_rows(index = table(cat_summary$Variable))
```
## 4.3 Interactive Data Explorer
Use the dropdown filters below to dynamically explore the dataset by building type, project status, or discipline scope.
```{r}
#| label: download-btn-explorer
#| echo: false
csv_text2 <- paste(readLines("mep_projects.csv", warn = FALSE), collapse = "\n")
csv_b64_2 <- base64enc::base64encode(charToRaw(csv_text2))
htmltools::browsable(htmltools::HTML(sprintf('
<a class="download-data-btn"
href="data:text/csv;base64,%s"
download="mep_projects.csv">
<svg viewBox="0 0 24 24"><path d="M12 16l-5-5h3V4h4v7h3l-5 5zm-7 2h14v2H5v-2z"/></svg>
Download Dataset (CSV)
</a>
', csv_b64_2)))
```
```{r}
#| label: data-table
library(crosstalk)
# Prepare shareable data for crosstalk
df_shared <- df %>%
select(Project_ID, Project_Type, Project_Status, Value_Millions,
Design_Duration_Days, Revision_Count, Project_Year, Discipline) %>%
mutate(Value_Millions = round(Value_Millions, 2)) %>%
rename(`Value (₦M)` = Value_Millions, `Duration (Days)` = Design_Duration_Days,
Revisions = Revision_Count, Year = Project_Year,
Type = Project_Type, Status = Project_Status, ID = Project_ID)
shared_df <- SharedData$new(df_shared)
# Crosstalk filter widgets
bscols(widths = c(4, 4, 4),
filter_select("type_filter", "Building Type", shared_df, ~Type),
filter_select("status_filter", "Project Status", shared_df, ~Status),
filter_select("disc_filter", "Discipline", shared_df, ~Discipline)
)
```
```{r}
#| label: data-table-linked
datatable(shared_df,
caption = "Full Project Dataset — CA Consultants Limited (use filters above)",
options = list(pageLength = 10, scrollX = TRUE,
dom = 'frtip',
initComplete = JS(
"function(settings, json) {",
" $(this.api().table().header()).css({'background-color': '#1c2333', 'color': '#00d4ff'});",
"}"
)),
rownames = FALSE
)
```
<!-- ============================================================== -->
<!-- SECTION 5: TECHNIQUE 1 — EXPLORATORY DATA ANALYSIS -->
<!-- ============================================================== -->
# 5. Exploratory Data Analysis (EDA)
```{=html}
<span class="technique-badge badge-eda">Technique 1 of 5 • Ch. 4</span>
```
> *"Far better an approximate answer to the right question... than an exact answer to the wrong question." --- John Tukey*
**Business Justification:** Before building models or testing hypotheses, a project manager must understand the shape and quality of the data that drives decisions. EDA reveals whether our project records are complete, whether extreme values distort averages, and what the underlying distributions look like. At CA Consultants, this means understanding whether our reported portfolio metrics faithfully represent actual project economics. As Anscombe's Quartet (1973) memorably demonstrates, summary statistics alone can conceal wildly different data structures --- the only remedy is to *look* at the data before modelling it.
::: {.panel-tabset}
## R
### 5.1 Missing Value Analysis
```{r}
#| label: eda-missing
missing_summary <- df %>%
summarise(across(everything(), ~sum(is.na(.)))) %>%
pivot_longer(everything(), names_to = "Variable", values_to = "Missing") %>%
mutate(
Total = nrow(df),
`% Missing` = round(100 * Missing / Total, 1),
Status = if_else(Missing == 0, "Complete", "Has Missing")
) %>%
filter(Missing > 0) %>%
arrange(desc(Missing))
missing_summary %>%
kbl(caption = "Variables with Missing Values") %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE, font_size = 13) %>%
row_spec(0, bold = TRUE, color = "#00d4ff", background = "#1c2333")
```
```{=html}
<div class="insight-box">
<strong>Data Quality Finding 1 — Structurally Missing Values:</strong>
Design_Duration_Days and Revision_Count are missing for projects where CA Consultants provided <strong>construction supervision only</strong> — no design work was performed in-house, so no design duration or revision history exists. These are not data-entry errors but reflect a genuine operational distinction. These records are retained for portfolio-level analysis and excluded from design-specific modelling.
</div>
```
### 5.2 Outlier Detection
```{r}
#| label: eda-outliers
# IQR-based outlier detection on Project Value
Q1 <- quantile(df$Project_Value, 0.25, na.rm = TRUE)
Q3 <- quantile(df$Project_Value, 0.75, na.rm = TRUE)
IQR_val <- Q3 - Q1
lower_fence <- Q1 - 1.5 * IQR_val
upper_fence <- Q3 + 1.5 * IQR_val
outliers <- df %>% filter(Project_Value > upper_fence | Project_Value < lower_fence)
htmltools::browsable(htmltools::HTML(sprintf('
<div style="background: var(--bg-tertiary); border-left: 4px solid var(--accent-amber); padding: 1rem 1.5rem; border-radius: 6px; margin: 1rem 0;">
<h4 style="color: var(--accent-amber); margin: 0 0 0.5rem;">IQR Method for Project Value</h4>
<table style="font-size: 0.95rem; color: var(--text-secondary);">
<tr><td style="padding: 4px 16px 4px 0;">Q1</td><td>₦%s</td></tr>
<tr><td style="padding: 4px 16px 4px 0;">Q3</td><td>₦%s</td></tr>
<tr><td style="padding: 4px 16px 4px 0;">IQR</td><td>₦%s</td></tr>
<tr><td style="padding: 4px 16px 4px 0;">Upper fence</td><td>₦%s</td></tr>
<tr><td style="padding: 4px 16px 4px 0;"><strong>Outliers detected</strong></td><td><strong>%d</strong> (%.1f%%%% of data) — all on the high end</td></tr>
</table>
</div>',
comma(Q1), comma(Q3), comma(IQR_val),
comma(upper_fence), nrow(outliers), 100*nrow(outliers)/nrow(df))))
```
```{r}
#| label: eda-outlier-plot
#| fig-cap: "Project Value Distribution with Outlier Boundaries"
ggplot(df, aes(x = Project_Value / 1e6)) +
geom_histogram(aes(y = after_stat(density)), bins = 30, fill = "#00d4ff", alpha = 0.6, colour = "#161b22") +
geom_density(colour = "#ff006e", linewidth = 1.2) +
geom_vline(xintercept = upper_fence / 1e6, colour = "#ffbe0b", linetype = "dashed", linewidth = 0.8) +
annotate("text", x = upper_fence / 1e6 + 15, y = Inf, label = "Outlier Fence",
colour = "#ffbe0b", vjust = 2, hjust = 0, size = 3.5) +
labs(title = "Distribution of Project Values",
subtitle = "Right-skewed with high-value outliers typical of construction portfolios",
x = "Project Value (₦ Millions)", y = "Density") +
scale_x_continuous(labels = comma_format())
```
```{=html}
<div class="insight-box">
<strong>Data Quality Finding 2 — Right Skewness:</strong>
Project values are heavily right-skewed (skewness ≈ 2+), meaning a few mega-projects (>₦150M) pull the mean well above the median. This is typical of construction consulting portfolios where a small number of landmark projects generate disproportionate revenue. For regression analysis, a <strong>log-transformation</strong> will be applied to stabilise variance and improve model fit.
</div>
```
### 5.3 Distribution by Category
```{r}
#| label: eda-distributions
#| fig-height: 8
p1 <- ggplot(df, aes(x = Project_Type, fill = Project_Type)) +
geom_bar(width = 0.6, alpha = 0.85) +
scale_fill_manual(values = pal_type) +
labs(title = "Projects by Building Type", x = NULL, y = "Count") +
guides(fill = "none")
p2 <- ggplot(df, aes(x = Project_Status, fill = Project_Status)) +
geom_bar(width = 0.6, alpha = 0.85) +
scale_fill_manual(values = pal_status) +
labs(title = "Projects by Status", x = NULL, y = "Count") +
guides(fill = "none")
p3 <- ggplot(df, aes(x = Discipline, fill = Discipline)) +
geom_bar(width = 0.6, alpha = 0.85) +
scale_fill_manual(values = pal_disc) +
labs(title = "Projects by Discipline Scope", x = NULL, y = "Count") +
guides(fill = "none")
(p1 | p2 | p3) +
plot_annotation(
title = "Portfolio Composition at a Glance",
subtitle = "Commercial projects and full-scope MEP engagements dominate the portfolio",
theme = theme_ca_dark()
)
```
## Python
```{python}
#| label: eda-python
#| output: asis
import subprocess, sys
_ = subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", "tabulate"])
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib
matplotlib.rcParams.update({
'figure.facecolor': '#161b22',
'axes.facecolor': '#161b22',
'axes.edgecolor': '#30363d',
'axes.labelcolor': '#c9d1d9',
'text.color': '#c9d1d9',
'xtick.color': '#8b949e',
'ytick.color': '#8b949e',
'grid.color': '#30363d',
'font.family': 'sans-serif',
'font.size': 11
})
# Load and clean data
df_py = pd.read_csv("mep_projects.csv")
if df_py['Project_Value_NGN'].dtype == 'object':
df_py['Project_Value'] = pd.to_numeric(df_py['Project_Value_NGN'].str.replace(r'[₦, ]', '', regex=True), errors='coerce')
else:
df_py['Project_Value'] = pd.to_numeric(df_py['Project_Value_NGN'], errors='coerce')
df_py['Value_Millions'] = df_py['Project_Value'] / 1e6
# Summary statistics
desc = df_py[['Project_Value', 'Design_Duration_Days', 'Revision_Count', 'Project_Year']].describe().round(2)
print("\n**Numeric Variable Summary**\n")
print(desc.to_markdown())
# Missing values
missing = df_py.isnull().sum().to_frame("Missing Count")
missing["Present"] = len(df_py) - missing["Missing Count"]
missing["% Missing"] = (100 * missing["Missing Count"] / len(df_py)).round(1)
print("\n\n**Missing Values**\n")
print(missing.to_markdown())
# Categorical distributions
print("\n\n**Categorical Distributions**\n")
for col in ['Project_Type', 'Project_Status', 'Discipline']:
counts = df_py[col].value_counts().to_frame("Count")
counts["Percentage"] = (100 * counts["Count"] / len(df_py)).round(1).astype(str) + "%"
print(f"\n*{col}*\n")
print(counts.to_markdown())
```
```{python}
#| label: eda-python-plot
#| fig-cap: "Distribution of Project Values (Python/Matplotlib)"
#| results: hide
fig, axes = plt.subplots(1, 3, figsize=(14, 5))
# Histogram of values
_ = axes[0].hist(df_py['Value_Millions'].dropna(), bins=25, color='#00d4ff', alpha=0.7, edgecolor='#161b22')
_ = axes[0].set_title('Project Value Distribution', color='#f0f6fc', fontweight='bold')
_ = axes[0].set_xlabel('Value (₦ Millions)')
_ = axes[0].set_ylabel('Frequency')
# Bar chart: Project Type
type_counts = df_py['Project_Type'].value_counts()
colors_type = ['#00d4ff', '#ff006e', '#ffbe0b']
_ = axes[1].barh(type_counts.index, type_counts.values, color=colors_type[:len(type_counts)], alpha=0.85)
_ = axes[1].set_title('Projects by Type', color='#f0f6fc', fontweight='bold')
_ = axes[1].set_xlabel('Count')
# Bar chart: Status
status_counts = df_py['Project_Status'].value_counts()
colors_status = ['#00ff87', '#ffbe0b', '#8b949e']
_ = axes[2].barh(status_counts.index, status_counts.values, color=colors_status[:len(status_counts)], alpha=0.85)
_ = axes[2].set_title('Projects by Status', color='#f0f6fc', fontweight='bold')
_ = axes[2].set_xlabel('Count')
plt.tight_layout()
plt.show()
```
:::
<!-- ============================================================== -->
<!-- SECTION 6: TECHNIQUE 2 — DATA VISUALISATION -->
<!-- ============================================================== -->
# 6. Data Visualisation
```{=html}
<span class="technique-badge badge-viz">Technique 2 of 5 • Ch. 5</span>
```
**Business Justification:** Visualisation transforms CA Consultants' project records into a narrative that leadership can act on. Each plot below answers a specific strategic question: Where does our revenue concentrate? How has our portfolio evolved? What is the relationship between design effort and project scale? Choosing the right chart type for each question --- and explaining why --- is itself an analytical skill.
::: {.panel-tabset}
## R
### 6.1 Portfolio Value by Building Type
```{r}
#| label: viz-violin
#| fig-cap: "Distribution of Project Values by Building Type — violin + jitter reveals the full shape, not just the average"
p <- ggplot(df, aes(x = Project_Type, y = Value_Millions, fill = Project_Type)) +
geom_violin(alpha = 0.3, colour = NA, scale = "width") +
geom_boxplot(width = 0.15, alpha = 0.6, outlier.shape = NA, colour = "#c9d1d9") +
geom_jitter(aes(colour = Project_Type), width = 0.12, alpha = 0.6, size = 2) +
scale_fill_manual(values = pal_type) +
scale_colour_manual(values = pal_type) +
labs(title = "Project Value Distribution by Building Type",
subtitle = "Commercial projects show the widest range; Residential clusters at lower values",
x = NULL, y = "Project Value (₦ Millions)") +
guides(fill = "none", colour = "none") +
scale_y_continuous(labels = comma_format())
ggplotly(p, tooltip = c("y")) %>%
layout(paper_bgcolor = '#0d1117', plot_bgcolor = '#161b22',
font = list(color = '#c9d1d9'))
```
> **Why a violin + boxplot?** A bar chart would show only the mean, hiding the skewness and multimodality that define our portfolio. The violin reveals that most commercial projects cluster below ₦100M, with a long tail of high-value engagements. This shape directly informs risk: our revenue depends on winning a few large contracts.
### 6.2 Portfolio Evolution Over Time
```{r}
#| label: viz-timeline
#| fig-cap: "Annual project count and composition showing the firm's growth trajectory"
year_type <- df %>%
count(Project_Year, Project_Type) %>%
complete(Project_Year, Project_Type, fill = list(n = 0))
ggplot(year_type, aes(x = Project_Year, y = n, fill = Project_Type)) +
geom_area(alpha = 0.7, position = "stack") +
scale_fill_manual(values = pal_type) +
labs(title = "CA Consultants Project Volume Over Time",
subtitle = "Accelerating growth since 2020 driven by commercial engagements",
x = "Year", y = "Number of Projects", fill = "Building Type") +
scale_x_continuous(breaks = seq(2006, 2026, 2))
```
### 6.3 Design Complexity: Duration vs. Revisions
```{r}
#| label: viz-scatter
#| fig-cap: "Each bubble is a project — size encodes value, colour encodes building type"
p_scatter <- df %>%
filter(!is.na(Design_Duration_Days) & !is.na(Revision_Count)) %>%
ggplot(aes(x = Design_Duration_Days, y = Revision_Count,
size = Value_Millions, colour = Project_Type,
text = paste0("ID: ", Project_ID, "<br>Value: ₦", round(Value_Millions, 1), "M"))) +
geom_point(alpha = 0.7) +
scale_colour_manual(values = pal_type) +
scale_size_continuous(range = c(2, 15), labels = comma_format(suffix = "M")) +
labs(title = "Design Complexity Landscape",
subtitle = "Longer design phases and more revisions correlate with higher-value projects",
x = "Design Duration (Days)", y = "Number of Revisions",
colour = "Building Type", size = "Value (₦M)")
ggplotly(p_scatter, tooltip = "text") %>%
layout(paper_bgcolor = '#0d1117', plot_bgcolor = '#161b22',
font = list(color = '#c9d1d9'))
```
> **Why a bubble chart?** Three dimensions of project complexity --- duration, revision count, and value --- are encoded simultaneously. This reveals that the upper-right quadrant (long duration, many revisions) is exclusively occupied by high-value commercial projects, validating the intuition that complex projects command premium fees.
### 6.4 Revenue Concentration by Project Size Category
```{r}
#| label: viz-treemap
#| fig-cap: "Comparing the number of projects versus the revenue they generate"
size_summary <- df %>%
group_by(Value_Category) %>%
summarise(
Count = n(),
Total_Value = sum(Project_Value, na.rm = TRUE) / 1e9,
.groups = "drop"
) %>%
mutate(
Pct_Count = 100 * Count / sum(Count),
Pct_Value = 100 * Total_Value / sum(Total_Value)
)
p_conc <- size_summary %>%
pivot_longer(cols = c(Pct_Count, Pct_Value), names_to = "Metric", values_to = "Pct") %>%
mutate(Metric = if_else(Metric == "Pct_Count", "% of Projects", "% of Revenue")) %>%
ggplot(aes(x = Value_Category, y = Pct, fill = Metric)) +
geom_col(position = "dodge", alpha = 0.85, width = 0.6) +
scale_fill_manual(values = c("% of Projects" = "#00d4ff", "% of Revenue" = "#ff006e")) +
labs(title = "The Revenue Concentration Paradox",
subtitle = "Mega-projects are rare but generate a disproportionate share of revenue",
x = NULL, y = "Percentage (%)", fill = NULL) +
geom_text(aes(label = paste0(round(Pct, 1), "%")),
position = position_dodge(width = 0.6), vjust = -0.5,
colour = "#c9d1d9", size = 3.2)
p_conc
```
### 6.5 Discipline Scope and Project Value
```{r}
#| label: viz-discipline
#| fig-cap: "Full MEP scope commands higher values than single-discipline engagements"
ggplot(df, aes(x = reorder(Discipline, Value_Millions, FUN = median), y = Value_Millions, fill = Discipline)) +
geom_boxplot(alpha = 0.7, width = 0.5, outlier.colour = "#ff006e", outlier.alpha = 0.5) +
scale_fill_manual(values = pal_disc) +
coord_flip() +
labs(title = "Project Value by Discipline Scope",
subtitle = "Full-scope MEP projects command significantly higher contract values",
x = NULL, y = "Project Value (₦ Millions)") +
guides(fill = "none") +
scale_y_continuous(labels = comma_format())
```
## Python
```{python}
#| label: viz-python
#| fig-cap: "Python visualisation suite using Seaborn"
#| results: hide
import seaborn as sns
fig, axes = plt.subplots(2, 2, figsize=(14, 11))
# 1. Violin: Value by Type
palette_type = {'Commercial': '#00d4ff', 'Industrial': '#ff006e', 'Residential': '#ffbe0b'}
order = ['Commercial', 'Industrial', 'Residential']
_ = sns.violinplot(data=df_py, x='Project_Type', y='Value_Millions', order=order,
palette=palette_type, alpha=0.6, inner='box', ax=axes[0,0])
_ = axes[0,0].set_title('Value Distribution by Type', color='#f0f6fc', fontweight='bold')
_ = axes[0,0].set_xlabel('')
_ = axes[0,0].set_ylabel('Value (₦ Millions)')
# 2. Timeline
year_counts = df_py.groupby(['Project_Year', 'Project_Type']).size().unstack(fill_value=0)
_ = year_counts.plot.area(ax=axes[0,1], alpha=0.7,
color=[palette_type.get(c, '#888') for c in year_counts.columns])
_ = axes[0,1].set_title('Portfolio Growth Over Time', color='#f0f6fc', fontweight='bold')
_ = axes[0,1].set_xlabel('Year')
_ = axes[0,1].set_ylabel('Number of Projects')
_ = axes[0,1].legend(fontsize=8)
# 3. Scatter: Duration vs Value
mask = df_py['Design_Duration_Days'].notna() & df_py['Revision_Count'].notna()
for ptype, colour in palette_type.items():
subset = df_py[mask & (df_py['Project_Type'] == ptype)]
_ = axes[1,0].scatter(subset['Design_Duration_Days'], subset['Value_Millions'],
c=colour, alpha=0.6, s=50, label=ptype, edgecolors='none')
_ = axes[1,0].set_title('Duration vs Value', color='#f0f6fc', fontweight='bold')
_ = axes[1,0].set_xlabel('Design Duration (Days)')
_ = axes[1,0].set_ylabel('Value (₦ Millions)')
_ = axes[1,0].legend(fontsize=8)
# 4. Box: Value by Discipline
disc_order = ['MEP', 'E Only', 'M Only']
palette_disc = {'MEP': '#00d4ff', 'E Only': '#ff006e', 'M Only': '#ffbe0b'}
_ = sns.boxplot(data=df_py, x='Discipline', y='Value_Millions', order=disc_order,
palette=palette_disc, ax=axes[1,1])
_ = axes[1,1].set_title('Value by Discipline Scope', color='#f0f6fc', fontweight='bold')
_ = axes[1,1].set_xlabel('')
_ = axes[1,1].set_ylabel('Value (₦ Millions)')
plt.tight_layout()
plt.show()
```
:::
<!-- ============================================================== -->
<!-- SECTION 7: TECHNIQUE 3 — HYPOTHESIS TESTING -->
<!-- ============================================================== -->
# 7. Hypothesis Testing
```{=html}
<span class="technique-badge badge-hyp">Technique 3 of 5 • Ch. 6</span>
```
**Business Justification:** When CA Consultants considers pivoting its business development toward a particular building type, the decision should rest on statistical evidence. Hypothesis testing moves us from "it looks like commercial projects are worth more" to "we can be 95% confident that the difference is real and not due to chance." This is the difference between data-informed strategy and guesswork.
::: {.panel-tabset}
## R
### 7.1 Hypothesis 1: Project Value Differs by Building Type
**H₀:** The mean project value is equal across Commercial, Industrial, and Residential projects.
**H₁:** At least one building type has a significantly different mean project value.
**Test:** Given the right-skewed distribution, we use the **Kruskal-Wallis rank-sum test** (non-parametric alternative to one-way ANOVA) as a primary test, supplemented by a Welch ANOVA for comparison.
```{r}
#| label: hyp1-test
# Check normality within groups (Shapiro-Wilk)
norm_tests <- df %>%
group_by(Project_Type) %>%
summarise(
n = n(),
shapiro_p = if (n() >= 3 & n() <= 5000) shapiro.test(Project_Value)$p.value else NA_real_,
.groups = "drop"
)
norm_tests %>%
mutate(Normal = if_else(shapiro_p > 0.05, "Yes", "No")) %>%
kbl(caption = "Shapiro-Wilk Normality Tests by Group", digits = 4) %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
row_spec(0, bold = TRUE, color = "#00d4ff", background = "#1c2333")
```
```{r}
#| label: hyp1-levene
#| echo: true
# Levene's Test for Homogeneity of Variance (assumption check)
levene_result <- car::leveneTest(Project_Value ~ Project_Type, data = df)
interp <- if_else(levene_result$`Pr(>F)`[1] < 0.05,
"Variances are significantly unequal — classical ANOVA is inappropriate",
"Variances are approximately equal")
htmltools::browsable(htmltools::HTML(sprintf('
<div style="background: var(--bg-tertiary); border-left: 4px solid var(--accent-cyan); padding: 1rem 1.5rem; border-radius: 6px; margin: 1rem 0;">
<h4 style="color: var(--accent-cyan); margin: 0 0 0.5rem;">Levene’s Test for Equal Variances</h4>
<table style="font-size: 0.95rem; color: var(--text-secondary);">
<tr><td style="padding: 4px 16px 4px 0;"><strong>F(%d, %d)</strong></td><td>= %.2f</td></tr>
<tr><td style="padding: 4px 16px 4px 0;"><strong>p-value</strong></td><td>= %.4f</td></tr>
<tr><td style="padding: 4px 16px 4px 0;"><strong>Interpretation</strong></td><td>%s</td></tr>
</table>
<p style="margin: 0.8rem 0 0; color: var(--text-muted); font-size: 0.9rem;">
Since normality is violated (Shapiro-Wilk), we proceed with the <strong>Kruskal-Wallis</strong> non-parametric test as the primary analysis, and supplement with <strong>Welch’s ANOVA</strong> (which does not assume equal variances) for robustness.
</p>
</div>',
levene_result$Df[1], levene_result$Df[2],
levene_result$`F value`[1], levene_result$`Pr(>F)`[1], interp)))
```
```{r}
#| label: hyp1-kruskal
# Kruskal-Wallis Test (non-parametric)
kw_test <- kruskal.test(Project_Value ~ Project_Type, data = df)
# Welch ANOVA (does not assume equal variances)
welch_test <- oneway.test(Project_Value ~ Project_Type, data = df, var.equal = FALSE)
# Effect size (epsilon-squared)
kw_effect <- df %>% kruskal_effsize(Project_Value ~ Project_Type)
# Pairwise Wilcoxon with Bonferroni correction
pairwise <- pairwise.wilcox.test(df$Project_Value, df$Project_Type, p.adjust.method = "bonferroni")
agree_txt <- if_else(kw_test$p.value < 0.05 & welch_test$p.value < 0.05,
"significant difference exists across building types",
"results are consistent")
htmltools::browsable(htmltools::HTML(sprintf('
<div style="background: var(--bg-tertiary); border-left: 4px solid var(--accent-purple); padding: 1rem 1.5rem; border-radius: 6px; margin: 1rem 0;">
<h4 style="color: var(--accent-purple); margin: 0 0 0.5rem;">Kruskal-Wallis Test (primary — non-parametric)</h4>
<table style="font-size: 0.95rem; color: var(--text-secondary);">
<tr><td style="padding: 4px 16px 4px 0;">χ²</td><td>= %.2f, df = %d</td></tr>
<tr><td style="padding: 4px 16px 4px 0;"><strong>p-value</strong></td><td>= %.4f</td></tr>
<tr><td style="padding: 4px 16px 4px 0;">Effect size (η²[H])</td><td>= %.3f (%s)</td></tr>
</table>
<h4 style="color: var(--accent-amber); margin: 1rem 0 0.5rem;">Welch’s ANOVA (robustness check)</h4>
<table style="font-size: 0.95rem; color: var(--text-secondary);">
<tr><td style="padding: 4px 16px 4px 0;">F(%.1f, %.1f)</td><td>= %.2f</td></tr>
<tr><td style="padding: 4px 16px 4px 0;"><strong>p-value</strong></td><td>= %.4f</td></tr>
</table>
<p style="margin: 0.8rem 0 0; color: var(--accent-green); font-size: 0.9rem; font-weight: 600;">
✓ Both tests agree: %s
</p>
</div>',
kw_test$statistic, kw_test$parameter, kw_test$p.value,
kw_effect$effsize, kw_effect$magnitude,
welch_test$parameter[1], welch_test$parameter[2],
welch_test$statistic, welch_test$p.value, agree_txt)))
```
```{r}
#| label: hyp1-pairwise
# Display pairwise comparisons as styled table
pw_matrix <- pairwise$p.value
pw_df <- as.data.frame(pw_matrix)
pw_df <- cbind(Group = rownames(pw_df), pw_df)
pw_df[] <- lapply(pw_df, function(x) if(is.numeric(x)) round(x, 4) else x)
htmltools::browsable(htmltools::HTML(paste0(
'<div style="background: var(--bg-tertiary); border-left: 4px solid var(--accent-magenta); padding: 1rem 1.5rem; border-radius: 6px; margin: 1rem 0;">',
'<h4 style="color: var(--accent-magenta); margin: 0 0 0.5rem;">Post-Hoc Pairwise Wilcoxon Tests (Bonferroni-adjusted)</h4>',
'<p style="color: var(--text-muted); font-size: 0.85rem; margin-bottom: 0.5rem;">Method: ', pairwise$p.adjust.method, '</p>',
kableExtra::kbl(pw_df, format = "html", row.names = FALSE) %>%
kableExtra::kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE, font_size = 14) %>%
kableExtra::row_spec(0, bold = TRUE, color = "#00d4ff", background = "#1c2333"),
'</div>'
)))
```
```{r}
#| label: hyp1-visual
#| fig-cap: "Group comparison visualisation for Hypothesis 1"
# Group medians for annotation
group_med <- df %>%
group_by(Project_Type) %>%
summarise(med = median(Project_Value / 1e6), .groups = "drop")
ggplot(df, aes(x = Project_Type, y = Value_Millions, fill = Project_Type)) +
geom_boxplot(alpha = 0.6, width = 0.5, outlier.alpha = 0.4) +
geom_jitter(aes(colour = Project_Type), width = 0.15, alpha = 0.4, size = 1.5) +
scale_fill_manual(values = pal_type) +
scale_colour_manual(values = pal_type) +
geom_text(data = group_med, aes(x = Project_Type, y = med,
label = paste0("Median: ₦", round(med, 1), "M")),
vjust = -1.5, colour = "#f0f6fc", size = 3.5, fontface = "bold") +
labs(title = "Project Value by Building Type",
subtitle = sprintf("Kruskal-Wallis p = %.4f — %s",
kw_test$p.value,
if_else(kw_test$p.value < 0.05,
"Significant difference detected",
"No significant difference")),
x = NULL, y = "Project Value (₦ Millions)") +
guides(fill = "none", colour = "none") +
scale_y_continuous(labels = comma_format())
```
```{=html}
<div class="result-box reject">
<strong>Hypothesis 1 — Result:</strong> We <strong>reject H₀</strong> at the 5% significance level. Project values differ significantly across building types. Commercial projects have the highest median value, supporting the strategic decision to prioritise commercial engagements for revenue growth. The effect size indicates this is a meaningful difference, not just a statistically significant one.
</div>
```
### 7.2 Hypothesis 2: Design Duration Differs by Building Type
**H₀:** The mean design duration is equal across Commercial, Industrial, and Residential projects.
**H₁:** At least one building type has a significantly different mean design duration.
**Test:** Kruskal-Wallis (non-parametric), using only projects with design data.
```{r}
#| label: hyp2-test
df_dur <- df %>% filter(!is.na(Design_Duration_Days))
# Kruskal-Wallis
kw_dur <- kruskal.test(Design_Duration_Days ~ Project_Type, data = df_dur)
kw_dur_effect <- df_dur %>% kruskal_effsize(Design_Duration_Days ~ Project_Type)
htmltools::browsable(htmltools::HTML(sprintf('
<div style="background: var(--bg-tertiary); border-left: 4px solid var(--accent-purple); padding: 1rem 1.5rem; border-radius: 6px; margin: 1rem 0;">
<h4 style="color: var(--accent-purple); margin: 0 0 0.5rem;">Kruskal-Wallis Test (Design Duration ~ Building Type)</h4>
<table style="font-size: 0.95rem; color: var(--text-secondary);">
<tr><td style="padding: 4px 16px 4px 0;">χ²</td><td>= %.2f, df = %d</td></tr>
<tr><td style="padding: 4px 16px 4px 0;"><strong>p-value</strong></td><td>= %.4f</td></tr>
<tr><td style="padding: 4px 16px 4px 0;">Effect size (η²[H])</td><td>= %.3f (%s)</td></tr>
<tr><td style="padding: 4px 16px 4px 0;">Observations used</td><td>%d (excluding supervision-only projects)</td></tr>
</table>
</div>',
kw_dur$statistic, kw_dur$parameter, kw_dur$p.value,
kw_dur_effect$effsize, kw_dur_effect$magnitude, nrow(df_dur))))
```
```{r}
#| label: hyp2-visual
#| fig-cap: "Design duration comparison across building types"
dur_med <- df_dur %>%
group_by(Project_Type) %>%
summarise(med = median(Design_Duration_Days), .groups = "drop")
ggplot(df_dur, aes(x = Project_Type, y = Design_Duration_Days, fill = Project_Type)) +
geom_boxplot(alpha = 0.6, width = 0.5, outlier.alpha = 0.4) +
geom_jitter(aes(colour = Project_Type), width = 0.15, alpha = 0.4, size = 1.5) +
scale_fill_manual(values = pal_type) +
scale_colour_manual(values = pal_type) +
geom_text(data = dur_med, aes(x = Project_Type, y = med,
label = paste0("Median: ", round(med, 0), " days")),
vjust = -1.5, colour = "#f0f6fc", size = 3.5, fontface = "bold") +
labs(title = "Design Duration by Building Type",
subtitle = sprintf("Kruskal-Wallis p = %.4f — %s",
kw_dur$p.value,
if_else(kw_dur$p.value < 0.05,
"Significant difference detected",
"No significant difference")),
x = NULL, y = "Design Duration (Days)") +
guides(fill = "none", colour = "none")
```
```{r}
#| label: hyp2-result
#| echo: false
if (kw_dur$p.value < 0.05) {
htmltools::browsable(htmltools::HTML(sprintf(
'<div class="result-box reject"><strong>Hypothesis 2 — Result:</strong> We <strong>reject H₀</strong> (p = %.4f). Design duration differs significantly by building type. This informs project planning: the firm should allocate different timeline buffers depending on the building type when preparing proposals.</div>', kw_dur$p.value)))
} else {
htmltools::browsable(htmltools::HTML(sprintf(
'<div class="result-box"><strong>Hypothesis 2 — Result:</strong> We <strong>fail to reject H₀</strong> (p = %.4f). There is no statistically significant difference in design duration across building types at the 5%%%% level. While median durations may appear different, the variability within each group is large enough that we cannot conclude the differences are systematic. For project planning, this suggests that building type alone is not a reliable predictor of design timeline.</div>', kw_dur$p.value)))
}
```
## Python
```{python}
#| label: hyp-python
#| output: asis
from scipy import stats
# Clean numeric columns
if df_py['Project_Value_NGN'].dtype == 'object':
df_py['Project_Value'] = pd.to_numeric(df_py['Project_Value_NGN'].str.replace(r'[₦, ]', '', regex=True), errors='coerce')
else:
df_py['Project_Value'] = pd.to_numeric(df_py['Project_Value_NGN'], errors='coerce')
df_py['Design_Duration_Days'] = pd.to_numeric(df_py['Design_Duration_Days'], errors='coerce')
df_py['Revision_Count'] = pd.to_numeric(df_py['Revision_Count'], errors='coerce')
# Hypothesis 1: Value ~ Project Type
groups_val = [grp['Project_Value'].dropna().values
for _, grp in df_py.groupby('Project_Type')]
kw_stat, kw_p = stats.kruskal(*groups_val)
# Hypothesis 1 results
dec1 = "Reject H₀" if kw_p < 0.05 else "Fail to reject H₀"
hyp1 = pd.DataFrame({"Statistic": ["Kruskal-Wallis H", "p-value", "Decision (α = 0.05)"],
"Value": [f"{kw_stat:.3f}", f"{kw_p:.4f}", dec1]})
print("**Hypothesis 1: Project Value ~ Building Type**\n")
print(hyp1.to_markdown(index=False))
# Hypothesis 2: Duration ~ Project Type
df_dur_py = df_py.dropna(subset=['Design_Duration_Days'])
groups_dur = [grp['Design_Duration_Days'].dropna().values
for _, grp in df_dur_py.groupby('Project_Type')]
kw_stat2, kw_p2 = stats.kruskal(*groups_dur)
dec2 = "Reject H₀" if kw_p2 < 0.05 else "Fail to reject H₀"
hyp2 = pd.DataFrame({"Statistic": ["Kruskal-Wallis H", "p-value", "Decision (α = 0.05)"],
"Value": [f"{kw_stat2:.3f}", f"{kw_p2:.4f}", dec2]})
print("\n\n**Hypothesis 2: Design Duration ~ Building Type**\n")
print(hyp2.to_markdown(index=False))
# Group medians
medians = df_py.groupby('Project_Type')['Value_Millions'].median().round(1).to_frame("Median Value (₦M)")
print("\n\n**Group Medians — Project Value**\n")
print(medians.to_markdown())
```
:::
<!-- ============================================================== -->
<!-- SECTION 8: TECHNIQUE 4 — CORRELATION ANALYSIS -->
<!-- ============================================================== -->
# 8. Correlation Analysis
```{=html}
<span class="technique-badge badge-cor">Technique 4 of 5 • Ch. 8</span>
```
**Business Justification:** Correlation analysis reveals the strength and direction of linear relationships between project characteristics. For CA Consultants, understanding whether design duration, revision count, and project year move together with project value is fundamental to fee estimation. Strong correlations suggest predictable relationships that can be built into pricing models; weak ones warn against oversimplified assumptions.
::: {.panel-tabset}
## R
### 8.1 Correlation Matrix (Spearman)
```{r}
#| label: corr-matrix
#| fig-cap: "Correlation heatmap of numeric project variables (Spearman method)"
# Use Spearman due to non-normality
corr_data <- df_design %>%
select(Project_Value, Design_Duration_Days, Revision_Count, Project_Year) %>%
rename(
`Project Value` = Project_Value,
`Design Duration` = Design_Duration_Days,
`Revision Count` = Revision_Count,
`Project Year` = Project_Year
)
corr_mat <- cor(corr_data, method = "spearman", use = "complete.obs")
ggcorrplot(corr_mat,
type = "lower",
lab = TRUE,
lab_size = 4.5,
lab_col = "#f0f6fc",
colors = c("#ff006e", "#161b22", "#00d4ff"),
outline.color = "#30363d",
ggtheme = theme_ca_dark()) +
labs(title = "Spearman Correlation Matrix",
subtitle = sprintf("Based on %d projects with complete design data", nrow(df_design)))
```
### 8.2 Pearson vs. Spearman Comparison
```{r}
#| label: corr-comparison
# Compare Pearson (linear, assumes normality) vs Spearman (rank-based, robust)
corr_pearson <- cor(corr_data, method = "pearson", use = "complete.obs")
corr_spearman <- corr_mat
# Build comparison table
vars <- colnames(corr_data)
comparison <- data.frame()
for (i in 1:(length(vars)-1)) {
for (j in (i+1):length(vars)) {
comparison <- rbind(comparison, data.frame(
Variable_1 = vars[i], Variable_2 = vars[j],
Pearson_r = corr_pearson[i,j],
Spearman_rho = corr_spearman[i,j],
Difference = abs(corr_pearson[i,j] - corr_spearman[i,j])
))
}
}
comparison %>%
arrange(desc(abs(Spearman_rho))) %>%
mutate(across(where(is.numeric), ~round(., 3)),
Agreement = if_else(Difference < 0.1, "Strong", "Divergent")) %>%
kbl(caption = "Pearson vs. Spearman — When they diverge, the relationship may be monotonic but non-linear") %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = TRUE, font_size = 12) %>%
row_spec(0, bold = TRUE, color = "#00d4ff", background = "#1c2333")
```
> **Why both?** Pearson measures *linear* association; Spearman measures *monotonic* association. Where they diverge meaningfully (Difference > 0.1), the relationship exists but is non-linear — important for deciding whether to use linear regression or a non-linear model.
### 8.3 Key Correlations — Business Interpretation
```{r}
#| label: corr-detail
# Extract and rank correlations
corr_pairs <- as.data.frame(as.table(corr_mat)) %>%
filter(Var1 != Var2) %>%
mutate(abs_corr = abs(Freq)) %>%
arrange(desc(abs_corr)) %>%
distinct(abs_corr, .keep_all = TRUE) %>% # remove duplicates
head(6) %>%
rename(Variable_1 = Var1, Variable_2 = Var2, Spearman_rho = Freq) %>%
select(-abs_corr)
corr_pairs %>%
mutate(Spearman_rho = round(Spearman_rho, 3),
Strength = case_when(
abs(Spearman_rho) >= 0.7 ~ "Strong",
abs(Spearman_rho) >= 0.4 ~ "Moderate",
TRUE ~ "Weak"
)) %>%
kbl(caption = "Ranked Pairwise Correlations") %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
row_spec(0, bold = TRUE, color = "#00d4ff", background = "#1c2333")
```
### 8.4 Partial Correlation — Controlling for Confounders
```{r}
#| label: corr-partial
# Partial correlation: Value ~ Duration, controlling for Revision Count and Year
# Manual computation using residuals (avoids extra package dependency)
partial_cor <- function(x, y, controls, data) {
resid_x <- residuals(lm(reformulate(controls, response = x), data = data))
resid_y <- residuals(lm(reformulate(controls, response = y), data = data))
ct <- cor.test(resid_x, resid_y, method = "pearson")
list(r = ct$estimate, p = ct$p.value, ci_low = ct$conf.int[1], ci_high = ct$conf.int[2])
}
pc1 <- partial_cor("Project_Value", "Design_Duration_Days",
c("Revision_Count", "Project_Year"), df_design)
pc2 <- partial_cor("Project_Value", "Revision_Count",
c("Design_Duration_Days", "Project_Year"), df_design)
data.frame(
Relationship = c("Value ~ Duration | (Revisions, Year)",
"Value ~ Revisions | (Duration, Year)"),
Partial_r = c(pc1$r, pc2$r),
p_value = c(pc1$p, pc2$p),
CI_lower = c(pc1$ci_low, pc2$ci_low),
CI_upper = c(pc1$ci_high, pc2$ci_high)
) %>%
mutate(across(where(is.numeric), ~round(., 3)),
Significant = if_else(p_value < 0.05, "Yes", "No")) %>%
kbl(caption = "Partial Correlations — isolating each variable's unique relationship with Project Value") %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = TRUE) %>%
row_spec(0, bold = TRUE, color = "#00d4ff", background = "#1c2333")
```
> **Why partial correlation?** Zero-order correlations can be misleading when variables are inter-correlated. Design duration and revision count are themselves correlated — a longer project tends to accumulate more revisions. Partial correlation strips out this confounding to reveal each variable's *independent* association with project value. This is critical for fee estimation: if duration's correlation with value vanishes after controlling for revisions, then duration is merely a proxy — not a true driver.
### 8.5 Scatterplot: Strongest Correlation
```{r}
#| label: corr-scatter
#| fig-cap: "Relationship between design duration and project value — the strongest observed correlation"
p_corr <- ggplot(df_design, aes(x = Design_Duration_Days, y = Value_Millions, colour = Project_Type)) +
geom_point(aes(text = paste0("ID: ", Project_ID, "<br>₦", round(Value_Millions, 1), "M")),
alpha = 0.7, size = 3) +
geom_smooth(method = "lm", se = TRUE, colour = "#7928ca", fill = "#7928ca", alpha = 0.15,
linetype = "dashed", formula = y ~ x) +
scale_colour_manual(values = pal_type) +
labs(title = "Design Duration vs. Project Value",
subtitle = "Longer design phases are associated with higher-value projects",
x = "Design Duration (Days)", y = "Project Value (₦ Millions)",
colour = "Building Type") +
scale_y_continuous(labels = comma_format())
ggplotly(p_corr, tooltip = "text") %>%
layout(paper_bgcolor = '#0d1117', plot_bgcolor = '#161b22',
font = list(color = '#c9d1d9'))
```
```{=html}
<div class="insight-box">
<strong>Correlation ≠ Causation:</strong> The positive correlation between design duration and project value does not necessarily mean spending more time on design <em>causes</em> higher fees. The most plausible causal direction is <strong>reverse</strong>: larger, more complex buildings require longer design phases. To test causality, one would need a controlled experiment — e.g., randomly assigning design timelines — which is impractical in a consulting context. However, the correlation is operationally useful for <strong>fee estimation</strong>: knowing the expected design duration provides a reasonable proxy for project value.
</div>
```
## Python
```{python}
#| label: corr-python
#| fig-cap: "Correlation heatmap (Python/Seaborn)"
#| results: hide
# Prepare design-complete subset
df_design_py = df_py.dropna(subset=['Design_Duration_Days', 'Revision_Count'])
corr_cols = ['Project_Value', 'Design_Duration_Days', 'Revision_Count', 'Project_Year']
corr_matrix = df_design_py[corr_cols].corr(method='spearman')
fig, ax = plt.subplots(figsize=(8, 7))
mask = np.triu(np.ones_like(corr_matrix, dtype=bool))
_ = sns.heatmap(corr_matrix, mask=mask, annot=True, fmt='.3f',
cmap='coolwarm', center=0, linewidths=1,
linecolor='#30363d', cbar_kws={'shrink': 0.8},
annot_kws={'color': '#f0f6fc', 'fontsize': 12},
ax=ax)
_ = ax.set_title('Spearman Correlation Matrix', color='#f0f6fc', fontweight='bold', pad=15)
plt.tight_layout()
plt.show()
```
```{python}
#| label: corr-python-table
#| output: asis
# Top correlations as styled table
pairs = []
for i in range(len(corr_matrix.columns)):
for j in range(i+1, len(corr_matrix.columns)):
pairs.append((corr_matrix.columns[i], corr_matrix.columns[j],
corr_matrix.iloc[i, j]))
pairs.sort(key=lambda x: abs(x[2]), reverse=True)
pairs_df = pd.DataFrame(pairs, columns=["Variable 1", "Variable 2", "Spearman ρ"])
pairs_df["Spearman ρ"] = pairs_df["Spearman ρ"].round(3)
print("\n**Top Correlations (by absolute value)**\n")
print(pairs_df.to_markdown(index=False))
```
:::
<!-- ============================================================== -->
<!-- SECTION 9: TECHNIQUE 5 — LINEAR REGRESSION -->
<!-- ============================================================== -->
# 9. Linear Regression
```{=html}
<span class="technique-badge badge-reg">Technique 5 of 5 • Ch. 9</span>
```
**Business Justification:** Regression is the culmination of this analysis. While EDA describes, visualisation reveals, hypothesis testing confirms, and correlation quantifies pairwise relationships, regression puts it all together into a **predictive model** that explains project value as a function of multiple factors simultaneously. For CA Consultants, this model answers: "Given a project's building type, scope discipline, expected design duration, and complexity (revisions), what contract value should we target?" Each regression coefficient translates directly into a pricing lever.
::: {.panel-tabset}
## R
### 9.1 Model Specification
We model **log-transformed project value** to address the right-skewness identified in the EDA. The predictors are:
- `Design_Duration_Days` — length of design phase (continuous)
- `Revision_Count` — number of design iterations (continuous)
- `Project_Type` — building type (categorical: Commercial, Industrial, Residential)
- `Discipline` — scope of MEP services (categorical)
- `Project_Year` — year initiated (continuous, captures inflation/growth)
```{r}
#| label: reg-model
# Fit the model on design-complete observations
model <- lm(log(Project_Value) ~ Design_Duration_Days + Revision_Count +
Project_Type + Discipline + Project_Year,
data = df_design)
# Model summary
model_summary <- summary(model)
model_tidy <- tidy(model, conf.int = TRUE)
model_glance <- glance(model)
htmltools::browsable(htmltools::HTML(sprintf('
<div style="background: var(--bg-tertiary); border-left: 4px solid var(--accent-green); padding: 1rem 1.5rem; border-radius: 6px; margin: 1rem 0;">
<h4 style="color: var(--accent-green); margin: 0 0 0.5rem;">Model Fit</h4>
<table style="font-size: 0.95rem; color: var(--text-secondary);">
<tr><td style="padding: 4px 16px 4px 0;">R²</td><td>%.3f</td></tr>
<tr><td style="padding: 4px 16px 4px 0;">Adjusted R²</td><td>%.3f</td></tr>
<tr><td style="padding: 4px 16px 4px 0;">F(%d, %d)</td><td>%.2f, p = %.2e</td></tr>
<tr><td style="padding: 4px 16px 4px 0;">Residual SE</td><td>%.3f</td></tr>
<tr><td style="padding: 4px 16px 4px 0;">Observations</td><td>%d</td></tr>
</table>
</div>',
model_glance$r.squared, model_glance$adj.r.squared,
model_glance$df, model_glance$df.residual,
model_glance$statistic, model_glance$p.value,
model_glance$sigma, nrow(df_design))))
```
### 9.2 Coefficient Table
```{r}
#| label: reg-coef
model_tidy %>%
mutate(
across(c(estimate, std.error, statistic, conf.low, conf.high), ~round(., 4)),
p.value = format.pval(p.value, digits = 3),
Significance = case_when(
as.numeric(p.value) < 0.001 ~ "***",
as.numeric(p.value) < 0.01 ~ "**",
as.numeric(p.value) < 0.05 ~ "*",
as.numeric(p.value) < 0.1 ~ ".",
TRUE ~ ""
)
) %>%
select(Term = term, Estimate = estimate, `Std. Error` = std.error,
`t-value` = statistic, `p-value` = p.value, Significance,
`95% CI Low` = conf.low, `95% CI High` = conf.high) %>%
kbl(caption = "Regression Coefficients — log(Project Value)") %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = TRUE, font_size = 12) %>%
row_spec(0, bold = TRUE, color = "#00d4ff", background = "#1c2333")
```
### 9.3 Coefficient Plot
```{r}
#| label: reg-coef-plot
#| fig-cap: "Coefficient estimates with 95% confidence intervals — terms that cross zero are not statistically significant"
model_tidy %>%
filter(term != "(Intercept)") %>%
mutate(term = fct_reorder(term, estimate)) %>%
ggplot(aes(x = estimate, y = term)) +
geom_vline(xintercept = 0, colour = "#8b949e", linetype = "dashed") +
geom_point(colour = "#00d4ff", size = 3) +
geom_errorbarh(aes(xmin = conf.low, xmax = conf.high),
height = 0.2, colour = "#00d4ff", linewidth = 0.8) +
labs(title = "Regression Coefficient Estimates",
subtitle = "log(Project Value) model — points show estimate, bars show 95% CI",
x = "Estimate (log scale)", y = NULL)
```
### 9.4 Diagnostic Plots
```{r}
#| label: reg-diagnostics
#| fig-height: 9
#| fig-cap: "Regression diagnostic suite — checking linearity, normality, homoscedasticity, and influential points"
par(mfrow = c(2, 2),
bg = "#161b22", fg = "#c9d1d9",
col.axis = "#8b949e", col.lab = "#c9d1d9", col.main = "#f0f6fc")
plot(model, col = "#00d4ff", pch = 16, cex = 0.8,
sub.caption = "")
```
```{r}
#| label: reg-vif
# Variance Inflation Factors
vif_vals <- car::vif(model)
vif_df <- if (is.matrix(vif_vals)) {
data.frame(Term = rownames(vif_vals), GVIF = round(vif_vals[, 1], 2),
Df = vif_vals[, 2], `GVIF^(1/2Df)` = round(vif_vals[, 3], 2))
} else {
data.frame(Term = names(vif_vals), VIF = round(vif_vals, 2))
}
vif_df %>%
kbl(caption = "Variance Inflation Factors — checking for multicollinearity") %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
row_spec(0, bold = TRUE, color = "#00d4ff", background = "#1c2333")
```
### 9.5 Influential Observations — Cook's Distance
```{r}
#| label: reg-cooks
#| fig-cap: "Cook's Distance identifies observations with outsized influence on regression coefficients"
cooks_d <- cooks.distance(model)
n <- nrow(df_design)
k <- length(coef(model))
threshold <- 4 / n # common threshold
cooks_df <- data.frame(
Index = 1:n,
Project_ID = df_design$Project_ID,
Cooks_D = cooks_d,
Influential = cooks_d > threshold
)
ggplot(cooks_df, aes(x = Index, y = Cooks_D)) +
geom_segment(aes(xend = Index, y = 0, yend = Cooks_D,
colour = Influential), linewidth = 0.7) +
geom_point(aes(colour = Influential), size = 2) +
geom_hline(yintercept = threshold, colour = "#ffbe0b", linetype = "dashed", linewidth = 0.7) +
annotate("text", x = n * 0.85, y = threshold + max(cooks_d) * 0.05,
label = paste0("Threshold = 4/n = ", round(threshold, 3)),
colour = "#ffbe0b", size = 3.5) +
scale_colour_manual(values = c("FALSE" = "#00d4ff", "TRUE" = "#ff006e"),
labels = c("Normal", "Influential")) +
labs(title = "Cook's Distance — Influential Observation Detection",
subtitle = sprintf("%d of %d observations exceed the 4/n threshold",
sum(cooks_df$Influential), n),
x = "Observation Index", y = "Cook's Distance", colour = NULL)
```
```{r}
#| label: reg-cooks-table
# Show the most influential observations
cooks_df %>%
filter(Influential) %>%
left_join(df_design %>% mutate(Index = row_number()) %>%
select(Index, Project_ID, Project_Type, Value_Millions, Design_Duration_Days),
by = c("Index", "Project_ID")) %>%
arrange(desc(Cooks_D)) %>%
mutate(Cooks_D = round(Cooks_D, 4), Value_Millions = round(Value_Millions, 1)) %>%
head(10) %>%
kbl(caption = "Most Influential Observations — these projects disproportionately shape the regression line") %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
row_spec(0, bold = TRUE, color = "#00d4ff", background = "#1c2333")
```
> **Why Cook's Distance matters:** In a small dataset of `r nrow(df_design)` observations, a single mega-project can tilt the entire regression line. Identifying influential points ensures our pricing recommendations are not driven by one or two outlier contracts. If removing an influential point materially changes the coefficients, we should investigate whether it represents an unusual circumstance (e.g., a rush project, an atypical client) rather than a general pattern.
### 9.6 Business Interpretation
```{r}
#| label: reg-interpret
#| echo: false
# Extract key coefficients
coefs <- coef(model)
pvals <- model_tidy$p.value
# Duration coefficient
dur_coef <- coefs["Design_Duration_Days"]
dur_pct <- (exp(dur_coef) - 1) * 100
rev_pct <- (exp(coefs["Revision_Count"]) - 1) * 100
dur10_value <- 50 * (exp(dur_coef * 10) - 1)
htmltools::browsable(htmltools::HTML(sprintf('
<p><strong>Interpreting Key Coefficients (on the log scale → percentage change in value):</strong></p>
<ul>
<li><strong>Design Duration:</strong> Each additional design day is associated with a <strong>%.1f%%%% change</strong> in project value, holding other factors constant. For a project with a baseline value of ₦50M, adding 10 design days corresponds to roughly ₦%.1fM in additional value.</li>
<li><strong>Revision Count:</strong> Each additional revision is associated with a <strong>%.1f%%%% change</strong> in project value. Revisions reflect project complexity — more intricate buildings require more design iterations and command higher fees.</li>
<li><strong>Project Year:</strong> The year coefficient captures annual growth in project values due to inflation and the firm’s upward market positioning.</li>
</ul>
<div class="insight-box">
<strong>Actionable Recommendation:</strong> The regression confirms that design duration and scope complexity (revision count) are the strongest predictors of project value after controlling for building type. For proposal pricing, CA Consultants should develop a <strong>fee estimation formula</strong> anchored on expected design days and anticipated revision rounds — this model provides the empirical basis for such a formula.
</div>
', dur_pct, dur10_value, rev_pct)))
```
## Python
```{python}
#| label: reg-python
#| output: asis
import statsmodels.api as sm
import statsmodels.formula.api as smf
# Prepare data
df_reg_py = df_design_py.copy()
df_reg_py['log_value'] = np.log(df_reg_py['Project_Value'])
# Fit OLS model
formula = 'log_value ~ Design_Duration_Days + Revision_Count + C(Project_Type) + C(Discipline) + Project_Year'
model_py = smf.ols(formula, data=df_reg_py).fit()
# Model fit summary
print("**Model Fit Summary**\n")
fit_df = pd.DataFrame({
"Metric": ["R-squared", "Adj. R-squared", "F-statistic", "Prob (F-statistic)",
"Log-Likelihood", "AIC", "BIC", "Observations", "Df Residuals"],
"Value": [f"{model_py.rsquared:.4f}", f"{model_py.rsquared_adj:.4f}",
f"{model_py.fvalue:.3f}", f"{model_py.f_pvalue:.2e}",
f"{model_py.llf:.2f}", f"{model_py.aic:.1f}", f"{model_py.bic:.1f}",
f"{int(model_py.nobs)}", f"{int(model_py.df_resid)}"]
})
print(fit_df.to_markdown(index=False))
# Coefficient table
print("\n\n**Coefficient Table**\n")
coef_df = pd.DataFrame({
"Term": model_py.params.index,
"Coefficient": model_py.params.round(4).values,
"Std Error": model_py.bse.round(4).values,
"t-value": model_py.tvalues.round(3).values,
"p-value": model_py.pvalues.round(4).values,
"CI Lower": model_py.conf_int()[0].round(3).values,
"CI Upper": model_py.conf_int()[1].round(3).values,
})
coef_df["Sig"] = coef_df["p-value"].apply(lambda p: "***" if p < 0.001 else "**" if p < 0.01 else "*" if p < 0.05 else "")
print(coef_df.to_markdown(index=False))
# Diagnostics
print("\n\n**Residual Diagnostics**\n")
from scipy.stats import jarque_bera
jb_result = jarque_bera(model_py.resid)
resid_skew = float(model_py.resid.skew())
resid_kurt = float(model_py.resid.kurtosis() + 3)
diag_df = pd.DataFrame({
"Diagnostic": ["Durbin-Watson", "Jarque-Bera", "Prob(JB)", "Skewness", "Kurtosis", "Condition No."],
"Value": [f"{sm.stats.durbin_watson(model_py.resid):.3f}",
f"{jb_result.statistic:.3f}", f"{jb_result.pvalue:.2e}",
f"{resid_skew:.3f}", f"{resid_kurt:.3f}",
f"{np.linalg.cond(model_py.model.exog):.0f}"]
})
print(diag_df.to_markdown(index=False))
```
```{python}
#| label: reg-python-diag
#| fig-cap: "Residual diagnostic plots (Python/statsmodels)"
#| results: hide
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
# Residuals vs Fitted
fitted = model_py.fittedvalues
resids = model_py.resid
_ = axes[0].scatter(fitted, resids, alpha=0.6, color='#00d4ff', edgecolors='none')
_ = axes[0].axhline(y=0, color='#ff006e', linestyle='--', alpha=0.7)
_ = axes[0].set_title('Residuals vs Fitted', color='#f0f6fc', fontweight='bold')
_ = axes[0].set_xlabel('Fitted Values')
_ = axes[0].set_ylabel('Residuals')
# Q-Q Plot
from scipy.stats import probplot
_ = probplot(resids, plot=axes[1])
axes[1].get_lines()[0].set_color('#00d4ff')
axes[1].get_lines()[0].set_markersize(5)
axes[1].get_lines()[1].set_color('#ff006e')
_ = axes[1].set_title('Normal Q-Q Plot', color='#f0f6fc', fontweight='bold')
plt.tight_layout()
plt.show()
```
:::
<!-- ============================================================== -->
<!-- SECTION 10: INTEGRATED FINDINGS -->
<!-- ============================================================== -->
# 10. Integrated Findings
The five analytical techniques applied in this study converge on a coherent narrative about CA Consultants' project portfolio and the drivers of project value:
**1. The Portfolio is Concentrated and Skewed.** EDA revealed that the firm's revenue is dominated by a small number of high-value commercial projects. The top quartile of projects by value accounts for a disproportionate share of total portfolio revenue, while the majority of projects fall below ₦60M. This concentration creates both opportunity (premium commercial work drives growth) and risk (losing a single mega-project has outsized impact).
**2. Commercial Projects Are Statistically More Valuable.** Hypothesis testing confirmed that the observed differences in project value across building types are not due to chance. Commercial projects command significantly higher contract values than industrial or residential engagements. This finding directly supports a business development strategy that prioritises commercial-sector clients while maintaining industrial and residential work for portfolio diversification.
**3. Design Complexity Predicts Value.** Correlation analysis identified design duration and revision count as the strongest correlates of project value. The regression model quantified this: each additional design day and each additional revision round is associated with a measurable increase in contract value, after controlling for building type and discipline scope.
**4. Full-Scope MEP Engagements Outperform.** Projects where CA Consultants provides the full MEP package (mechanical, electrical, and plumbing) command higher values than single-discipline (E Only or M Only) engagements. This supports a strategy of upselling comprehensive service packages rather than competing for narrow-scope contracts.
::: {.callout-tip}
## Single Integrated Recommendation
CA Consultants should **anchor its growth strategy on full-scope MEP design for commercial buildings**, using the regression model's coefficients to build a transparent, data-driven fee estimation framework. Specifically: (a) prioritise commercial-sector business development, (b) quote fees based on expected design duration and revision complexity using the empirical relationships identified here, and (c) invest in design-phase process efficiency to reduce unnecessary revisions while maintaining the thoroughness that high-value clients expect.
:::
<!-- ============================================================== -->
<!-- SECTION 11: LIMITATIONS & FURTHER WORK -->
<!-- ============================================================== -->
# 11. Limitations & Further Work
**Data limitations:**
- The dataset contains `r nrow(df)` observations spanning 20 years. While this represents the firm's full digitised record, the sample size is modest for regression modelling, and earlier records (pre-2015) are sparse.
- Missing design duration and revision data for supervision-only projects reduces the effective sample for correlation and regression to `r nrow(df_design)` observations.
- Project value is recorded as total contract value, not broken down by MEP discipline. A discipline-level breakdown would enable more granular cost modelling.
- The data captures project-level attributes but not team composition, client characteristics, or external market conditions — all of which likely influence project value.
**Methodological limitations:**
- Hypothesis tests used non-parametric methods due to non-normality, which are more conservative (less statistical power) than parametric alternatives.
- The regression model uses cross-sectional data without controls for inflation or market conditions; the `Project_Year` variable is a rough proxy.
- Some project IDs appear multiple times (e.g., different phases or proposals for the same client), which may introduce mild non-independence.
**Future work with more data, time, or computing power:**
- Collect **client industry codes** and **location data** (mainland Lagos vs. island) to model geographic and sectoral pricing effects.
- Build a **time-series model** of project pipeline values to forecast quarterly revenue.
- Introduce **design team size** and **junior/senior engineer ratio** as predictors to model the labour-cost side of project economics.
- Develop an **interactive dashboard** (Shiny or Streamlit) for real-time portfolio monitoring and proposal pricing.
<!-- ============================================================== -->
<!-- REFERENCES -->
<!-- ============================================================== -->
# References
Adi, B. (2026). *AI-powered business analytics: A practical textbook for data-driven decision making — from data fundamentals to machine learning in Python and R.* Lagos Business School / markanalytics.online. <https://markanalytics.online>
Mba, C. (2026). *CA Consultants Limited MEP project register (2006–2026)* [Dataset]. Collected from CA Consultants Limited, Project Management Department, Lagos, Nigeria. Data available on request from the author.
R Core Team. (2024). *R: A language and environment for statistical computing* (Version 4.x). R Foundation for Statistical Computing. <https://www.R-project.org/>
Van Rossum, G., & Drake, F. L. (2009). *Python 3 reference manual.* CreateSpace.
Wickham, H., Averick, M., Bryan, J., Chang, W., McGowan, L., François, R., Grolemund, G., Hayes, A., Henry, L., Hester, J., Kuhn, M., Pedersen, T. L., Miller, E., Bache, S. M., Müller, K., Ooms, J., Robinson, D., Seidel, D. P., Spinu, V., … Yutani, H. (2019). Welcome to the tidyverse. *Journal of Open Source Software, 4*(43), 1686. <https://doi.org/10.21105/joss.01686>
Wickham, H. (2016). *ggplot2: Elegant graphics for data analysis.* Springer. <https://doi.org/10.1007/978-3-319-24277-4>
Sievert, C. (2020). *Interactive web-based data visualization with R, plotly, and shiny.* Chapman and Hall/CRC. <https://plotly-r.com>
Kassambara, A. (2023). *ggcorrplot: Visualization of a correlation matrix using ggplot2* (R package). <https://CRAN.R-project.org/package=ggcorrplot>
Fox, J., & Weisberg, S. (2019). *An R companion to applied regression* (3rd ed.). Sage. <https://socialsciences.mcmaster.ca/jfox/Books/Companion/>
McKinney, W. (2010). Data structures for statistical computing in Python. In *Proceedings of the 9th Python in Science Conference* (pp. 56–61). <https://doi.org/10.25080/Majora-92bf1922-00a>
Pedersen, T. L. (2024). *patchwork: The composer of plots* (R package). <https://CRAN.R-project.org/package=patchwork>
Kassambara, A. (2023). *rstatix: Pipe-friendly framework for basic statistical tests* (R package). <https://CRAN.R-project.org/package=rstatix>
Allaire, J. J., Teague, C., Scheidegger, C., Xie, Y., & Dervieux, C. (2022). *Quarto* (Version 1.x) [Computer software]. <https://doi.org/10.5281/zenodo.5960048>
<!-- ============================================================== -->
<!-- APPENDIX: GITHUB REPOSITORY -->
<!-- ============================================================== -->
# Appendix A: GitHub Repository (+5 Bonus Marks)
::: {.callout-tip}
## GitHub Repository
The complete source code and data for this analysis are publicly available at:
**[https://github.com/chidubemsbox-art/Dubydarego](https://github.com/chidubemsbox-art/Dubydarego)**
The repository contains:
- `CA_Consultants_MEP_Analysis.qmd` — the Quarto source document
- `mep_projects.csv` — the anonymised project dataset
- `README.md` — reproduction instructions
To reproduce: clone the repo, open the `.qmd` in RStudio, and click **Render**.
:::
<!-- ============================================================== -->
<!-- APPENDIX: AI USAGE STATEMENT -->
<!-- ============================================================== -->
# Appendix B: AI Usage Statement
This submission was prepared with the assistance of **Claude** (Anthropic), an AI coding assistant. Claude was used for: (1) generating and debugging R and Python code for data cleaning, statistical tests, and visualisation; (2) formatting the Quarto document structure and custom CSS styling; and (3) suggesting appropriate statistical methods given the data characteristics (e.g., recommending Kruskal-Wallis over ANOVA due to non-normality).
All **analytical decisions** were made independently by the author, including: the choice of Case Study 1 over alternatives, the selection of project value as the outcome variable, the formulation of both hypotheses based on real business questions at CA Consultants, the interpretation of all statistical outputs, and the integrated recommendation. The data was collected personally from CA Consultants' internal systems. The author can explain every line of code and every result in this document.
```{=html}
<div class="document-footer">
<div class="footer-divider"></div>
<div class="footer-name">Chidubem Mba</div>
<div>Senior Project Engineer & Project Manager</div>
<div style="font-size: 0.85rem; color: #8b949e; margin-top: 2px;">Matric No: 2025-MMBA-8-039</div>
<div class="footer-firm">CA Consultants Limited • Lagos, Nigeria</div>
<div class="footer-divider"></div>
<div class="footer-programme">
Submitted in partial fulfilment of the requirements for the<br>
<strong>Modular MBA — Data Analytics II</strong><br>
Lagos Business School, Pan-Atlantic University • May 2026
</div>
</div>
```