What factors cause delays in cybersecurity audit projects?

Author

Bisola Oladejo

Published

May 19, 2026

Executive Summary

Project delays are the most common issue we face in our cybersecurity advisory practice, with 72% of engagements running late. This paper was designed to understand what actually causes these delays and how the advisory team can avoid them.

Data on 100 completed assessment projects (PCI DSS, VAPT, ISO 27001, SWIFT, IMS, and Forensics) were gathered through a corporate survey of consultants. Five analytical techniques are used: exploratory data analysis (EDA), data visualisation, hypothesis testing, correlation analysis, and logistic regression.

We identified two behavioural predictors of delay in the logistic regression: security maturity (OR = 0.57, p = 0.047) and responsiveness (OR = 0.58, p = 0.075) each reduced the odds of delay by more than 40% per point, although the responsiveness estimate falls short of the conventional 5% significance level. The t-test separately confirmed a significant difference in responsiveness between on-time and delayed projects (p = 0.018, d = 0.545).

The most important recommendation is to measure pre-engagement security maturity for all clients and to monitor client responsiveness after meetings. These interventions are low-cost, can be applied immediately, and directly address the causes of delay that the data reveal.

Professional Disclosure

Job Title and Organisation

Senior Manager | Head Of Department (Cybersecurity & Compliance Advisory) at Digital Encode Limited

I work in the cybersecurity consulting and compliance advisory sector. I lead and support cybersecurity compliance assessment projects for clients in the banking, fintech, telecoms and enterprise sectors in Nigeria. My work includes the coordination of PCI DSS, SWIFT CSP, ISO 27001, vulnerability assessment and other related cybersecurity projects.

This analysis addresses a practical problem: identifying the variables that most influence project duration, project delay, and security risk profile, so that cybersecurity compliance assessment engagements can be planned more reliably. The results will aid planning, resourcing, risk management and client engagement.

Technique Justification

Each of the five analytical techniques applied in this case study was chosen because it addresses a specific operational question I face in my daily work.

Exploratory Data Analysis (EDA): Before any engagement begins, I need to understand the landscape. EDA allows me to examine the data and identify quality issues in our clients' environments, policies and procedures. In operational terms, EDA is the equivalent of scoping a new engagement: you cannot assess or fix what you have not measured.

Data Visualisation: I regularly present project performance summaries to departmental leadership and to clients. Visualisation transforms raw project data into patterns that non-technical stakeholders can immediately grasp. The plots in this analysis form a narrative I could present directly to a management committee considering project decisions.

Hypothesis Testing: In advisory work, we frequently make assumptions: “remote projects experience more delays,” or “banks perform better than fintechs.” Hypothesis testing replaces professional guesses with evidence. By formulating null and alternative hypotheses, checking assumptions, and reporting the results, I can determine which perceived patterns are statistically real and which are noise. This is directly applicable when reviewing project delivery policies, for instance when deciding whether delivery mode genuinely affects project timelines.
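
As a minimal sketch of the kind of two-group comparison described above, the snippet below computes Welch's t statistic and Cohen's d for responsiveness ratings in on-time and delayed projects. The ratings are invented for illustration and are not survey data, and the helper names `cohens_d` and `welch_t` are hypothetical. In practice the p-value would come from a library routine such as `scipy.stats.ttest_ind(..., equal_var=False)`.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(a, b):
    """Standardised mean difference using the pooled sample standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / sqrt(pooled_var)


def welch_t(a, b):
    """Welch's t statistic, which does not assume equal group variances."""
    standard_error = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / standard_error


# Hypothetical 1-5 responsiveness ratings (illustrative only, not survey data)
on_time = [4, 5, 3, 4, 4, 5, 3, 4]
delayed = [3, 2, 4, 3, 3, 2, 4, 3, 3, 2]

d = cohens_d(on_time, delayed)
t = welch_t(on_time, delayed)
print(f"Cohen's d = {d:.2f}, Welch t = {t:.2f}")
```

Reporting the effect size next to the test statistic shows whether a statistically detectable difference is also large enough to matter operationally.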

Correlation Analysis: Understanding which project variables move together helps me advise clients on where to focus their preparation efforts. If security maturity correlates with compliance score, then pre-engagement maturity assessments become a defensible recommendation, not just an opinion. The correlation matrix also identifies redundancy — variables that measure the same underlying construct — which prevents duplicate effort in data collection and client reporting.
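
Because responsiveness and maturity are recorded on 1-5 ordinal scales, a rank-based (Spearman) coefficient is one reasonable way to measure how two variables move together. The sketch below assumes that choice; it is not necessarily the coefficient used in the analysis. The function names and sample values are hypothetical.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of tied values that starts at position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman's rho: the Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical maturity ratings and compliance scores (illustrative only)
maturity = [1, 2, 2, 3, 3, 4, 4, 5]
compliance = [55, 60, 70, 72, 80, 85, 83, 95]
rho = spearman(maturity, compliance)
print(f"Spearman rho = {rho:.2f}")
```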

Logistic Regression: The most operationally valuable question I can answer is: “If we improve client readiness by x, how much does that reduce the risk of a project running late?” Logistic regression quantifies the change in the odds of a delay for each unit improvement in a predictor, holding other factors constant. That turns a generic suggestion into a testable business case.
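
The arithmetic behind an odds ratio can be sketched as follows. The two odds ratios are those reported in the Executive Summary; the helper names are hypothetical. Using the overall 72% delay rate as the baseline is a simplifying assumption for illustration, since the model's baseline depends on the other predictors.

```python
def pct_change_in_odds(odds_ratio):
    """Percentage change in the odds for a one-unit increase in the predictor."""
    return (odds_ratio - 1.0) * 100.0


def shifted_probability(baseline_prob, odds_ratio, units=1):
    """Apply an odds ratio `units` times to a baseline probability."""
    odds = baseline_prob / (1.0 - baseline_prob)
    new_odds = odds * odds_ratio ** units
    return new_odds / (1.0 + new_odds)


# Odds ratios reported in the Executive Summary
for name, ratio in [("responsiveness", 0.58), ("security maturity", 0.57)]:
    print(f"{name}: {pct_change_in_odds(ratio):+.0f}% change in odds per point")

# Illustration only: treats the overall 72% delay rate as the baseline
p_new = shifted_probability(0.72, 0.57)
print(f"Delay probability after +1 maturity point: {p_new:.2f}")
```

An odds ratio below 1 lowers the odds multiplicatively, so the change in probability depends on the starting point. The same ratio moves a 72% baseline much further in absolute terms than it would move a 10% baseline.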

Data Collection & Sampling

Data Provenance Notes

The dataset for this analysis was collected through an internal structured survey covering the company's historical cybersecurity compliance assessment projects from 2023 to 2026. The survey was approved by the chief project manager and associate director, and was completed by the cybersecurity consultants and clients' project team members.

It captures project-level data such as project duration, type of assessment, team size, number of systems in scope, client responsiveness, security maturity, compliance score and number of vulnerabilities.

In order to maintain confidentiality and adhere to ethical considerations, all Personally Identifiable Information (PII), client-identifying information and sensitive operational information were removed or altered before analysis. The data set is used solely for academic purposes and will be part of the Executive MBA Data Analytics Capstone assessment at Lagos Business School.

Respondents to the survey were not obliged to participate. All individual responses will be anonymised and reported on as averages/summaries only. Part of the information gathered relates to sensitive client-specific data and business operational data, and this will not be published in the final report.

Business Question

I want to understand, and accurately predict, the factors that affect project duration (including delays) and security risk in cybersecurity compliance assessment projects, based on historical project and vulnerability assessment data. This information guides decisions on project pricing and scheduling, resource allocation, and risk management in the execution of cybersecurity projects.

Data Description and Exploratory Data Analysis (EDA)

The dataset used in this project is a collection of 100 past security assessment projects from various departments, including PCI DSS, ISO 27001, SWIFT CSP, vulnerability assessments, and forensic engagements. It contains a mix of structured categorical and numerical variables.

Categorical Variables

The dataset includes the following categorical attributes:

- Client industry
- Approximate client organisation size (Small, Medium, Large, Enterprise)
- Project delivery mode
- Type of cybersecurity assessment
- Client identifier
- Project delay status (binary outcome: On-time or Delayed)

These variables describe the context of each project and serve as the segmentation variables for the analysis.

Numerical Variables

The dataset also contains several continuous or ordinal numerical variables:

- Client responsiveness level (1–5 scale)
- Security maturity of client at project start (1–5 scale)
- Overall compliance score (%)
- Total duration of the project (days)
- Total number of vulnerabilities identified
- Number of critical/high vulnerabilities
- Number of systems and/or applications in scope
- Number of consultants assigned
- Number of client meetings held during the engagement

Code

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from scipy.stats import (
    ttest_ind, chi2_contingency, f_oneway, 
    mannwhitneyu, kruskal, pearsonr, spearmanr
)
import re
import statsmodels.api as sm
import statsmodels.formula.api as smf
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.metrics import classification_report, roc_curve, auc, confusion_matrix

import warnings
warnings.filterwarnings('ignore')

plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette('Set2')
plt.rcParams['figure.dpi'] = 100
plt.rcParams['font.size'] = 10

# Load the dataset 
df = pd.read_csv('dataset.csv')
print(f"Columns: {df.columns.tolist()}")
print(f"Dataset shape: {df.shape[0]} rows × {df.shape[1]} columns")
print()

Columns: ['Timestamp', 'Type of cybersecurity assessment', 'Client identifier', 'Project delivery mode', 'Client industry', 'Approximate client organisation size', 'Years of Assessment/Project', ' Total number of consultants on the project', 'Number of systems and/or applications in scope', 'Approximate Number of meetings held with the client during the project', 'Client responsiveness level', ' Security maturity of client at project start', 'Total duration of the project (days)', 'Did the project experience delays?', 'Overall compliance score (%)', 'Total number of vulnerabilities identified', 'Number of critical/high vulnerabilities']
Dataset shape: 100 rows × 17 columns

Cleaning the data columns

The output of the previous cell shows whitespace in some of the column names. The raw survey export can also contain sensitive columns holding client and responder names, and an empty column. I handle these issues as follows: first, strip whitespace from the column names so they can be referenced easily wherever they are needed, then drop the sensitive and empty columns.

Code

df.columns = df.columns.str.strip()
df = df.drop(columns=['Column 18', 'RESPONDER NAME & DEPARTMENT', 'Client name'], errors='ignore')

df.to_csv('assessment_clean.csv', index=False)
print(f"Shape after dropping columns: {df.shape}")
print(f"Columns: {df.columns.tolist()}")
df.info()

Shape after dropping columns: (100, 17)
Columns: ['Timestamp', 'Type of cybersecurity assessment', 'Client identifier', 'Project delivery mode', 'Client industry', 'Approximate client organisation size', 'Years of Assessment/Project', 'Total number of consultants on the project', 'Number of systems and/or applications in scope', 'Approximate Number of meetings held with the client during the project', 'Client responsiveness level', 'Security maturity of client at project start', 'Total duration of the project (days)', 'Did the project experience delays?', 'Overall compliance score (%)', 'Total number of vulnerabilities identified', 'Number of critical/high vulnerabilities']
<class 'pandas.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 17 columns):
 #   Column                                                                  Non-Null Count  Dtype
---  ------                                                                  --------------  -----
 0   Timestamp                                                               100 non-null    str  
 1   Type of cybersecurity assessment                                        100 non-null    str  
 2   Client identifier                                                       100 non-null    str  
 3   Project delivery mode                                                   100 non-null    str  
 4   Client industry                                                         100 non-null    str  
 5   Approximate client organisation size                                    100 non-null    str  
 6   Years of Assessment/Project                                             100 non-null    int64
 7   Total number of consultants on the project                              100 non-null    int64
 8   Number of systems and/or applications in scope                          100 non-null    str  
 9   Approximate Number of meetings held with the client during the project  100 non-null    int64
 10  Client responsiveness level                                             100 non-null    int64
 11  Security maturity of client at project start                            100 non-null    int64
 12  Total duration of the project (days)                                    100 non-null    str  
 13  Did the project experience delays?                                      100 non-null    str  
 14  Overall compliance score (%)                                            99 non-null     str  
 15  Total number of vulnerabilities identified                              100 non-null    str  
 16  Number of critical/high vulnerabilities                                 100 non-null    int64
dtypes: int64(6), str(11)
memory usage: 13.4 KB

Handling missing values, checking for duplicates and standardizing numeric fields currently stored as strings

There is one missing value in the Overall compliance score column, and several columns that should be numeric are stored as strings. The missing compliance score is filled with the column median to avoid dropping a row, which would reduce the amount of usable data. Duplicates are also checked here, and the remaining data-type issues are handled in subsequent cells.

Code

#Change compliance values to float, so missing value can be filled with the median value
def clean_compliance(val):
    if pd.isna(val):
        return np.nan
    val = str(val).strip().replace('%', '')
    try:
        return float(val)
    except ValueError:
        return np.nan

df['Overall compliance score (%)'] = df['Overall compliance score (%)'].apply(clean_compliance)
df['Overall compliance score (%)'] = df['Overall compliance score (%)'].fillna(df['Overall compliance score (%)'].median())
print(df.isna().sum())

#Check for duplicate rows
print(f"duplicate rows: {df.duplicated().sum()}")

Timestamp                                                                 0
Type of cybersecurity assessment                                          0
Client identifier                                                         0
Project delivery mode                                                     0
Client industry                                                           0
Approximate client organisation size                                      0
Years of Assessment/Project                                               0
Total number of consultants on the project                                0
Number of systems and/or applications in scope                            0
Approximate Number of meetings held with the client during the project    0
Client responsiveness level                                               0
Security maturity of client at project start                              0
Total duration of the project (days)                                      0
Did the project experience delays?                                        0
Overall compliance score (%)                                              0
Total number of vulnerabilities identified                                0
Number of critical/high vulnerabilities                                   0
dtype: int64
duplicate rows: 0

Code

# Cleaning Total number of vulnerabilities identified and Number of critical/high vulnerabilities
# Problem: Mixed formats:
#   - Ranges: "50 to 100", "0 to 50"
#   - Text: "65 - 70", "52 vulnerabilities"
#   - Comparators: ">150", ">200"
#   - Words: "Nil", "None", "Less than 20"
# Solution: Extract all numbers, take midpoint for ranges,
#           single number for exact values, preserve >N as N

def clean_vuln_range(val):
    if pd.isna(val):
        return np.nan
    val = str(val).strip().lower()
    if val in ['nil', 'none', '0', '']:
        return 0.0
    if val.startswith('>'):
        num = ''.join(c for c in val if c.isdigit())
        return float(num) if num else np.nan
    if 'less than' in val:
        num = ''.join(c for c in val if c.isdigit())
        return float(num) * 0.5 if num else np.nan
    numbers = re.findall(r'\d+', val)
    if len(numbers) >= 2:
        return np.mean([float(n) for n in numbers[:2]])
    elif len(numbers) == 1:
        return float(numbers[0])
    else:
        return np.nan

df['Total number of vulnerabilities identified'] = df['Total number of vulnerabilities identified'].apply(clean_vuln_range)
df['Number of critical/high vulnerabilities'] = df['Number of critical/high vulnerabilities'].apply(clean_vuln_range)

print(df[['Total number of vulnerabilities identified', 'Number of critical/high vulnerabilities']].head())

   Total number of vulnerabilities identified  \
0                                        75.0   
1                                        25.0   
2                                        75.0   
3                                        25.0   
4                                        25.0   

   Number of critical/high vulnerabilities  
0                                      4.0  
1                                      5.0  
2                                      3.0  
3                                     20.0  
4                                      2.0

Code

# CLEAN PROJECT DURATION 
# Problem: Column contains mixed formats:
#   - Plain numbers: "90", "14"
#   - With text: "100 working days", "3 months", "15 days"
# Solution: Extract numeric value, handle units
#  Total duration of the project (days) 

import pandas as pd
import re

def simple_duration_fix(val):
    """
    If the string contains 'month' → extract first number and multiply by 30.
    If it contains 'day'    → extract first number and use it directly.
    Otherwise, return the value unchanged.
    """
    if pd.isna(val):
        return val
    val = str(val).strip().lower()
    
    # Only act if 'month' or 'day' is present
    if 'month' in val:
        nums = re.findall(r'\d+', val)
        if nums:
            return min(float(nums[0]) * 30, 2000)   # cap at 2000 days
    elif 'day' in val:
        nums = re.findall(r'\d+', val)
        if nums:
            return min(float(nums[0]), 2000)
    
    # Everything else (plain numbers, ranges, ">200") is returned as a stripped string;
    # pd.to_numeric(..., errors='coerce') below then turns non-numeric text into NaN
    return val

df['Total duration of the project (days)'] = df['Total duration of the project (days)'].apply(simple_duration_fix)

df['Total duration of the project (days)'] = pd.to_numeric(df['Total duration of the project (days)'], errors='coerce')

Code

# CLEAN: Approximate client organisation size 

def clean_size(val):
    if pd.isna(val):
        return np.nan
    val = str(val).strip().lower()
    if 'small' in val:
        return 'Small'
    elif 'medium' in val:
        return 'Medium'
    elif 'large' in val:
        return 'Large'
    elif 'enterprise' in val:
        return 'Enterprise'
    return val

df['Approximate client organisation size'] = df['Approximate client organisation size'].apply(clean_size)
print("Cleaned: Approximate client organisation size")
print(df['Approximate client organisation size'].value_counts())
df.head()

Cleaned: Approximate client organisation size
Approximate client organisation size
Small         34
Medium        26
Enterprise    24
Large         16
Name: count, dtype: int64

	Timestamp	Type of cybersecurity assessment	Client identifier	Project delivery mode	Client industry	Approximate client organisation size	Years of Assessment/Project	Total number of consultants on the project	Number of systems and/or applications in scope	Approximate Number of meetings held with the client during the project	Client responsiveness level	Security maturity of client at project start	Total duration of the project (days)	Did the project experience delays?	Overall compliance score (%)	Total number of vulnerabilities identified	Number of critical/high vulnerabilities
0	05/10/2026 11:53	PCI DSS	CLIENT 001	Fully Remote	Banking / Financial Services	Large	2024	3	0 to 50	10	3	4	90.0	Yes	70.0	75.0	4.0
1	05/10/2026 11:56	PCI DSS	CLIENT 002	Fully Remote	Banking / Financial Services	Large	2025	3	0 to 50	15	3	3	150.0	Yes	80.0	25.0	5.0
2	05/10/2026 12:08	PCI DSS	CLIENT 003	Hybrid	Banking / Financial Services	Enterprise	2025	2	100 to 150	10	2	4	210.0	Yes	80.0	75.0	3.0
3	05/10/2026 12:25	ISO 27001	CLIENT 004	Hybrid	Banking / Financial Services	Large	2023	2	50 to 100	80	3	3	100.0	Yes	85.0	25.0	20.0
4	05/10/2026 12:32	PCI DSS	CLIENT 005	Fully Remote	Banking / Financial Services	Small	2025	2	0 to 50	15	3	2	140.0	Yes	95.0	25.0	2.0

Code

import re

def clean_range(val):
    """Convert range strings like '0 to 50', '> 200' to numeric midpoint."""
    if pd.isna(val):
        return np.nan
    val = str(val).strip().lower()
    if val in ['nil', 'none', '', '0']:
        return 0.0
    if val.startswith('>'):
        num = ''.join(c for c in val if c.isdigit())
        return float(num) if num else np.nan
    if 'less than' in val:
        num = ''.join(c for c in val if c.isdigit())
        return float(num) * 0.5 if num else np.nan
    # Extract all numbers
    numbers = re.findall(r'\d+', val)
    if len(numbers) >= 2:
        return np.mean([float(n) for n in numbers[:2]])
    elif len(numbers) == 1:
        return float(numbers[0])
    else:
        return np.nan

# Apply the cleaner
df['Number of systems and/or applications in scope'] = df['Number of systems and/or applications in scope'].apply(clean_range)

# Verify
print(f"After cleaning: {df['Number of systems and/or applications in scope'].dropna().shape[0]} valid rows")
print(f"Sample values: {df['Number of systems and/or applications in scope'].dropna().head(10).tolist()}")

After cleaning: 100 valid rows
Sample values: [25.0, 25.0, 125.0, 75.0, 25.0, 25.0, 200.0, 25.0, 25.0, 25.0]
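
The parsing rules shared by `clean_vuln_range` and `clean_range` can be checked against a few sample strings. The sketch below is a simplified standard-library restatement of those rules, without the pandas missing-value handling. The function name `parse_count` and the sample inputs are assumptions for illustration.

```python
import re


def parse_count(text):
    """Apply the cleaners' rules to one raw survey answer; None if no number found."""
    val = str(text).strip().lower()
    if val in ("nil", "none", "0", ""):
        return 0.0
    numbers = [float(n) for n in re.findall(r"\d+", val)]
    if not numbers:
        return None                            # nothing numeric to recover
    if val.startswith(">"):
        return numbers[0]                      # ">150" is kept as its lower bound
    if "less than" in val:
        return numbers[0] * 0.5                # "less than 20" becomes half the bound
    if len(numbers) >= 2:
        return (numbers[0] + numbers[1]) / 2   # midpoint of a range
    return numbers[0]


# Hypothetical raw answers covering each rule
samples = ["0 to 50", "65 - 70", "52 vulnerabilities", ">150", "Nil", "Less than 20", "unknown"]
for raw in samples:
    print(f"{raw!r:>22} -> {parse_count(raw)}")
```

Keeping `>N` as `N` understates the true count for open-ended answers, so totals built from these values are best read as lower bounds.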

Code

# Standardise formats
df['Type of cybersecurity assessment'] = df['Type of cybersecurity assessment'].str.strip().str.upper()
df['Project delivery mode'] = df['Project delivery mode'].str.strip().str.title()
df['Client identifier'] = df['Client identifier'].str.strip().str.upper()

# Return Yes or No as 1 or 0
# strip whitespace, lowercase, then map
df['Did the project experience delays?'] = (
    df['Did the project experience delays?']
    .str.strip()
    .str.lower()
    .map({'yes': 1, 'no': 0})
)

# Numeric conversions
df['Client responsiveness level'] = pd.to_numeric(df['Client responsiveness level'], errors='coerce')
df['Security maturity of client at project start'] = pd.to_numeric(df['Security maturity of client at project start'], errors='coerce')
df['Years of Assessment/Project'] = pd.to_numeric(df['Years of Assessment/Project'], errors='coerce')

df.info()

<class 'pandas.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 17 columns):
 #   Column                                                                  Non-Null Count  Dtype  
---  ------                                                                  --------------  -----  
 0   Timestamp                                                               100 non-null    str    
 1   Type of cybersecurity assessment                                        100 non-null    str    
 2   Client identifier                                                       100 non-null    str    
 3   Project delivery mode                                                   100 non-null    str    
 4   Client industry                                                         100 non-null    str    
 5   Approximate client organisation size                                    100 non-null    str    
 6   Years of Assessment/Project                                             100 non-null    int64  
 7   Total number of consultants on the project                              100 non-null    int64  
 8   Number of systems and/or applications in scope                          100 non-null    float64
 9   Approximate Number of meetings held with the client during the project  100 non-null    int64  
 10  Client responsiveness level                                             100 non-null    int64  
 11  Security maturity of client at project start                            100 non-null    int64  
 12  Total duration of the project (days)                                    100 non-null    float64
 13  Did the project experience delays?                                      100 non-null    int64  
 14  Overall compliance score (%)                                            100 non-null    float64
 15  Total number of vulnerabilities identified                              100 non-null    float64
 16  Number of critical/high vulnerabilities                                 100 non-null    float64
dtypes: float64(5), int64(6), str(6)
memory usage: 13.4 KB

Code

# Detecting outliers
outlier_cols = [
    'Total duration of the project (days)',
    'Overall compliance score (%)',
    'Total number of vulnerabilities identified',
    'Number of critical/high vulnerabilities'
]

fig, axes = plt.subplots(2, 2, figsize=(8, 5))
for i, col in enumerate(outlier_cols):
    ax = axes[i//2, i%2]
    ax.boxplot(df[col].dropna(), vert=True, patch_artist=True,
               boxprops=dict(facecolor='steelblue', alpha=0.6))
    ax.set_title(col, fontsize=11, fontweight='bold')
    ax.set_ylabel('Value')
plt.tight_layout()
plt.show()

# Print IQR bounds and outlier counts
print("OUTLIER SUMMARY (IQR Method)")
print(f"{'Variable':<50} {'Lower':>8} {'Upper':>8} {'Outliers':>8}")
print("-" * 75)
for col in outlier_cols:
    data = df[col].dropna()
    Q1 = data.quantile(0.25)
    Q3 = data.quantile(0.75)
    IQR = Q3 - Q1
    lower = Q1 - 1.5 * IQR
    upper = Q3 + 1.5 * IQR
    n_out = len(data[(data < lower) | (data > upper)])
    print(f"{col:<50} {lower:>8.1f} {upper:>8.1f} {n_out:>8}")

OUTLIER SUMMARY (IQR Method)
Variable                                              Lower    Upper Outliers
---------------------------------------------------------------------------
Total duration of the project (days)                 -128.1    268.9        4
Overall compliance score (%)                           22.5    130.5        0
Total number of vulnerabilities identified            -50.0    150.0        0
Number of critical/high vulnerabilities               -29.4     59.6        8

Code

outlier_cols = [
    'Total duration of the project (days)',
    'Overall compliance score (%)',
    'Total number of vulnerabilities identified',
    'Number of critical/high vulnerabilities'
]

for col in outlier_cols:
    data = df[col].dropna()
    Q1 = data.quantile(0.25)
    Q3 = data.quantile(0.75)
    IQR = Q3 - Q1
    lower = Q1 - 1.5 * IQR
    upper = Q3 + 1.5 * IQR
    
    outliers = df[(df[col] < lower) | (df[col] > upper)]
    print(f"\nOutliers in '{col}' (IQR bounds: [{lower:.1f}, {upper:.1f}]):")
    if len(outliers) == 0:
        print("  None")
    else:
        for _, row in outliers.iterrows():
            print(f"  Client {row['Client identifier']}: {row[col]}")


Outliers in 'Total duration of the project (days)' (IQR bounds: [-128.1, 268.9]):
  Client CLIENT 020: 365.0
  Client CLIENT 035: 365.0
  Client CLIENT 079: 1825.0
  Client CLIENT 095: 1460.0

Outliers in 'Overall compliance score (%)' (IQR bounds: [22.5, 130.5]):
  None

Outliers in 'Total number of vulnerabilities identified' (IQR bounds: [-50.0, 150.0]):
  None

Outliers in 'Number of critical/high vulnerabilities' (IQR bounds: [-29.4, 59.6]):
  Client CLIENT 016: 70.0
  Client CLIENT 020: 80.0
  Client CLIENT 037: 85.0
  Client CLIENT 056: 82.0
  Client CLIENT 057: 80.0
  Client CLIENT 077: 90.0
  Client CLIENT 079: 500.0
  Client CLIENT 080: 65.0

Outlier handling

Once the outliers were identified in the preceding cells, each one was examined to ascertain whether it represented a genuine business case or a data-entry error. They proved to be real business cases and were kept in the data.
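
The IQR rule applied above can be sketched with the standard library alone. The durations below are hypothetical and the function name `iqr_fences` is an assumption. The `inclusive` quantile method is used because it interpolates linearly between data points, in the same way as the default behaviour of pandas' `quantile`.

```python
from statistics import mean, median, quantiles


def iqr_fences(values, k=1.5):
    """Return the (lower, upper) Tukey fences from linearly interpolated quartiles."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Hypothetical project durations in days (illustrative only)
durations = [14, 30, 45, 60, 90, 90, 120, 150, 210, 1825]

lower, upper = iqr_fences(durations)
flagged = [d for d in durations if d < lower or d > upper]
print(f"fences = [{lower:.1f}, {upper:.1f}], flagged = {flagged}")
print(f"mean = {mean(durations):.1f}, median = {median(durations):.1f}")
```

Because genuine extreme values are retained, the mean is pulled strongly towards them. Medians or rank-based methods are less sensitive when summarising duration.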

Code

print("=" * 60)
print("DISTRIBUTION ANALYSIS — HISTOGRAMS WITH KEY STATISTICS")
print("=" * 60)

dist_cols = [
    ('Overall compliance score (%)', 'Compliance Score (%)', 'steelblue'),
    ('Total duration of the project (days)', 'Project Duration (Days)', 'coral'),
    ('Total number of vulnerabilities identified', 'Total Vulnerabilities', 'darkred'),
    ('Number of critical/high vulnerabilities', 'Critical/High Vulnerabilities', 'firebrick'),
    ('Client responsiveness level', 'Client Responsiveness (1-5)', 'seagreen'),
    ('Security maturity of client at project start', 'Security Maturity at Start (1-5)', 'purple')
]

fig, axes = plt.subplots(3, 2, figsize=(10, 7))
fig.suptitle('Distribution of Key Numeric Variables', fontsize=14, fontweight='bold')

for i, (col, title, color) in enumerate(dist_cols):
    ax = axes[i//2, i%2]
    data = df[col].dropna()
    
    ax.hist(data, bins=20, color=color, edgecolor='white', alpha=0.8)
    ax.axvline(data.mean(), color='red', linestyle='--', linewidth=2, label=f'Mean: {data.mean():.1f}')
    ax.axvline(data.median(), color='black', linestyle='-', linewidth=2, label=f'Median: {data.median():.1f}')
    
    skew = data.skew()
    ax.set_title(f'{title}\nSkewness: {skew:.2f}', fontweight='bold', fontsize=11)
    ax.legend(fontsize=8)
    ax.set_xlabel('')

plt.tight_layout()
plt.show()

# Print skewness interpretation
print("\nSkewness Interpretation:")
for col, title, _ in dist_cols:
    skew = df[col].dropna().skew()
    if abs(skew) < 0.5:
        interp = "approximately symmetric"
    elif abs(skew) < 1:
        interp = "moderately skewed"
    else:
        interp = "highly skewed"
    print(f"  {title:<35}: {skew:+.2f} ({interp})")

============================================================
DISTRIBUTION ANALYSIS — HISTOGRAMS WITH KEY STATISTICS
============================================================


Skewness Interpretation:
  Compliance Score (%)               : -0.84 (moderately skewed)
  Project Duration (Days)            : +6.16 (highly skewed)
  Total Vulnerabilities              : +1.16 (highly skewed)
  Critical/High Vulnerabilities      : +7.71 (highly skewed)
  Client Responsiveness (1-5)        : +0.02 (approximately symmetric)
  Security Maturity at Start (1-5)   : +0.03 (approximately symmetric)
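
For reference, the skewness figures can be reproduced without pandas. The sketch below implements the bias-corrected (adjusted Fisher-Pearson) sample skewness, which to my understanding is the estimator behind `Series.skew`, together with the same interpretation thresholds as the cell above. The function names and sample lists are hypothetical.

```python
from math import sqrt


def sample_skewness(values):
    """Adjusted Fisher-Pearson (bias-corrected) sample skewness."""
    n = len(values)
    if n < 3:
        raise ValueError("skewness needs at least three observations")
    m = sum(values) / n
    m2 = sum((v - m) ** 2 for v in values) / n   # second central moment
    m3 = sum((v - m) ** 3 for v in values) / n   # third central moment
    g1 = m3 / m2 ** 1.5
    return sqrt(n * (n - 1)) / (n - 2) * g1


def describe_skew(skew):
    """Label a skewness value using the thresholds from the cell above."""
    if abs(skew) < 0.5:
        return "approximately symmetric"
    if abs(skew) < 1:
        return "moderately skewed"
    return "highly skewed"


# Hypothetical samples: one symmetric, one with a single extreme value
symmetric = [1, 2, 3, 4, 5]
right_tailed = [10, 12, 15, 20, 400]
for label, data in [("symmetric", symmetric), ("right-tailed", right_tailed)]:
    s = sample_skewness(data)
    print(f"{label:<13}: {s:+.2f} ({describe_skew(s)})")
```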

Data Storytelling

Code

# ============================================================
# EDA: INDUSTRY BREAKDOWN
# ============================================================

print("=" * 60)
print("INDUSTRY BREAKDOWN")
print("=" * 60)

industry_counts = df['Client industry'].value_counts()
industry_pct = (industry_counts / len(df) * 100).round(1)

print("\nIndustry Distribution:")
for ind in industry_counts.index:
    print(f"  {ind:<30}: {industry_counts[ind]:3d} clients ({industry_pct[ind]:.1f}%)")

# Compliance score by industry
print("\nCompliance Score by Industry:")
ind_compliance = df.groupby('Client industry')['Overall compliance score (%)'].agg(['mean', 'median', 'std', 'count']).round(1)
print(ind_compliance)

# Delay rate by industry
print("\nProject Delay Rate by Industry:")
ind_delay = df.groupby('Client industry')['Did the project experience delays?'].mean().mul(100).round(1)
for ind, rate in ind_delay.sort_values(ascending=False).items():
    print(f"  {ind:<30}: {rate:.1f}%")

============================================================
INDUSTRY BREAKDOWN
============================================================

Industry Distribution:
  Banking / Financial Services  :  62 clients (62.0%)
  Other                         :  20 clients (20.0%)
  Fintech                       :  15 clients (15.0%)
  Telecommunications            :   2 clients (2.0%)
  E-commerce                    :   1 clients (1.0%)

Compliance Score by Industry:
                              mean  median   std  count
Client industry                                        
Banking / Financial Services  74.5    80.0  17.3     62
E-commerce                    90.0    90.0   NaN      1
Fintech                       85.5    88.0  11.7     15
Other                         75.4    85.0  17.4     20
Telecommunications            95.0    95.0   7.1      2

Project Delay Rate by Industry:
  E-commerce                    : 100.0%
  Telecommunications            : 100.0%
  Other                         : 75.0%
  Fintech                       : 73.3%
  Banking / Financial Services  : 69.4%
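
The 100% delay rates for E-commerce and Telecommunications rest on one and two projects respectively, so they carry very little information. One way to make that visible is a Wilson score interval for each proportion, sketched below. The delayed counts are inferred from the printed rates and group sizes rather than read from the dataset, and the function name `wilson_interval` is hypothetical.

```python
from math import sqrt


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z = 1.96 for roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z ** 2 / n
    centre = (p + z ** 2 / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


# (delayed, total) per industry; delayed counts inferred from the rates above
groups = {
    "Banking / Financial Services": (43, 62),
    "Other": (15, 20),
    "Fintech": (11, 15),
    "Telecommunications": (2, 2),
    "E-commerce": (1, 1),
}

for industry, (delayed, total) in groups.items():
    low, high = wilson_interval(delayed, total)
    print(f"{industry:<30}: {delayed / total:6.1%}  (approx. 95% CI {low:.0%} to {high:.0%})")
```

Wide intervals for the smallest groups indicate that their delay rates should not be compared directly with the Banking figure.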

Code

# ============================================================
# EDA: DELIVERY MODE COMPARISON
# ============================================================

print("=" * 60)
print("DELIVERY MODE COMPARISON")
print("=" * 60)

mode_counts = df['Project delivery mode'].value_counts()
print("\nDelivery Mode Distribution:")
for mode in mode_counts.index:
    print(f"  {mode:<20}: {mode_counts[mode]:3d} projects")

# Metrics by delivery mode
mode_stats = df.groupby('Project delivery mode').agg({
    'Overall compliance score (%)': ['mean', 'median'],
    'Total duration of the project (days)': ['mean', 'median'],
    'Did the project experience delays?': 'mean',
    'Client responsiveness level': 'mean'
}).round(2)

print("\nKey Metrics by Delivery Mode:")
print(mode_stats)

============================================================
DELIVERY MODE COMPARISON
============================================================

Delivery Mode Distribution:
  Hybrid              :  73 projects
  Fully Remote        :  24 projects
  Fully Onsite        :   3 projects

Key Metrics by Delivery Mode:
                      Overall compliance score (%)         \
                                              mean median   
Project delivery mode                                       
Fully Onsite                                 73.33   75.0   
Fully Remote                                 73.00   75.0   
Hybrid                                       78.32   83.0   

                      Total duration of the project (days)         \
                                                      mean median   
Project delivery mode                                               
Fully Onsite                                          8.00    5.0   
Fully Remote                                         61.79   52.5   
Hybrid                                              134.68   90.0   

                      Did the project experience delays?  \
                                                    mean   
Project delivery mode                                      
Fully Onsite                                        0.33   
Fully Remote                                        0.75   
Hybrid                                              0.73   

                      Client responsiveness level  
                                             mean  
Project delivery mode                              
Fully Onsite                                 3.33  
Fully Remote                                 3.25  
Hybrid                                       3.60

Delay Patterns

We plot bar charts as an exploratory look at how delays are distributed across assessment types and industries.

Code

fig, axes = plt.subplots(1, 2, figsize=(10, 5))

# Left: delay rate by assessment type
delay_by_type = df.groupby('Type of cybersecurity assessment')['Did the project experience delays?'].mean().mul(100)
order = delay_by_type.sort_values().index
axes[0].bar(order, delay_by_type[order], color='steelblue', edgecolor='white')
axes[0].set_title('Assessment Type')
axes[0].set_ylabel('Delay Rate (%)')
axes[0].set_ylim(0, 100)
axes[0].tick_params(axis='x', rotation=45)
for i, val in enumerate(delay_by_type[order]):
    axes[0].text(i, val + 2, f'{val:.0f}%', ha='center', fontsize=9)

# Right: delay rate by industry
delay_by_ind = df.groupby('Client industry')['Did the project experience delays?'].mean().mul(100)
order_ind = delay_by_ind.sort_values().index
axes[1].bar(order_ind, delay_by_ind[order_ind], color='coral', edgecolor='white')
axes[1].set_title('Industry')
axes[1].set_ylabel('Delay Rate (%)')
axes[1].set_ylim(0, 100)
axes[1].tick_params(axis='x', rotation=45)
for i, val in enumerate(delay_by_ind[order_ind]):
    axes[1].text(i, val + 2, f'{val:.0f}%', ha='center', fontsize=9)

plt.tight_layout()
plt.show()

PLOT INSIGHT:

The first plot indicates that delay rates are high across most assessment types, with SWIFT as the exception. The sample is small, however, so it is not sufficient to draw firm business conclusions.
The second plot shows that the banking and financial services sector has the lowest delay rate at 69%, which is still high and needs to be addressed.
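To make the small-sample caveat concrete, the sketch below attaches an approximate 95% Wilson score interval to each industry's delay rate. It is an illustration rather than part of the original analysis: the delayed counts are back-calculated from the reported rates and client counts (an assumption), and `wilson_interval` is a helper name introduced here.

```python
import math

def wilson_interval(delayed, total, z=1.96):
    """Approximate 95% Wilson score interval for a proportion."""
    p_hat = delayed / total
    denom = 1 + z**2 / total
    centre = (p_hat + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / total + z**2 / (4 * total**2)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)

# (delayed, total) per industry. Delayed counts are back-calculated from the
# reported delay rates and client counts, which is an assumption of this sketch.
industry_counts = {
    "Banking / Financial Services": (43, 62),
    "Other": (15, 20),
    "Fintech": (11, 15),
    "Telecommunications": (2, 2),
    "E-commerce": (1, 1),
}

intervals = {name: wilson_interval(k, n) for name, (k, n) in industry_counts.items()}
for name, (low, high) in intervals.items():
    k, n = industry_counts[name]
    print(f"{name:<30}: {k}/{n} delayed, 95% CI {low:.0%} to {high:.0%}")
```

The intervals for the one- and two-client industries are very wide, so their 100% delay rates carry little information compared with the banking figure.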

Code

# ============================================================
# PLOT: Project Delay Analysis by Multiple Factors
# Why: Checks whether organisation size or delivery mode is associated with delays
# Story: Compares delay rates across organisation sizes and
#        delivery modes
# ============================================================

fig, axes = plt.subplots(1, 2, figsize=(8, 5))

# --- Subplot 1: Delay Rate by Org Size ---
delay_by_size = df.groupby('Approximate client organisation size')['Did the project experience delays?'].mean().mul(100)
size_order = ['Small', 'Medium', 'Large', 'Enterprise']

axes[0].bar(size_order, [delay_by_size.get(s, 0) for s in size_order], 
            color=['seagreen', 'steelblue', 'coral', 'darkred'], edgecolor='white', linewidth=1.5)
axes[0].set_title('Project Delay Rate by Organisation Size', fontweight='bold', fontsize=12)
axes[0].set_ylabel('Delay Rate (%)')
axes[0].set_xlabel('Organisation Size')
axes[0].set_ylim(0, 100)

for i, s in enumerate(size_order):
    rate = delay_by_size.get(s, 0)
    axes[0].text(i, rate + 3, f'{rate:.0f}%', ha='center', fontweight='bold', fontsize=11)

# --- Subplot 2: Delay Rate by Delivery Mode ---
delay_by_mode = df.groupby('Project delivery mode')['Did the project experience delays?'].mean().mul(100)

bars = axes[1].bar(delay_by_mode.index, delay_by_mode.values,
                   color=['steelblue', 'seagreen', 'purple'], edgecolor='white', linewidth=1.5)
axes[1].set_title('Project Delay Rate by Delivery Mode', fontweight='bold', fontsize=12)
axes[1].set_ylabel('Delay Rate (%)')
axes[1].set_xlabel('Delivery Mode')
axes[1].set_ylim(0, 100)

for bar, rate in zip(bars, delay_by_mode.values):
    axes[1].text(bar.get_x() + bar.get_width()/2, bar.get_height() + 3,
                f'{rate:.0f}%', ha='center', fontweight='bold', fontsize=11)

fig.suptitle('What Drives Project Delays?', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

PLOT INSIGHT:

The first plot indicates that there is no clear connection between organisation size and delay; medium and enterprise organisations, for example, have very similar delay rates.
The second plot shows that fully onsite projects have the lowest delay rate, while fully remote and hybrid projects are the most delay-prone. Only 3 projects were fully onsite, however, so this comparison should be read with caution.

Code

# ============================================================
# PLOT: Duration vs Compliance: Does Longer Mean Better?
# Why: Tests whether extended projects improve outcomes
# Story: Compares compliance scores of on-time and delayed
#        projects across the range of project durations
# ============================================================

fig, ax = plt.subplots(figsize=(8, 5))

# Split by delay status
on_time = df[df['Did the project experience delays?'] == 0]
delayed = df[df['Did the project experience delays?'] == 1]

# On-time projects
ax.scatter(on_time['Total duration of the project (days)'],
           on_time['Overall compliance score (%)'],
           c='seagreen', label='On Time', s=100, alpha=0.7,
           edgecolors='black', linewidth=0.5, marker='o')

# Delayed projects
ax.scatter(delayed['Total duration of the project (days)'],
           delayed['Overall compliance score (%)'],
           c='coral', label='Delayed', s=100, alpha=0.7,
           edgecolors='black', linewidth=0.5, marker='^')

# Trend lines
from numpy.polynomial.polynomial import polyfit
for subset, color, label in [(on_time, 'darkgreen', 'On Time Trend'),
                               (delayed, 'darkred', 'Delayed Trend')]:
    if len(subset) > 2:
        x = subset['Total duration of the project (days)']
        y = subset['Overall compliance score (%)']
        b, m = polyfit(x, y, 1)
        x_range = np.linspace(x.min(), x.max(), 100)
        ax.plot(x_range, b + m * x_range, color=color, linewidth=2, linestyle='--', label=label)

ax.set_title('Project Duration vs Compliance Score by Delay Status', fontsize=14, fontweight='bold')
ax.set_xlabel('Project Duration (Days)', fontsize=11)
ax.set_ylabel('Compliance Score (%)', fontsize=11)
ax.legend(fontsize=10)

# Mean comparison
on_time_mean = on_time['Overall compliance score (%)'].mean()
delayed_mean = delayed['Overall compliance score (%)'].mean()
ax.axhline(y=on_time_mean, color='darkgreen', linestyle=':', alpha=0.5, linewidth=1)
ax.axhline(y=delayed_mean, color='darkred', linestyle=':', alpha=0.5, linewidth=1)

ax.annotate(f'On-Time Mean: {on_time_mean:.1f}%', xy=(350, on_time_mean + 2),
            fontsize=9, color='darkgreen', fontweight='bold')
ax.annotate(f'Delayed Mean: {delayed_mean:.1f}%', xy=(350, delayed_mean - 5),
            fontsize=9, color='darkred', fontweight='bold')

ax.set_xlim(-10, df['Total duration of the project (days)'].max() * 1.1)

plt.tight_layout()
plt.show()

PLOT INSIGHT:

This is an interesting plot: the results show that delay does not necessarily affect the compliance score. A project can be delayed and still achieve a high compliance score. However, delays can contribute to burnout among resource persons, who end up with more projects running simultaneously.
Another observation is that delays can extend project duration, so duration is better viewed as an effect of delay than as a cause.

Hypothesis Testing

Test 1: Is Delay Status Associated with Organisation Size?

H₀: Delay status is independent of client organisation size.

H₁: There is an association between organisation size and whether a project is delayed.

Code

# Contingency table: org size vs delay status
contingency_org = pd.crosstab(
    df['Approximate client organisation size'],
    df['Did the project experience delays?']
)
print("Observed frequencies (Org Size × Delay):")
print(contingency_org)

# Chi‑squared test
chi2_org, p_org, dof_org, expected_org = stats.chi2_contingency(contingency_org)
print(f"\nχ² = {chi2_org:.3f}, df = {dof_org}, p = {p_org:.4f}")

# Effect size (Cramér's V)
n_org = contingency_org.sum().sum()
min_dim_org = min(contingency_org.shape) - 1
cramers_v_org = np.sqrt(chi2_org / (n_org * min_dim_org)) if min_dim_org > 0 else 0
print(f"Cramér's V = {cramers_v_org:.3f}")

Observed frequencies (Org Size × Delay):
Did the project experience delays?     0   1
Approximate client organisation size        
Enterprise                             6  18
Large                                  6  10
Medium                                 6  20
Small                                 10  24

χ² = 1.170, df = 3, p = 0.7603
Cramér's V = 0.108

Conclusions

χ² = 1.170, df = 3, p = 0.760, Cramér’s V = 0.108 (very weak association)

Based on the result above, there is no statistically significant association between a project’s delay status and organisation size, so we fail to reject H₀. Cramér’s V (0.108) indicates a very weak level of association between the variables.
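As a cross-check on the reported statistics, the sketch below recomputes χ² and Cramér’s V by hand from the observed frequencies printed above, using only the standard library. It is a verification aid, not the method used in the analysis (which calls `stats.chi2_contingency`); variable names are introduced here for illustration.

```python
import math

# Observed frequencies copied from the contingency table above:
# (on-time, delayed) per organisation size
observed = {
    "Enterprise": (6, 18),
    "Large": (6, 10),
    "Medium": (6, 20),
    "Small": (10, 24),
}

n = sum(sum(row) for row in observed.values())
col_totals = [sum(row[j] for row in observed.values()) for j in range(2)]

# Pearson chi-squared: sum over cells of (observed - expected)^2 / expected
chi2 = 0.0
for row in observed.values():
    row_total = sum(row)
    for j, obs in enumerate(row):
        expected = row_total * col_totals[j] / n
        chi2 += (obs - expected) ** 2 / expected

# Cramér's V = sqrt(chi2 / (n * (min(rows, cols) - 1)))
min_dim = min(len(observed), len(col_totals)) - 1
cramers_v = math.sqrt(chi2 / (n * min_dim))

print(f"chi2 = {chi2:.3f}, Cramér's V = {cramers_v:.3f}")
```

Because the table has two columns, `min_dim` is 1 and Cramér’s V reduces to the square root of χ²/n.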

Test 2: Does Client Responsiveness Differ Between On‑Time and Delayed Projects?

H₀: Mean client responsiveness is equal for on‑time and delayed projects.

H₁: Mean responsiveness differs between the two groups.

Code

on_time_resp = df[df['Did the project experience delays?'] == 0]['Client responsiveness level'].dropna()
delayed_resp = df[df['Did the project experience delays?'] == 1]['Client responsiveness level'].dropna()

print(f"On‑time: n = {len(on_time_resp)}, mean responsiveness = {on_time_resp.mean():.2f}, SD = {on_time_resp.std():.2f}")
print(f"Delayed:  n = {len(delayed_resp)}, mean responsiveness = {delayed_resp.mean():.2f}, SD = {delayed_resp.std():.2f}")

# Normality check
print("\nNormality (Shapiro‑Wilk):")
for name, group in [('On‑time', on_time_resp), ('Delayed', delayed_resp)]:
    stat, p = stats.shapiro(group)
    print(f"  {name}: p = {p:.4f} {'→ Normal' if p > 0.05 else '→ Non‑normal'}")

# Equal variance
stat_lev, p_lev = stats.levene(on_time_resp, delayed_resp)
print(f"\nLevene's test: p = {p_lev:.4f}")
equal_var = p_lev > 0.05

# t‑test
t_stat, p_ttest = stats.ttest_ind(on_time_resp, delayed_resp, equal_var=equal_var)
print(f"\nt‑test: t = {t_stat:.4f}, p = {p_ttest:.4f}")

# Effect size (Cohen's d)
# Note: this averages the two group variances without weighting by group size
pooled_std = np.sqrt((on_time_resp.std()**2 + delayed_resp.std()**2) / 2)
cohens_d = (on_time_resp.mean() - delayed_resp.mean()) / pooled_std
print(f"Cohen's d = {cohens_d:.3f}")

# Mann‑Whitney U
u_stat, p_mw = stats.mannwhitneyu(on_time_resp, delayed_resp, alternative='two-sided')
print(f"Mann‑Whitney U: p = {p_mw:.4f}")

On‑time: n = 28, mean responsiveness = 3.82, SD = 0.77
Delayed:  n = 72, mean responsiveness = 3.39, SD = 0.81

Normality (Shapiro‑Wilk):
  On‑time: p = 0.0002 → Non‑normal
  Delayed: p = 0.0000 → Non‑normal

Levene's test: p = 0.3168

t‑test: t = 2.4180, p = 0.0175
Cohen's d = 0.545
Mann‑Whitney U: p = 0.0088

Conclusions

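Based on these findings, the t-test indicates that the difference in responsiveness is statistically significant (t = 2.418, p = 0.018) with a medium effect size (Cohen’s d = 0.545). Because the Shapiro‑Wilk test flags both groups as non-normal, the Mann‑Whitney U test is the safer reference, and it agrees (p = 0.009). Of all the findings, this was the most unanticipated. On-time projects averaged 3.82 on the responsiveness scale compared with 3.39 for delayed projects, an unadjusted difference of 0.43. Projects with less responsive clients therefore appear more likely to be delayed.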
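The effect size can be approximately reproduced from the rounded summary statistics printed above. The sketch below compares the unweighted pooled standard deviation used in the analysis code with the conventional version weighted by degrees of freedom; with groups of 28 and 72 projects the two differ only slightly. Because the inputs are rounded to two decimals, the results are approximate.

```python
import math

# Summary statistics copied from the output above (rounded to two decimals)
n_on, mean_on, sd_on = 28, 3.82, 0.77
n_del, mean_del, sd_del = 72, 3.39, 0.81

mean_diff = mean_on - mean_del

# Unweighted pooled SD, as in the analysis code
pooled_unweighted = math.sqrt((sd_on**2 + sd_del**2) / 2)
d_unweighted = mean_diff / pooled_unweighted

# Pooled SD weighted by degrees of freedom
pooled_weighted = math.sqrt(
    ((n_on - 1) * sd_on**2 + (n_del - 1) * sd_del**2) / (n_on + n_del - 2)
)
d_weighted = mean_diff / pooled_weighted

print(f"Cohen's d (unweighted pooled SD): {d_unweighted:.3f}")
print(f"Cohen's d (weighted pooled SD):   {d_weighted:.3f}")
```

Either way the effect remains in the medium range, so the choice of pooling does not change the conclusion.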

Recommendation: We should revise our SLA to put in a client responsiveness clause, which benefits both parties. The client receives an improved, faster assessment, and we significantly decrease our risk of delay.

Correlation Analysis

Knowing which numeric project characteristics are correlated with delays helps the advisory practice focus its attention. If a variable (e.g., client responsiveness) shows a strong negative correlation with delays, then tracking and improving that variable becomes an evidence‑based strategy. Alternatively, if a variable (e.g., project duration) is only weakly correlated, we can stop treating it as a delay risk factor.

Setup

Code

print(f"Dataset: {df.shape[0]} rows × {df.shape[1]} columns")

# List of continuous variables
delay_predictors = [
    'Client responsiveness level',
    'Security maturity of client at project start',
    'Total number of vulnerabilities identified',
    'Number of critical/high vulnerabilities',
    'Total duration of the project (days)',
    'Total number of consultants on the project',
    'Approximate Number of meetings held with the client during the project',
    'Number of systems and/or applications in scope'
]

# Step 1: make a safe copy with only the needed columns + delay
tmp = df[delay_predictors + ['Did the project experience delays?']].copy()

# Step 2: force all predictor columns to numeric (strings become NaN)
for col in delay_predictors:
    tmp[col] = pd.to_numeric(tmp[col], errors='coerce')

# Step 3: drop rows where any predictor or the delay status is missing
tmp.dropna(inplace=True)

# Step 4: extract the clean binary delay and predictors
delay_binary = tmp['Did the project experience delays?'].astype(int)

corr_results = []
for col in delay_predictors:
    x = tmp[col]
    y = delay_binary
    r, p = pearsonr(x, y)
    corr_results.append({
        'Variable': col,
        'Correlation (r)': round(r, 3),
        'p‑value': round(p, 4)
    })

corr_df = pd.DataFrame(corr_results).sort_values('Correlation (r)', key=abs, ascending=False)
print("Point‑Biserial Correlations with Delay Status (0 = On‑time, 1 = Delayed):")
print(corr_df.to_string(index=False))

Dataset: 100 rows × 17 columns
Point‑Biserial Correlations with Delay Status (0 = On‑time, 1 = Delayed):
                                                              Variable  Correlation (r)  p‑value
                          Security maturity of client at project start           -0.255   0.0105
                                           Client responsiveness level           -0.237   0.0175
                        Number of systems and/or applications in scope            0.112   0.2676
                               Number of critical/high vulnerabilities            0.042   0.6747
Approximate Number of meetings held with the client during the project            0.033   0.7408
                                  Total duration of the project (days)            0.027   0.7919
                            Total number of consultants on the project            0.021   0.8366
                            Total number of vulnerabilities identified           -0.006   0.9555

Conclusions

Interpretation of Correlation Results: The table ranks the variables by the strength of their correlation with delay status. Only two are statistically significant: client security maturity and client responsiveness level. A negative correlation means higher values are associated with fewer delays, as seen for security maturity and client responsiveness, while a positive correlation means higher values are associated with more delays.

Security maturity (r = −0.255, p = 0.011): Clients who start with better security practices tend to experience fewer delays. A mature client will already have controls, policies, and personnel in place, which reduces the problems that cause delays.

Client responsiveness (r = −0.237, p = 0.018): This is the behavioural finding already identified by our second hypothesis test. Responsive clients provide evidence and close findings faster, directly reducing delay risk. Because a point‑biserial correlation with a two-group variable is mathematically equivalent to the equal-variance t‑test (both give p = 0.0175), this is the same result expressed as a correlation rather than an independent confirmation.

Project duration (r = +0.027, p = 0.792): Duration shows almost no linear correlation with delay status in this sample, so longer projects were not noticeably more likely to be recorded as delayed. The remaining variables, including scope (r = +0.112, p = 0.268), are also not statistically significant.
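The link between the responsiveness correlation and the earlier t-test can be checked with a standard identity: for a two-group comparison with an equal-variance t-test, |r| = |t| / sqrt(t² + df), where df = n − 2. The sketch below applies it to the t statistic reported in Test 2; the variable names are introduced here for illustration.

```python
import math

# Values copied from the Test 2 output above
t_stat = 2.4180      # equal-variance t-test (on-time minus delayed)
n_total = 28 + 72    # on-time + delayed projects
df = n_total - 2

# Magnitude of the point-biserial correlation implied by the t statistic
r_magnitude = abs(t_stat) / math.sqrt(t_stat**2 + df)

# The sign is negative because delayed projects (coded 1) have lower responsiveness
r_point_biserial = -r_magnitude

print(f"r implied by t-test = {r_point_biserial:.3f}")
```

The implied value matches the −0.237 in the correlation table, which is why the two p-values coincide.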

Regression Analysis: What Predicts Project Delays?

Prepare the Data

We use the two behavioural predictors that were significant in the correlation analysis: client responsiveness (also significant in the hypothesis test) and security maturity. Both are measurable early in the project. Project duration is excluded because a delayed project will naturally run longer, which makes duration a consequence of delay rather than a predictor.

Because the dataset is small, the model is evaluated using the likelihood ratio test, pseudo‑R², classification accuracy, and the area under the ROC curve (AUC). There is no train/test split, as the model is intended for statistical analysis and exploration rather than prediction.

Code

model_data = df[[
    'Did the project experience delays?',
    'Client responsiveness level',
    'Security maturity of client at project start'
]].dropna()

model_data.columns = ['delay', 'responsiveness', 'security_maturity']
model_data['delay'] = model_data['delay'].astype(int)

print(f"Modelling dataset: {len(model_data)} projects")
print(model_data['delay'].value_counts().to_string())

Modelling dataset: 100 projects
delay
1    72
0    28

Fit the Logistic Regression Model

Code

X = model_data[['responsiveness', 'security_maturity']]
X = sm.add_constant(X)
y = model_data['delay']

logit_model = sm.Logit(y, X).fit(disp=False)
print(logit_model.summary())

                           Logit Regression Results                           
==============================================================================
Dep. Variable:                  delay   No. Observations:                  100
Model:                          Logit   Df Residuals:                       97
Method:                           MLE   Df Model:                            2
Date:                Tue, 19 May 2026   Pseudo R-squ.:                 0.08452
Time:                        23:22:38   Log-Likelihood:                -54.284
converged:                       True   LL-Null:                       -59.295
Covariance Type:            nonrobust   LLR p-value:                  0.006660
=====================================================================================
                        coef    std err          z      P>|z|      [0.025      0.975]
-------------------------------------------------------------------------------------
const                 4.6732      1.329      3.515      0.000       2.068       7.279
responsiveness       -0.5408      0.303     -1.783      0.075      -1.135       0.054
security_maturity    -0.5676      0.285     -1.990      0.047      -1.127      -0.009
=====================================================================================
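Two of the headline figures in the summary table follow directly from the two log-likelihoods it reports. The sketch below reproduces McFadden’s pseudo‑R² and the likelihood ratio test p-value by hand. It relies on the fact that the chi-squared survival function with 2 degrees of freedom is exactly exp(−x/2), so no statistics library is needed; the shortcut applies only because this model has two predictors.

```python
import math

# Log-likelihoods copied from the Logit summary above
ll_model = -54.284   # fitted model (responsiveness + security maturity)
ll_null = -59.295    # intercept-only model
df_model = 2         # number of predictors

# McFadden's pseudo R-squared
pseudo_r2 = 1 - ll_model / ll_null

# Likelihood ratio statistic, chi-squared distributed with df_model degrees of freedom
lr_stat = 2 * (ll_model - ll_null)

# For exactly 2 degrees of freedom the chi-squared survival function is exp(-x / 2)
llr_p_value = math.exp(-lr_stat / 2)

print(f"Pseudo R² = {pseudo_r2:.4f}")
print(f"LR statistic = {lr_stat:.3f}, p = {llr_p_value:.5f}")
```

Small differences in the last digit relative to the table come from the log-likelihoods being rounded to three decimals.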

Odds Ratios – Quantifying the Impact

Code

# Compute odds ratios and confidence intervals
params = logit_model.params
conf = logit_model.conf_int()
conf.columns = ['2.5%', '97.5%']

odds_ratios = np.exp(params)
conf_or = np.exp(conf)

print("ODDS RATIOS (OR) — Impact on Delay Probability")
print("=" * 55)

for var in params.index:
    if var == 'const':
        continue
    or_val = odds_ratios[var]
    ci_low = conf_or.loc[var, '2.5%']
    ci_high = conf_or.loc[var, '97.5%']
    p_val = logit_model.pvalues[var]
    stars = '***' if p_val < 0.001 else ('**' if p_val < 0.01 else ('*' if p_val < 0.05 else ''))
    direction = "decreases" if or_val < 1 else "increases"
    change = abs((1 - or_val) * 100)
    print(f"{var}:")
    print(f"  OR = {or_val:.3f} (95% CI: {ci_low:.3f} – {ci_high:.3f}) {stars}")
    print(f"  A 1‑unit increase {direction} the odds of delay by {change:.1f}%")
    print()

ODDS RATIOS (OR) — Impact on Delay Probability
=======================================================
responsiveness:
  OR = 0.582 (95% CI: 0.321 – 1.055) 
  A 1‑unit increase decreases the odds of delay by 41.8%

security_maturity:
  OR = 0.567 (95% CI: 0.324 – 0.991) *
  A 1‑unit increase decreases the odds of delay by 43.3%

Interpretation: Based on the odds ratios, security maturity appears to be the most influential factor: each one-point increase is associated with approximately 43% lower odds of delay and, unlike responsiveness, the effect is statistically significant in the full model (p = 0.047). The responsiveness estimate is similar in size (about 42% lower odds per point), but its 95% confidence interval (0.321 to 1.055) includes 1, so it falls short of significance at the 5% level. Overall, clients with higher security maturity and greater responsiveness are less likely to have a delayed project.
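Odds ratios can be hard to picture, so the sketch below turns the fitted coefficients into predicted delay probabilities for two client profiles. The coefficients are copied from the summary table; the two profiles, and the assumption that both predictors are scored on a 1 to 5 scale, are hypothetical and chosen only for illustration.

```python
import math

# Coefficients copied from the Logit summary above
coef = {"const": 4.6732, "responsiveness": -0.5408, "security_maturity": -0.5676}

def delay_probability(responsiveness, security_maturity):
    """Predicted probability of delay from the fitted logistic model."""
    logit = (coef["const"]
             + coef["responsiveness"] * responsiveness
             + coef["security_maturity"] * security_maturity)
    return 1 / (1 + math.exp(-logit))

# Odds ratio = exp(coefficient)
odds_ratios = {name: math.exp(b) for name, b in coef.items() if name != "const"}

# Hypothetical client profiles (assuming 1-5 scales for both predictors)
p_average = delay_probability(responsiveness=3, security_maturity=3)
p_strong = delay_probability(responsiveness=5, security_maturity=5)

print(f"Odds ratios: {odds_ratios}")
print(f"Average client (3, 3): {p_average:.1%} predicted chance of delay")
print(f"Strong client  (5, 5): {p_strong:.1%} predicted chance of delay")
```

These probabilities inherit all the uncertainty in the coefficients, so they are best read as a rough ordering of risk rather than as forecasts.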

Model Performance

Code

y_pred_prob = logit_model.predict(X)
y_pred_class = (y_pred_prob >= 0.5).astype(int)

cm = confusion_matrix(y, y_pred_class)
print("CONFUSION MATRIX (threshold = 0.5):")
print(pd.DataFrame(cm, index=['Actual On‑time', 'Actual Delayed'],
                   columns=['Predicted On‑time', 'Predicted Delayed']))

print("\nClassification Report:")
print(classification_report(y, y_pred_class, target_names=['On‑time', 'Delayed']))

print(f"McFadden's Pseudo R²: {logit_model.prsquared:.3f}")

CONFUSION MATRIX (threshold = 0.5):
                Predicted On‑time  Predicted Delayed
Actual On‑time                  4                 24
Actual Delayed                  3                 69

Classification Report:
              precision    recall  f1-score   support

     On‑time       0.57      0.14      0.23        28
     Delayed       0.74      0.96      0.84        72

    accuracy                           0.73       100
   macro avg       0.66      0.55      0.53       100
weighted avg       0.69      0.73      0.67       100

McFadden's Pseudo R²: 0.085

Interpretation: The model is essentially asking: can these client-related factors (client responsiveness and security maturity) help explain or predict project delays?

The confusion matrix indicates that the model is good at detecting delayed projects (recall 0.96) but poor at identifying on-time projects (recall 0.14). Its overall accuracy of 73% is only marginally above the 72% that would be achieved by labelling every project as delayed.

The class imbalance (72 delayed versus 28 on-time projects) likely contributed to the model favouring the majority class. Further work could address this, for example by employing SMOTE to oversample the underrepresented on-time class.
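The sketch below derives the key metrics directly from the four confusion matrix counts, and adds two reference points that are not in the classification report: the accuracy of always predicting "Delayed" and the balanced accuracy (the mean of the two recalls). Variable names are introduced here for illustration.

```python
# Counts copied from the confusion matrix above (threshold = 0.5)
tn, fp = 4, 24   # actual on-time: predicted on-time, predicted delayed
fn, tp = 3, 69   # actual delayed: predicted on-time, predicted delayed

total = tn + fp + fn + tp

accuracy = (tp + tn) / total
recall_delayed = tp / (tp + fn)       # share of delayed projects detected
recall_on_time = tn / (tn + fp)       # share of on-time projects detected
balanced_accuracy = (recall_delayed + recall_on_time) / 2

# Accuracy of a trivial rule that labels every project "Delayed"
majority_baseline = (tp + fn) / total

print(f"Accuracy:           {accuracy:.2f}")
print(f"Majority baseline:  {majority_baseline:.2f}")
print(f"Recall (delayed):   {recall_delayed:.2f}")
print(f"Recall (on-time):   {recall_on_time:.2f}")
print(f"Balanced accuracy:  {balanced_accuracy:.2f}")
```

On an imbalanced sample like this one, balanced accuracy is arguably a fairer summary than raw accuracy, because it gives the on-time class equal weight.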

Code

fpr, tpr, thresholds = roc_curve(y, y_pred_prob)
roc_auc = auc(fpr, tpr)

fig, ax = plt.subplots(figsize=(6, 4))
ax.plot(fpr, tpr, color='darkorange', lw=2, label=f'ROC curve (AUC = {roc_auc:.3f})')
ax.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--', label='Random chance')
ax.set_xlabel('False Positive Rate')
ax.set_ylabel('True Positive Rate')
ax.set_title('ROC Curve – Predicting Project Delays', fontweight='bold')
ax.legend(loc='lower right')
plt.tight_layout()
plt.show()

Interpretation: The AUC of approximately 0.69 indicates that the model has a moderate ability to tell delayed and on-time projects apart. There is still considerable overlap between the two groups, so it is not highly accurate in distinguishing them.
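One way to read the AUC is as the probability that a randomly chosen delayed project receives a higher predicted score than a randomly chosen on-time project. The sketch below shows that pairwise definition on a handful of hypothetical scores; these numbers are invented for illustration and are not taken from the project data.

```python
def pairwise_auc(scores_positive, scores_negative):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties count as half a correct ranking.
    """
    wins = 0.0
    for sp in scores_positive:
        for sn in scores_negative:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5
    return wins / (len(scores_positive) * len(scores_negative))

# Hypothetical predicted delay probabilities (illustrative only)
delayed_scores = [0.90, 0.80, 0.60, 0.55]
on_time_scores = [0.70, 0.50, 0.40]

toy_auc = pairwise_auc(delayed_scores, on_time_scores)
print(f"Toy AUC = {toy_auc:.3f}")
```

On this reading, an AUC of about 0.69 means the model ranks a delayed project above an on-time one in roughly 7 out of 10 such pairs.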

Diagnostic Plots

Code

influence = logit_model.get_influence()
# Use studentized residuals (available) instead of deviance residuals
resid_stud = influence.resid_studentized
leverage = influence.hat_matrix_diag
cooks = influence.cooks_distance[0]

fig, axes = plt.subplots(1, 2, figsize=(10, 5))

# Studentized residuals vs leverage
axes[0].scatter(leverage, resid_stud, alpha=0.6, edgecolors='black', linewidth=0.3)
axes[0].axhline(y=0, color='red', linestyle='--', linewidth=1)
axes[0].set_xlabel('Leverage')
axes[0].set_ylabel('Studentized Residuals')
axes[0].set_title('Studentized Residuals vs Leverage', fontweight='bold')

# Cook's distance
axes[1].scatter(np.arange(len(cooks)), cooks, alpha=0.6, edgecolors='black', linewidth=0.3)
axes[1].axhline(y=4/len(model_data), color='red', linestyle='--',
                label=f'4/n = {4/len(model_data):.3f}')
axes[1].set_xlabel('Observation Index')
axes[1].set_ylabel("Cook's Distance")
axes[1].set_title("Cook's Distance for Influential Points", fontweight='bold')
axes[1].legend()

plt.tight_layout()
plt.show()

Interpretation: The diagnostic plots indicate that most observations have low leverage and a low Cook’s distance. Most projects therefore have little individual influence on the model, and the results are not being driven by a single extreme project.

Summary for a Non‑Technical Manager

We developed a model that estimates which cybersecurity assessment projects are most likely to be delayed, based on two indicators that are available early in the engagement: the initial maturity of the client’s security posture and the initial responsiveness of the client.

What the model indicates:

The model is useful rather than perfect. It is better than random (AUC 0.69) at identifying projects with delays, but some delays happen for reasons we have not measured.

Bottom line: delays are neither random nor simply a function of whether a project is remote or onsite. They are driven by client characteristics that are visible and, to an extent, manageable by us. The model provides metrics with which we can justify additional investment in client preparation before the project and during the engagement.

Integrated Findings

How the Five Analyses Fit Together

This project investigated the variables associated with project delays using a combination of exploratory data analysis, data visualisation, hypothesis testing, correlation analysis, and logistic regression. The purpose was to identify which variables were significantly associated with a project finishing “on time” or “late”.

On the whole, the results indicate that client-related characteristics matter more for project delays than technical or structural project characteristics. Specifically, two client-related characteristics, client responsiveness and client security maturity, were the only variables consistently associated with delay status across the inferential analyses.

The correlation analysis revealed that both client responsiveness and security maturity are negatively correlated with delay, indicating that prepared and engaged clients are less likely to experience delays. Meanwhile, the technical factors, including number of vulnerabilities, scope, and team size, are not significantly correlated with delay.

This trend was also supported by the hypothesis testing results. While there was no statistically significant association for organisation size (p = 0.760), there was a statistically significant difference in client responsiveness (p = 0.018), with more responsive clients associated with on-time projects.

The logistic regression synthesised these results by showing how each variable relates to the probability of delay when the other is held constant. Both responsiveness and security maturity were associated with decreased odds of delay, but only security maturity was statistically significant in the full model. Our model improved on the univariate regressions, if only slightly, with a final accuracy of 73% and a McFadden’s pseudo‑R² of 0.085. This low pseudo‑R² indicates that other, unmeasured factors may further predict project delay.

In conclusion, the results indicate that client readiness and engagement are likely to be important determinants of project outcomes, over and above the technical workload of the engagement. The model is not complete, however, and further work could test other relevant factors and add more balanced data to improve both explanation and prediction. Nonetheless, this is practically useful knowledge, as it suggests one way to reduce delays in real delivery contexts: developing the client’s capacity to be responsive and ready for the assessment.

The Single Recommendation

All five methods point to the same finding: the most consistent, actionable predictors of whether a cybersecurity assessment project will run behind schedule are the client’s security maturity before the engagement and its responsiveness during it. Structural elements such as delivery mode, organisation size, and project scope proved inconsistent predictors of whether deliverables ended up late.

Limitations and Further Work

Limitations

Data limitations

This analysis used a small dataset (100 projects) that was also imbalanced, with more “Delayed” than “On-time” observations (72 versus 28). This imbalance may have made the model more likely to choose the majority class and therefore to misclassify on-time projects.

Model limitations

The model assumes that each one-point change in client responsiveness or security maturity shifts the log-odds of delay by a constant amount. This is a simplification, because projects might react non-linearly to changes in these variables. Some important non-linear effects may not yet be built into the model.

Because of data scarcity, the model was fitted and evaluated on a single dataset and was not validated on separate data. We therefore cannot be sure how it would perform on new or unseen projects, and its conclusions should be treated as exploratory rather than predictive.

Further Work

For subsequent work, I intend to collect more data so that the model can be trained on a larger sample. I believe a more balanced dataset of delayed versus on-time projects would help the model predict on-time cases much better and reduce false positives.

I also intend to include additional predictors. For example, whether the client is a first-time or returning client may affect the duration of the project.

References

Adi, B. (2026). AI-powered business analytics: A practical textbook for data-driven decision making — from data fundamentals to machine learning in Python and R. Lagos Business School / markanalytics.online. https://markanalytics.online

Van Rossum, G., & Drake, F. L. (2009). Python 3 reference manual. CreateSpace. (Version 3.14.5)

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., & Duchesnay, É. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12, 2825–2830.

McKinney, W. (2010). Data structures for statistical computing in Python. In Proceedings of the 9th Python in Science Conference (pp. 56–61). https://doi.org/10.25080/Majora-92bf1922-00a

Allaire, J. J., Teague, C., Scheidegger, C., Xie, Y., & Dervieux, C. (2022). Quarto (Version 1.9.37) [Computer software]. https://doi.org/10.5281/zenodo.5960048

Hunter, J. D. (2007). Matplotlib: A 2D graphics environment. Computing in Science & Engineering, 9(3), 90–95. https://doi.org/10.1109/MCSE.2007.55 (Version 3.10.9)

Seabold, S., & Perktold, J. (2010). Statsmodels: Econometric and statistical modeling with Python. In Proceedings of the 9th Python in Science Conference. http://conference.scipy.org/proceedings/scipy2010/pdfs/seabold.pdf (Version 0.14.6)

Oladejo, B. (2026). Drivers of Project Duration and Security Risk in Cybersecurity Compliance Projects in Nigeria and Africa. Administered to Consultants of Digital Encode Ltd, May 2026. Ethical clearance: dataset not publicly available due to confidentiality requirements.

Appendix: AI Usage Statement

In this case study I used an AI coding assistant (ChatGPT 4.0) to aid in generating, debugging and formatting Python code, especially the data cleaning functions, statistical tests and visualisations.

I independently made all analytical decisions: which case study to analyse (Case Study 1: Exploratory & Inferential Analytics), which business questions to answer, which features to select for the correlation analysis, the hypothesis tests and the logistic regression model, which key business problems to focus on, which predictors were influential based on my exploratory analysis, how to interpret all the statistical outputs (p-values, model diagnostics), and which concrete, actionable business recommendations to produce. I double-checked each line of code and each graph, checked the resulting outputs against the original file, and I stand behind the submitted work.