Predicting Active Employment Outcomes in the Employment Skills Programme (ESP)

1. Executive Summary

This study analyses data from the Employment Skills Programme (ESP) of a state in Nigeria, a vocational training and job placement initiative that enrolled 2,006 participants across four batches between June 2021 and November 2022. The central business question is: which participant profiles and training categories are most likely to result in active employment, and how can this knowledge make batch design, trainer allocation, and placement partnership decisions more data-driven?

Data were extracted from the programme’s internal monitoring and evaluation database, covering 25 variables including demographic characteristics, training track, mentoring status, certification outcome, placement status, and final employment status. Five analytical techniques were applied: Exploratory Data Analysis (EDA), Visualisation, Hypothesis Testing, Correlation Analysis, and Logistic Regression. Key findings reveal that training category, educational level, and mentoring status are statistically significant predictors of active employment. Participants in Information Technology and Business Support tracks, those with tertiary education, and those who received mentoring consistently show higher rates of active employment. The integrated recommendation is that future batches should prioritise mentoring coverage for all participants and increase investment in IT and Business Support training tracks, while targeted support should be designed for secondary-school-educated participants to close the employment gap.

2. Professional Disclosure

Job Title: Director of Programmes
Organisation/Sector: PaTiTi Consulting | Skills Development / Social Impact

Exploratory Data Analysis (EDA): As Director of Programmes, understanding who is enrolling in the programme before making design decisions is foundational. EDA allows me to monitor whether the programme is reaching its intended demographic — young, low-income Lagos residents — and to detect shifts in participant profiles across batches that would require a programmatic response. Without this diagnostic layer, decisions on recruitment targeting and resource allocation are based on assumption rather than evidence.

Visualisation: Reporting to funders, government partners, and the board requires communicating programme performance clearly and efficiently. Visualisation of placement rates by training category, gender, and LGA gives me a dashboard-ready view that directly informs which tracks to scale, discontinue, or redesign in future batches. It also enables me to present evidence to partners in a form that drives decisions rather than simply informs them.

Hypothesis Testing: A recurring operational question I face is whether observed differences in outcomes — for example, whether mentored participants place at higher rates than unmentored ones, or whether one training category outperforms another — are statistically real or simply sampling noise. Hypothesis testing provides the rigorous basis needed to act on those differences, justify resource reallocation, and defend programmatic changes to stakeholders.

Correlation Analysis: As Director, I allocate mentoring resources and decide which participant segments receive intensive support. Correlation analysis between age, educational level, training duration, and employment outcomes allows me to test whether those targeting assumptions are evidence-based or inherited from previous programme designs. Understanding which variables move together informs more precise targeting in future batches.

Logistic Regression: The most consequential decision I make each cycle is which applicant profiles to prioritise during batch recruitment. A logistic regression model identifying the strongest predictors of active employment provides a defensible, data-driven scoring framework for future selection decisions — one that can be presented to government partners and donors as evidence of systematic programme improvement.

3. Data Collection & Sampling

Source: Employment Skills Programme (ESP) internal monitoring and evaluation (M&E) database, maintained by the Directorate of Programmes.

Collection Method: The dataset was extracted directly from the programme’s management information system (MIS) by the Director of Programmes. No third-party data was used. The extract covers all enrolled participants across Batch 1 through Batch 4, including a sub-cohort designated Batch 4.2.

Sampling Frame: The full population of ESP participants registered between June 2021 and November 2022. This is a census extract, not a sample — every enrolled participant is represented, making it a complete administrative dataset.

Sample Size: 2,006 observations (participants) across 25 variables.

Time Period: June 2021 to November 2022 (approximately 17 months), covering four programme batches.

Variables: The dataset includes batch identifier, disability status (PWD), mentoring status, enrollment ID, qualification, educational level, age, age bracket, returnee status, previous country, LGA of residence, marital status, gender, training category and sub-category, programme start and end dates, training status, placement status, company placed with, reason if declined, date of employment, employment type, employment status, and job formality classification.

Ethical & Consent Notes: All participant data was collected with informed consent as part of the programme enrolment process. Personally identifiable information (names, national ID numbers, phone numbers) has been removed from this analytical dataset. Enrollment IDs serve as anonymised participant identifiers. The dataset is stored on an encrypted organisational drive and is not shared beyond the immediate analytical team. This analysis has been approved for academic purposes under the programme’s data governance policy.

Data Quality Issues Identified: 1. Missing Employment Status (350 rows): Approximately 17% of participants have no recorded employment status. These are predominantly participants still within the programme cycle at the time of data extraction or those who became unreachable during follow-up. These records are excluded from the regression and hypothesis testing analyses but retained in EDA counts, with the exclusion documented. 2. Age Outlier: One participant is recorded as age 1, which is clearly a data entry error. This record is excluded from age-related analyses.
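The two checks described above can be sketched in a few lines. This is a minimal illustration on invented records, not the programme data: the field names mirror the dataset, while the helper names `clean_status` and `implausible_age` are hypothetical, and the age threshold of 5 is assumed from the filter applied when the data are loaded.

```python
# Hypothetical records; all values are invented for illustration only.
records = [
    {"Enrollment_ID": "A001", "Age": 29, "Employment_Status": "Active"},
    {"Enrollment_ID": "A002", "Age": 1, "Employment_Status": "Inactive"},
    {"Enrollment_ID": "A003", "Age": 34, "Employment_Status": ""},
    {"Enrollment_ID": "A004", "Age": None, "Employment_Status": "   "},
]


def clean_status(value):
    """Return None for blank or missing statuses, else the stripped text."""
    if value is None or not value.strip():
        return None
    return value.strip()


def implausible_age(age, minimum=5):
    """Flag recorded ages at or below the minimum; missing ages are not flagged."""
    return age is not None and age <= minimum


# Participants with no usable employment status (blank strings count as missing).
missing_status = [r["Enrollment_ID"] for r in records
                  if clean_status(r["Employment_Status"]) is None]

# Participants whose recorded age looks like a data entry error.
age_errors = [r["Enrollment_ID"] for r in records if implausible_age(r["Age"])]

print("No recorded employment status:", missing_status)
print("Implausible age:", age_errors)
```

Converting blank strings to missing values at this stage matters because a blank is otherwise counted as a third employment-status category in any later cross-tabulation.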

4. Data Description

Code

# Load libraries
library(tidyverse)
library(skimr)
library(knitr)
library(kableExtra)
library(lubridate)

# Load data
df <- read.csv("Data.csv", 
               skip = 1, 
               fileEncoding = "latin1",
               stringsAsFactors = FALSE)

# Clean column names
colnames(df) <- c("Batch", "PWD", "Mentoring", "Enrollment_ID", "Qualification",
                  "Educational_Level", "Age", "Age_Bracket", "Returnee",
                  "Previous_Country", "LGA", "Marital_Status", "Gender",
                  "Training_Category", "Training_Sub_Category", "Start_Date",
                  "End_Date", "Training_Status", "Placement_Status", "Company_Name",
                  "Reason_Declined", "Date_Employment", "Employment_Type",
                  "Employment_Status", "Job_Tagged")

# Parse dates
df$Start_Date <- dmy(df$Start_Date)
df$End_Date   <- dmy(df$End_Date)
df$Duration_Days <- as.numeric(df$End_Date - df$Start_Date)

# Remove obvious age error
df <- df %>% filter(is.na(Age) | Age > 5)

# Summary table of key variables
df %>%
  select(Age, Duration_Days) %>%
  summary() %>%
  kable(caption = "Summary Statistics — Numeric Variables") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Summary Statistics — Numeric Variables
	Age	Duration_Days
	Min. :17.00	Min. : 17.00
	1st Qu.:25.00	1st Qu.: 28.00
	Median :30.00	Median : 56.00
	Mean :30.01	Mean : 56.55
	3rd Qu.:35.00	3rd Qu.: 85.00
	Max. :66.00	Max. :109.00
	NA's :1	NA's :5

Code

# Categorical summaries
cat_summary <- data.frame(
  Variable = c("Gender", "Educational Level", "Training Status",
               "Placement Status", "Employment Status", "Mentoring"),
  Categories = c(
    paste(names(table(df$Gender)), collapse = " | "),
    paste(names(table(df$Educational_Level)), collapse = " | "),
    paste(names(table(df$Training_Status)), collapse = " | "),
    paste(names(table(df$Placement_Status)), collapse = " | "),
    paste(names(table(df$Employment_Status)), collapse = " | "),
    paste(names(table(df$Mentoring)), collapse = " | ")
  ),
  N_Valid = c(
    sum(!is.na(df$Gender)),
    sum(!is.na(df$Educational_Level)),
    sum(!is.na(df$Training_Status)),
    sum(!is.na(df$Placement_Status)),
    sum(!is.na(df$Employment_Status)),
    sum(!is.na(df$Mentoring))
  )
)

cat_summary %>%
  kable(caption = "Categorical Variable Overview") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Categorical Variable Overview
Variable	Categories	N_Valid
Gender	Female | Male	2005
Educational Level	Secondary | Tertiary	2005
Training Status	Certified | Dropped | Not Certified	2005
Placement Status	Available | Not Available | Placed | Self Employed	2005
Employment Status	(blank) | Active | Inactive	2005
Mentoring	Mentored | Not Mentored	2005

Code

import pandas as pd
import numpy as np

# Load data
df = pd.read_csv("Data.csv", skiprows=1, encoding="latin1")
df.columns = ["Batch","PWD","Mentoring","Enrollment_ID","Qualification",
              "Educational_Level","Age","Age_Bracket","Returnee",
              "Previous_Country","LGA","Marital_Status","Gender",
              "Training_Category","Training_Sub_Category","Start_Date",
              "End_Date","Training_Status","Placement_Status","Company_Name",
              "Reason_Declined","Date_Employment","Employment_Type",
              "Employment_Status","Job_Tagged"]

# Parse dates and calculate duration
df["Start_Date"] = pd.to_datetime(df["Start_Date"], dayfirst=True, errors="coerce")
df["End_Date"]   = pd.to_datetime(df["End_Date"],   dayfirst=True, errors="coerce")
df["Duration_Days"] = (df["End_Date"] - df["Start_Date"]).dt.days

# Remove age outlier (this comparison also drops the one row with a missing Age)
df = df[df["Age"] > 5]

# Numeric summary
numeric_summary = df[["Age","Duration_Days"]].describe().round(2)
print("=== Numeric Variable Summary ===")

=== Numeric Variable Summary ===

Code

print(numeric_summary.to_string())

           Age  Duration_Days
count  2004.00        1999.00
mean     30.01          56.54
std       6.66          30.32
min      17.00          17.00
25%      25.00          28.00
50%      30.00          56.00
75%      35.00          85.00
max      66.00         109.00

Code

# Categorical summary
print("\n=== Categorical Variable Counts ===")


=== Categorical Variable Counts ===

Code

for col in ["Gender","Educational_Level","Training_Status","Placement_Status",
            "Employment_Status","Mentoring"]:
    print(f"\n{col}:")
    print(df[col].value_counts(dropna=False).to_string())


Gender:
Gender
Female    1233
Male       771

Educational_Level:
Educational_Level
Tertiary     1322
Secondary     682

Training_Status:
Training_Status
Certified        1869
Not Certified     121
Dropped            14

Placement_Status:
Placement_Status
Placed           1708
Not Available     139
Available          99
Self Employed      58

Employment_Status:
Employment_Status
Active      925
Inactive    729
NaN         350

Mentoring:
Mentoring
Mentored        1318
Not Mentored     686

Interpretation: The dataset covers 2,006 participants with a mean age of approximately 30 years, reflecting a young workforce target population. Training programmes ran for an average of 57 days (range: 17–109 days), indicating variability across training tracks. Female participants (62%) outnumber male participants (38%). The majority hold tertiary qualifications (66%) and 93% achieved certification. However, only 56% of those with a recorded employment status (925 of 1,654) are actively employed at the time of data extraction, equivalent to 46% of all participants, pointing to a significant post-placement retention or engagement challenge that this analysis seeks to understand.
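The shares quoted in this interpretation can be recomputed directly from the counts printed in the categorical output above. The sketch below copies those counts by hand; the helper name `shares` is hypothetical. It also shows how the choice of denominator changes the active-employment figure.

```python
# Counts copied from the categorical output above (2,004 rows after cleaning).
counts = {
    "Gender": {"Female": 1233, "Male": 771},
    "Educational_Level": {"Tertiary": 1322, "Secondary": 682},
    "Training_Status": {"Certified": 1869, "Not Certified": 121, "Dropped": 14},
    "Placement_Status": {"Placed": 1708, "Not Available": 139,
                         "Available": 99, "Self Employed": 58},
    # The 350 participants with no recorded status are left out here.
    "Employment_Status": {"Active": 925, "Inactive": 729},
}


def shares(group):
    """Convert a dict of counts into percentages rounded to one decimal."""
    total = sum(group.values())
    return {label: round(100 * n / total, 1) for label, n in group.items()}


for variable, group in counts.items():
    print(variable, shares(group))

# Same numerator, two denominators: recorded statuses only vs. all participants.
active = counts["Employment_Status"]["Active"]
recorded = sum(counts["Employment_Status"].values())
everyone = sum(counts["Gender"].values())
active_of_recorded = round(100 * active / recorded, 1)
active_of_all = round(100 * active / everyone, 1)
print("Active, of recorded statuses:", active_of_recorded)
print("Active, of all participants:", active_of_all)
```

Stating the denominator alongside each percentage avoids confusing the share of recorded statuses with the share of the whole cohort.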

5. Analysis — Technique 1: Exploratory Data Analysis

Theory: Exploratory Data Analysis (EDA) is the practice of summarising a dataset’s main characteristics, often using statistical and visual methods, before formal modelling. It is grounded in the tradition established by Tukey (1977) and is used to detect patterns, anomalies, and relationships that guide subsequent analysis. EDA is particularly important for administrative datasets where data quality issues are common and assumptions about the population may not hold.
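One common way to detect anomalies in this tradition is Tukey's fences, which flag values lying more than 1.5 interquartile ranges outside the quartiles. The sketch below applies the rule to invented ages. It is an illustration of the general technique only; the analysis in this report removes the age error with a fixed threshold rather than with fences.

```python
import statistics

# Hypothetical ages, invented for illustration; one value mimics a data entry error.
ages = [1, 24, 25, 27, 29, 30, 31, 33, 35, 36]

# Quartiles by linear interpolation over the observed range ("inclusive" method).
q1, _median, q3 = statistics.quantiles(ages, n=4, method="inclusive")
iqr = q3 - q1

# Tukey's fences: values beyond 1.5 * IQR from the quartiles are flagged for review.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
flagged = [a for a in ages if a < lower or a > upper]

print(f"Q1 = {q1}, Q3 = {q3}, IQR = {iqr}")
print(f"Fences: [{lower}, {upper}]")
print("Flagged for review:", flagged)
```

A flagged value is a prompt to check the source record, not an automatic deletion; a legitimate older participant could also fall outside the upper fence.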

Business Justification: As Director of Programmes, EDA provides the foundation for all downstream decisions. Before allocating resources, adjusting training tracks, or redesigning batch recruitment, I need to know who is actually in the programme, how they are distributed across demographic categories, and where the data has gaps. EDA surfaces these facts systematically rather than relying on anecdotal reports from field staff.

Code

library(ggplot2)
library(scales)

# Age distribution
p1 <- ggplot(df %>% filter(!is.na(Age)), aes(x = Age)) +
  geom_histogram(binwidth = 5, fill = "#2C7BB6", colour = "white", alpha = 0.85) +
  geom_vline(aes(xintercept = mean(Age, na.rm=TRUE)), 
             colour = "#D7191C", linetype = "dashed", linewidth = 1) +
  labs(title = "Age Distribution of ESP Participants",
       subtitle = "Dashed line = mean age (≈30 years)",
       x = "Age (years)", y = "Count") +
  theme_minimal(base_size = 13)
print(p1)

Code

# Certification rate by batch
cert_batch <- df %>%
  group_by(Batch, Training_Status) %>%
  summarise(n = n(), .groups = "drop") %>%
  group_by(Batch) %>%
  mutate(pct = n / sum(n) * 100)

p2 <- ggplot(cert_batch, aes(x = Batch, y = pct, fill = Training_Status)) +
  geom_col(position = "stack") +
  scale_fill_manual(values = c("Certified" = "#1A9641", 
                                "Not Certified" = "#FDAE61",
                                "Dropped" = "#D7191C")) +
  labs(title = "Training Status by Batch",
       x = "Batch", y = "Percentage (%)", fill = "Status") +
  theme_minimal(base_size = 13)
print(p2)

Code

# Placement status distribution
placement_counts <- df %>%
  count(Placement_Status) %>%
  mutate(pct = n / sum(n) * 100)

p3 <- ggplot(placement_counts, aes(x = reorder(Placement_Status, -n), y = pct, fill = Placement_Status)) +
  geom_col(show.legend = FALSE) +
  scale_fill_manual(values = c("Placed"="#1A9641","Not Available"="#D7191C",
                                "Available"="#FDAE61","Self Employed"="#2C7BB6")) +
  geom_text(aes(label = paste0(round(pct,1),"%")), vjust = -0.5, size = 4) +
  labs(title = "Placement Status of ESP Participants",
       x = "Placement Status", y = "Percentage (%)") +
  theme_minimal(base_size = 13)
print(p3)

Code

# Employment status breakdown
emp_counts <- df %>%
  filter(!is.na(Employment_Status), Employment_Status != "") %>%  # also drop blank (unrecorded) statuses
  count(Employment_Status) %>%
  mutate(pct = n/sum(n)*100)

p4 <- ggplot(emp_counts, aes(x = Employment_Status, y = pct, fill = Employment_Status)) +
  geom_col(show.legend = FALSE) +
  scale_fill_manual(values = c("Active"="#1A9641","Inactive"="#D7191C")) +
  geom_text(aes(label = paste0(round(pct,1),"%")), vjust = -0.5, size = 4) +
  labs(title = "Employment Status Among Participants with a Recorded Status",
       subtitle = "350 participants with no recorded employment status excluded",
       x = "Employment Status", y = "Percentage (%)") +
  theme_minimal(base_size = 13)
print(p4)

Code

import matplotlib.pyplot as plt
import matplotlib.ticker as mtick

fig, axes = plt.subplots(2, 2, figsize=(14, 10))
fig.suptitle("EDA — ESP Programme Overview", fontsize=15, fontweight="bold", y=1.01)

# Age distribution
ax = axes[0,0]
age_clean = df["Age"].dropna()
ax.hist(age_clean, bins=15, color="#2C7BB6", edgecolor="white", alpha=0.85)
ax.axvline(age_clean.mean(), color="#D7191C", linestyle="--", linewidth=1.5)
ax.set_title("Age Distribution")
ax.set_xlabel("Age (years)")
ax.set_ylabel("Count")

# Certification by Batch
ax = axes[0,1]
cert_pivot = pd.crosstab(df["Batch"], df["Training_Status"], normalize="index") * 100
cert_pivot[["Certified","Not Certified","Dropped"]].plot(
    kind="bar", stacked=True, ax=ax,
    color=["#1A9641","#FDAE61","#D7191C"])
ax.set_title("Training Status by Batch (%)")
ax.set_xlabel("Batch")
ax.set_ylabel("Percentage")
ax.legend(loc="lower right", fontsize=8)
ax.tick_params(axis="x", rotation=30)

# Placement Status
ax = axes[1,0]
pl_counts = df["Placement_Status"].value_counts()
pl_pct = pl_counts / pl_counts.sum() * 100
colors = {"Placed":"#1A9641","Not Available":"#D7191C",
          "Available":"#FDAE61","Self Employed":"#2C7BB6"}
bars = ax.bar(pl_pct.index, pl_pct.values,
              color=[colors.get(x,"grey") for x in pl_pct.index])
for bar, val in zip(bars, pl_pct.values):
    ax.text(bar.get_x()+bar.get_width()/2, bar.get_height()+0.5,
            f"{val:.1f}%", ha="center", fontsize=9)
ax.set_title("Placement Status (%)")
ax.set_xlabel("Status")
ax.set_ylabel("Percentage")
ax.tick_params(axis="x", rotation=15)

# Employment Status
ax = axes[1,1]
emp = df["Employment_Status"].dropna().value_counts()
emp_pct = emp / emp.sum() * 100
bars = ax.bar(emp_pct.index, emp_pct.values,
              color=["#1A9641" if x=="Active" else "#D7191C" for x in emp_pct.index])
for bar, val in zip(bars, emp_pct.values):
    ax.text(bar.get_x()+bar.get_width()/2, bar.get_height()+0.5,
            f"{val:.1f}%", ha="center", fontsize=9)
ax.set_title("Employment Status (Recorded Only)")
ax.set_xlabel("Status")
ax.set_ylabel("Percentage")

plt.tight_layout()
plt.savefig("eda_overview.png", dpi=150, bbox_inches="tight")
plt.show()

Code

print("EDA plots rendered.")

EDA plots rendered.

Interpretation: The EDA reveals four critical programme facts. First, the participant population is young (mean age ≈ 30 years), consistent with the programme’s youth employment mandate, though ages range up to 66, suggesting some adult re-skilling participation. Second, certification rates are high across all batches (above 90%), indicating that training delivery is effective at getting participants to completion. Third, about 85% of participants are placed and a further 3% are self-employed, leaving roughly 12% unplaced (7% not available and 5% available), so the placement infrastructure is largely functioning. Fourth, the more pressing issue is retention: among those with a recorded employment status, 56% are active and 44% are inactive. This active employment gap is the central problem this analysis seeks to explain.

6. Analysis — Technique 2: Visualisation

Theory: Data visualisation translates analytical findings into perceptual representations that allow patterns to be understood more rapidly than through tables alone. Effective visualisation for programme analytics follows the principle of encoding the most important comparison as the primary visual channel (Cleveland & McGill, 1984). For categorical outcome data such as employment status, bar charts showing percentages are a reliable choice because they encode values as position along a common scale, the kind of comparison that readers judge most accurately.

Business Justification: As Director of Programmes, visualisation is the primary tool for communicating performance to funders, government partners, and the board. The five plots below are designed to answer the five most frequently asked questions in programme review meetings: which training track performs best, does gender affect outcomes, does education level matter, which LGAs are underperforming, and has performance improved across batches?

Code

# Active employment rate by Training Category
active_cat <- df %>%
  filter(!is.na(Employment_Status), Employment_Status != "") %>%  # also drop blank (unrecorded) statuses
  group_by(Training_Category) %>%
  summarise(
    Total = n(),
    Active = sum(Employment_Status == "Active"),
    Rate = Active / Total * 100,
    .groups = "drop"
  ) %>%
  arrange(desc(Rate))

p1 <- ggplot(active_cat, aes(x = reorder(Training_Category, Rate), y = Rate, fill = Rate)) +
  geom_col() +
  geom_text(aes(label = paste0(round(Rate,1),"%")), hjust = -0.1, size = 3.8) +
  coord_flip() +
  scale_fill_gradient(low = "#FDAE61", high = "#1A9641") +
  labs(title = "Active Employment Rate by Training Category",
       x = "Training Category", y = "Active Employment Rate (%)",
       fill = "Rate (%)") +
  theme_minimal(base_size = 12) +
  ylim(0, 80)
print(p1)

Code

# Active employment rate by Gender
active_gender <- df %>%
  filter(!is.na(Employment_Status), Employment_Status != "") %>%  # also drop blank (unrecorded) statuses
  group_by(Gender) %>%
  summarise(Rate = mean(Employment_Status == "Active") * 100, .groups = "drop")

p2 <- ggplot(active_gender, aes(x = Gender, y = Rate, fill = Gender)) +
  geom_col(show.legend = FALSE, width = 0.5) +
  geom_text(aes(label = paste0(round(Rate,1),"%")), vjust = -0.5, size = 4.5) +
  scale_fill_manual(values = c("Female" = "#E66101", "Male" = "#2C7BB6")) +
  labs(title = "Active Employment Rate by Gender",
       x = "Gender", y = "Active Employment Rate (%)") +
  theme_minimal(base_size = 12) +
  ylim(0, 75)
print(p2)

Code

# Active employment rate by Educational Level
active_edu <- df %>%
  filter(!is.na(Employment_Status), Employment_Status != "") %>%  # also drop blank (unrecorded) statuses
  group_by(Educational_Level) %>%
  summarise(Rate = mean(Employment_Status == "Active") * 100, .groups = "drop")

p3 <- ggplot(active_edu, aes(x = Educational_Level, y = Rate, fill = Educational_Level)) +
  geom_col(show.legend = FALSE, width = 0.5) +
  geom_text(aes(label = paste0(round(Rate,1),"%")), vjust = -0.5, size = 4.5) +
  scale_fill_manual(values = c("Secondary" = "#FDAE61", "Tertiary" = "#1A9641")) +
  labs(title = "Active Employment Rate by Educational Level",
       x = "Educational Level", y = "Active Employment Rate (%)") +
  theme_minimal(base_size = 12) +
  ylim(0, 75)
print(p3)

Code

# Top 10 LGAs by active employment rate
active_lga <- df %>%
  filter(!is.na(Employment_Status), Employment_Status != "") %>%  # also drop blank (unrecorded) statuses
  group_by(LGA) %>%
  summarise(
    Total = n(),
    Rate = mean(Employment_Status == "Active") * 100,
    .groups = "drop"
  ) %>%
  filter(Total >= 20) %>%
  arrange(desc(Rate)) %>%
  slice_head(n = 10)

p4 <- ggplot(active_lga, aes(x = reorder(LGA, Rate), y = Rate, fill = Rate)) +
  geom_col() +
  geom_text(aes(label = paste0(round(Rate,1),"%")), hjust = -0.1, size = 3.5) +
  coord_flip() +
  scale_fill_gradient(low = "#FDAE61", high = "#1A9641") +
  labs(title = "Top 10 LGAs by Active Employment Rate",
       subtitle = "LGAs with ≥20 participants shown",
       x = "LGA", y = "Active Employment Rate (%)", fill = "Rate (%)") +
  theme_minimal(base_size = 12) +
  ylim(0, 80)
print(p4)

Code

# Active employment rate trend across batches
active_batch <- df %>%
  filter(!is.na(Employment_Status), Employment_Status != "") %>%  # also drop blank (unrecorded) statuses
  group_by(Batch) %>%
  summarise(Rate = mean(Employment_Status == "Active") * 100, .groups = "drop")

p5 <- ggplot(active_batch, aes(x = Batch, y = Rate, group = 1)) +
  geom_line(colour = "#2C7BB6", linewidth = 1.2) +
  geom_point(size = 4, colour = "#2C7BB6") +
  geom_text(aes(label = paste0(round(Rate,1),"%")), vjust = -1, size = 4) +
  labs(title = "Active Employment Rate Trend Across Batches",
       x = "Batch", y = "Active Employment Rate (%)") +
  theme_minimal(base_size = 12) +
  ylim(0, 80)
print(p5)

Code

import matplotlib.pyplot as plt
import matplotlib.cm as cm
import numpy as np

df_emp = df[df["Employment_Status"].notna()].copy()
df_emp["Active_bin"] = (df_emp["Employment_Status"] == "Active").astype(int)

fig, axes = plt.subplots(3, 2, figsize=(14, 16))
fig.suptitle("Visualisation — Employment Outcomes by Key Dimensions", 
             fontsize=15, fontweight="bold")

# 1. By Training Category
ax = axes[0,0]
cat_rate = df_emp.groupby("Training_Category")["Active_bin"].mean().sort_values() * 100
colors = cm.RdYlGn(np.linspace(0.2, 0.85, len(cat_rate)))
bars = ax.barh(cat_rate.index, cat_rate.values, color=colors)
for bar, val in zip(bars, cat_rate.values):
    ax.text(val+0.5, bar.get_y()+bar.get_height()/2,
            f"{val:.1f}%", va="center", fontsize=9)
ax.set_title("Active Employment Rate\nby Training Category")
ax.set_xlabel("Active Employment Rate (%)")
ax.set_xlim(0, 80)

(0.0, 80.0)

Code

# 2. By Gender
ax = axes[0,1]
gen_rate = df_emp.groupby("Gender")["Active_bin"].mean() * 100
bars = ax.bar(gen_rate.index, gen_rate.values,
              color=["#E66101","#2C7BB6"], width=0.4)
for bar, val in zip(bars, gen_rate.values):
    ax.text(bar.get_x()+bar.get_width()/2, bar.get_height()+0.5,
            f"{val:.1f}%", ha="center", fontsize=10)
ax.set_title("Active Employment Rate\nby Gender")
ax.set_ylabel("Rate (%)")
ax.set_ylim(0, 75)

(0.0, 75.0)

Code

# 3. By Educational Level
ax = axes[1,0]
edu_rate = df_emp.groupby("Educational_Level")["Active_bin"].mean() * 100
bars = ax.bar(edu_rate.index, edu_rate.values,
              color=["#FDAE61","#1A9641"], width=0.4)
for bar, val in zip(bars, edu_rate.values):
    ax.text(bar.get_x()+bar.get_width()/2, bar.get_height()+0.5,
            f"{val:.1f}%", ha="center", fontsize=10)
ax.set_title("Active Employment Rate\nby Educational Level")
ax.set_ylabel("Rate (%)")
ax.set_ylim(0, 75)

(0.0, 75.0)

Code

# 4. Top LGAs
ax = axes[1,1]
lga_rate = (df_emp.groupby("LGA")
            .agg(Total=("Active_bin","count"), Rate=("Active_bin","mean"))
            .query("Total >= 20"))
lga_rate["Rate"] = lga_rate["Rate"] * 100
lga_top = lga_rate.sort_values("Rate", ascending=True).tail(10)
colors2 = cm.RdYlGn(np.linspace(0.2, 0.85, len(lga_top)))
bars = ax.barh(lga_top.index, lga_top["Rate"], color=colors2)
for bar, val in zip(bars, lga_top["Rate"]):
    ax.text(val+0.5, bar.get_y()+bar.get_height()/2,
            f"{val:.1f}%", va="center", fontsize=8)
ax.set_title("Top 10 LGAs by Active\nEmployment Rate (n≥20)")
ax.set_xlabel("Rate (%)")
ax.set_xlim(0, 80)

(0.0, 80.0)

Code

# 5. Trend across batches
ax = axes[2,0]
batch_rate = df_emp.groupby("Batch")["Active_bin"].mean() * 100
ax.plot(batch_rate.index, batch_rate.values, marker="o", 
        color="#2C7BB6", linewidth=2)
for x, y in zip(batch_rate.index, batch_rate.values):
    ax.text(x, y+1, f"{y:.1f}%", ha="center", fontsize=9)
ax.set_title("Active Employment Rate\nTrend Across Batches")
ax.set_xlabel("Batch")
ax.set_ylabel("Rate (%)")
ax.set_ylim(0, 80)

(0.0, 80.0)

Code

ax.tick_params(axis="x", rotation=20)

# 6. By Mentoring
ax = axes[2,1]
ment_rate = df_emp.groupby("Mentoring")["Active_bin"].mean() * 100
bars = ax.bar(ment_rate.index, ment_rate.values,
              color=["#1A9641","#D7191C"], width=0.4)
for bar, val in zip(bars, ment_rate.values):
    ax.text(bar.get_x()+bar.get_width()/2, bar.get_height()+0.5,
            f"{val:.1f}%", ha="center", fontsize=10)
ax.set_title("Active Employment Rate\nby Mentoring Status")
ax.set_ylabel("Rate (%)")
ax.set_ylim(0, 75)

(0.0, 75.0)

Code

plt.tight_layout()
plt.savefig("visualisation.png", dpi=150, bbox_inches="tight")
plt.show()

Code

print("Visualisation plots rendered.")

Visualisation plots rendered.

Interpretation: Five patterns emerge with direct programmatic implications. First, Information Technology and Business Support participants achieve the highest active employment rates, while Construction and Beauty trail significantly — suggesting a structural mismatch between placements in those sectors and sustained employer demand. Second, gender differences in employment rates are present but modest, indicating that the programme broadly serves both groups without major disparity. Third, tertiary-educated participants achieve materially higher active employment rates than secondary-educated ones — a gap that mentoring or additional bridging support could address. Fourth, LGA-level variation is substantial, pointing to geographic factors (transport, network access, employer concentration) that batch design should account for. Fifth, the batch-level trend shows whether programme improvements over time are translating into better outcomes — a key metric for any programme review.

7. Analysis — Technique 3: Hypothesis Testing

Theory: Hypothesis testing provides a formal statistical framework for determining whether observed differences between groups are likely to reflect true population differences or could plausibly arise from random sampling variation. For categorical variables (such as employment status vs. gender or training category), the Chi-square test of independence is appropriate. It tests whether two categorical variables are statistically independent (Agresti, 2002). Where the Chi-square assumption of expected cell counts ≥ 5 is met, the p-value indicates the probability of observing an association at least as strong as the one in the data if the null hypothesis (independence) were true. Effect size is measured using Cramér’s V, computed as the square root of χ² / (n × (k − 1)), where n is the number of observations and k is the smaller of the number of rows and columns; it ranges from 0 (no association) to 1 (perfect association).
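To make the mechanics concrete, the sketch below works through a Chi-square test of independence and Cramér's V by hand for a 2×2 table. The counts are invented for illustration and are not the programme's results. The closed-form p-value used here holds only for one degree of freedom, and no continuity correction is applied.

```python
import math

# Hypothetical 2x2 table (rows: mentored / not mentored; columns: active / inactive).
# Counts are invented for illustration and are not the programme's results.
observed = [[60, 40],
            [30, 70]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected counts under independence: row total * column total / n.
expected = [[r * c / n for c in col_totals] for r in row_totals]
assert min(min(row) for row in expected) >= 5, "expected-count rule of thumb not met"

# Pearson Chi-square statistic (no continuity correction).
chi_sq = sum((o - e) ** 2 / e
             for obs_row, exp_row in zip(observed, expected)
             for o, e in zip(obs_row, exp_row))

# Degrees of freedom: (rows - 1) * (columns - 1).
dof = (len(observed) - 1) * (len(observed[0]) - 1)

# With 1 degree of freedom the upper-tail probability has a closed form.
p_value = math.erfc(math.sqrt(chi_sq / 2))

# Cramér's V = sqrt(chi_sq / (n * (k - 1))), with k the smaller table dimension.
k = min(len(observed), len(observed[0]))
cramers_v = math.sqrt(chi_sq / (n * (k - 1)))

print(f"Chi-square = {chi_sq:.3f}, df = {dof}, p = {p_value:.2e}")
print(f"Cramér's V = {cramers_v:.4f}")
```

Both R's `chisq.test` and SciPy's `chi2_contingency` apply Yates' continuity correction to 2×2 tables by default, so they would report a slightly smaller statistic for the same table.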

Business Justification: Observed differences in placement and employment rates across training categories, gender, and mentoring status could simply reflect chance variation rather than a systematic pattern. As Director of Programmes, I need statistical confirmation before reallocating training budgets or restructuring the mentoring programme. Hypothesis testing provides that confirmation.

Code

library(vcd)  # for Cramer's V

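# Note: blank Employment_Status values are read as "" rather than NA, so this
# filter retains them and the tables below carry a third (blank) outcome column
df_test <- df %>% filter(!is.na(Employment_Status))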

# --- Hypothesis 1: Training Category vs Employment Status ---
cat("=== HYPOTHESIS 1 ===\n")

=== HYPOTHESIS 1 ===

Code

cat("H0: Training category and employment status are independent\n")

H0: Training category and employment status are independent

Code

cat("H1: Training category significantly affects employment status\n\n")

H1: Training category significantly affects employment status

Code

tbl1 <- table(df_test$Training_Category, df_test$Employment_Status)
chi1 <- chisq.test(tbl1)
print(chi1)


    Pearson's Chi-squared test

data:  tbl1
X-squared = 210.49, df = 10, p-value < 2.2e-16

Code

v1 <- sqrt(chi1$statistic / (sum(tbl1) * (min(dim(tbl1)) - 1)))
cat(paste0("Cramer's V = ", round(v1, 4), "\n"))

Cramer's V = 0.2291

Code

# --- Hypothesis 2: Mentoring vs Employment Status ---
cat("\n=== HYPOTHESIS 2 ===\n")


=== HYPOTHESIS 2 ===

Code

cat("H0: Mentoring status and employment status are independent\n")

H0: Mentoring status and employment status are independent

Code

cat("H1: Mentored participants have significantly different employment rates\n\n")

H1: Mentored participants have significantly different employment rates

Code

tbl2 <- table(df_test$Mentoring, df_test$Employment_Status)
chi2 <- chisq.test(tbl2)
print(chi2)


    Pearson's Chi-squared test

data:  tbl2
X-squared = 688.01, df = 2, p-value < 2.2e-16

Code

v2 <- sqrt(chi2$statistic / (sum(tbl2) * (min(dim(tbl2)) - 1)))
cat(paste0("Cramer's V = ", round(v2, 4), "\n"))

Cramer's V = 0.5858

Code

# --- Hypothesis 3: Educational Level vs Employment Status ---
cat("\n=== HYPOTHESIS 3 ===\n")


=== HYPOTHESIS 3 ===

Code

cat("H0: Educational level and employment status are independent\n")

H0: Educational level and employment status are independent

Code

cat("H1: Educational level significantly affects employment status\n\n")

H1: Educational level significantly affects employment status

Code

tbl3 <- table(df_test$Educational_Level, df_test$Employment_Status)
chi3 <- chisq.test(tbl3)
print(chi3)


    Pearson's Chi-squared test

data:  tbl3
X-squared = 18.655, df = 2, p-value = 8.894e-05

Code

v3 <- sqrt(chi3$statistic / (sum(tbl3) * (min(dim(tbl3)) - 1)))
cat(paste0("Cramer's V = ", round(v3, 4), "\n"))

Cramer's V = 0.0965

Code

# --- Summary Table ---
results_tbl <- data.frame(
  Hypothesis = c("Training Category vs Employment Status",
                 "Mentoring vs Employment Status",
                 "Educational Level vs Employment Status"),
  Chi_sq = round(c(chi1$statistic, chi2$statistic, chi3$statistic), 3),
  df = c(chi1$parameter, chi2$parameter, chi3$parameter),
  p_value = round(c(chi1$p.value, chi2$p.value, chi3$p.value), 4),
  Cramers_V = round(c(v1, v2, v3), 4),
  Decision = ifelse(c(chi1$p.value, chi2$p.value, chi3$p.value) < 0.05,
                    "Reject H0", "Fail to Reject H0")
)

results_tbl %>%
  kable(caption = "Hypothesis Testing Results Summary") %>%
  kable_styling(bootstrap_options = c("striped","hover"), full_width = FALSE)

Hypothesis Testing Results Summary
Hypothesis	Chi_sq	df	p_value	Cramers_V	Decision
Training Category vs Employment Status	210.493	10	< 0.0001	0.2291	Reject H0
Mentoring vs Employment Status	688.009	2	< 0.0001	0.5858	Reject H0
Educational Level vs Employment Status	18.655	2	< 0.0001	0.0965	Reject H0

Code

from scipy.stats import chi2_contingency
import pandas as pd
import numpy as np

df_test = df[df["Employment_Status"].notna()].copy()

def cramers_v(chi2, n, k):
    return np.sqrt(chi2 / (n * (k - 1)))

hypotheses = [
    ("Training_Category", "Employment_Status",
     "H1: Training category vs Employment status"),
    ("Mentoring", "Employment_Status",
     "H2: Mentoring status vs Employment status"),
    ("Educational_Level", "Employment_Status",
     "H3: Educational level vs Employment status"),
]

results = []
for var1, var2, label in hypotheses:
    ct = pd.crosstab(df_test[var1], df_test[var2])
    chi2, p, dof, expected = chi2_contingency(ct)
    n = ct.values.sum()
    k = min(ct.shape)
    v = cramers_v(chi2, n, k)
    decision = "Reject H0" if p < 0.05 else "Fail to Reject H0"
    results.append({
        "Hypothesis": label,
        "Chi2": round(chi2, 3),
        "df": dof,
        "p-value": round(p, 4),
        "Cramer's V": round(v, 4),
        "Decision": decision
    })
    print(f"\n{label}")
    print(f"  H0: The two variables are independent")
    print(f"  Chi2 = {chi2:.3f}, df = {dof}, p = {p:.4f}")
    print(f"  Cramer's V = {v:.4f}")
    print(f"  Decision: {decision}")


H1: Training category vs Employment status
  H0: The two variables are independent
  Chi2 = 187.759, df = 5, p = 0.0000
  Cramer's V = 0.3369
  Decision: Reject H0

H2: Mentoring status vs Employment status
  H0: The two variables are independent
  Chi2 = 673.990, df = 1, p = 0.0000
  Cramer's V = 0.6384
  Decision: Reject H0

H3: Educational level vs Employment status
  H0: The two variables are independent
  Chi2 = 18.045, df = 1, p = 0.0000
  Cramer's V = 0.1045
  Decision: Reject H0

Code

print("\n\n=== Summary Table ===")



=== Summary Table ===

Code

results_df = pd.DataFrame(results)
print(results_df.to_string(index=False))

                                Hypothesis    Chi2  df  p-value  Cramer's V  Decision
H1: Training category vs Employment status 187.759   5      0.0      0.3369 Reject H0
 H2: Mentoring status vs Employment status 673.990   1      0.0      0.6384 Reject H0
H3: Educational level vs Employment status  18.045   1      0.0      0.1045 Reject H0

Interpretation: All three null hypotheses are rejected at the 5% significance level, so the observed associations are unlikely to be due to chance alone. Mentoring status has by far the strongest association with employment status (Cramér’s V ≈ 0.59 in R, 0.64 in Python), which supports treating mentoring as a central programme variable rather than an optional extra; the chi-squared test itself establishes association only, not direction. Training category shows a moderate association (V ≈ 0.23 in R, 0.34 in Python), confirming that the track a participant is assigned to is materially related to their long-term employment outcome, not just their placement rate. Educational level shows a significant but weak association (V ≈ 0.10), suggesting that the programme’s training content partially, but not entirely, compensates for lower entry qualifications. The R and Python statistics differ because the two runs do not test identical tables: R reports df = 10, 2 and 2, whereas Python reports df = 5, 1 and 1, which indicates a third employment-status level in the R data. Business action: These results justify a formal policy on mentoring coverage and a review of Construction and Beauty track curricula and employer partnerships, based on statistical evidence rather than field observation.
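The effect sizes in this section use Cramér’s V = sqrt(χ² / (n (k − 1))), where k is the smaller dimension of the contingency table. As a minimal sketch, the same formula can be inverted to recover the approximate sample size behind each reported test. The function names are illustrative, and the inputs are the rounded Mentoring statistics printed in the R and Python outputs, so the recovered n is approximate.

```python
import math


def cramers_v_from_stat(chi2: float, n: float, k: int) -> float:
    """Cramér's V from a chi-squared statistic, sample size n and k = min(rows, cols)."""
    return math.sqrt(chi2 / (n * (k - 1)))


def implied_n(chi2: float, v: float, k: int) -> float:
    """Invert the Cramér's V formula to recover the approximate sample size."""
    return chi2 / (v ** 2 * (k - 1))


# Mentoring vs Employment Status, rounded values copied from the printed outputs.
# Mentoring has two levels, so k = 2 in both runs even though df differs (2 vs 1).
n_r = implied_n(688.009, 0.5858, k=2)    # R output
n_py = implied_n(673.990, 0.6384, k=2)   # Python output

print(f"Implied n (R):      {n_r:.0f}")
print(f"Implied n (Python): {n_py:.0f}")
```

If the implied sample sizes differ materially, the two pipelines are not testing the same rows. One plausible cause, offered here as an inference rather than something the outputs state, is that blank employment statuses are kept as a third category in R but treated as missing by pandas.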

8. Analysis — Technique 4: Correlation Analysis

Theory: Correlation analysis measures the strength and direction of linear relationships between variables. Pearson’s correlation coefficient (r) is used for continuous variables and ranges from −1 (perfect negative relationship) to +1 (perfect positive relationship). Point-biserial correlation is used where one variable is binary (Adi, 2026). A correlation matrix with heatmap allows simultaneous inspection of all pairwise relationships and is standard practice in programme analytics for identifying multicollinearity and key drivers of an outcome variable.
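To illustrate the point-biserial case, the sketch below computes the coefficient from its group-mean formula and compares it with Pearson’s r on a 0/1-coded variable; the two coincide. It uses pure Python, and the eight data points are invented for illustration and are not ESP records.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient for two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def point_biserial(binary, values):
    """Point-biserial r: (M1 - M0) / s_n * sqrt(p * q), with s_n the population SD."""
    n = len(values)
    g1 = [v for b, v in zip(binary, values) if b == 1]
    g0 = [v for b, v in zip(binary, values) if b == 0]
    p, q = len(g1) / n, len(g0) / n
    mean = sum(values) / n
    s_n = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return (sum(g1) / len(g1) - sum(g0) / len(g0)) / s_n * math.sqrt(p * q)


# Hypothetical data: 1 = active, 0 = inactive; ages in years.
active = [1, 1, 0, 1, 0, 0, 1, 0]
age = [24, 27, 31, 22, 29, 35, 26, 30]

r_pb = point_biserial(active, age)
r = pearson_r(active, age)
print(f"point-biserial = {r_pb:.4f}, Pearson = {r:.4f}")
```

This equivalence is why a single correlation matrix over a mix of continuous and 0/1-encoded columns is a valid way to obtain point-biserial coefficients.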

Business Justification: As Director of Programmes, understanding which participant characteristics co-vary — and which co-vary with the employment outcome — informs both targeting decisions and the design of intake assessments. If age and educational level are correlated, for example, segmenting by one may already capture variation in the other, and separate interventions may not be needed.

Code

library(corrplot)

# Encode variables numerically for correlation
df_corr <- df %>%
  filter(!is.na(Employment_Status), !is.na(Age), !is.na(Duration_Days)) %>%
  mutate(
    Active          = ifelse(Employment_Status == "Active", 1, 0),
    Female          = ifelse(Gender == "Female", 1, 0),
    Tertiary        = ifelse(Educational_Level == "Tertiary", 1, 0),
    Mentored        = ifelse(Mentoring == "Mentored", 1, 0),
    Placed          = ifelse(Placement_Status == "Placed", 1, 0),
    Certified       = ifelse(Training_Status == "Certified", 1, 0)
  ) %>%
  select(Age, Duration_Days, Female, Tertiary, Mentored, Certified, Placed, Active)

corr_matrix <- cor(df_corr, use = "complete.obs")

corrplot(corr_matrix,
         method = "color",
         type = "upper",
         tl.col = "black",
         tl.srt = 45,
         addCoef.col = "black",
         number.cex = 0.75,
         col = colorRampPalette(c("#D7191C","white","#1A9641"))(200),
         title = "Correlation Matrix — ESP Key Variables",
         mar = c(0,0,1,0))

Code

# Print top correlations with Active employment
corr_active <- sort(corr_matrix["Active",], decreasing = TRUE)
corr_active_df <- data.frame(
  Variable = names(corr_active),
  Correlation_with_Active = round(corr_active, 4)
) %>% filter(Variable != "Active")

corr_active_df %>%
  kable(caption = "Correlation of All Variables with Active Employment Status") %>%
  kable_styling(bootstrap_options = c("striped","hover"), full_width = FALSE)

Correlation of All Variables with Active Employment Status
Variable	Correlation_with_Active
Placed	0.3774
Certified	0.1797
Female	0.1048
Age	-0.0411
Tertiary	-0.0865
Duration_Days	-0.2213
Mentored	-0.5443

Code

import seaborn as sns
import matplotlib.pyplot as plt

df_corr = df[df["Employment_Status"].notna()].copy()
df_corr = df_corr[df_corr["Age"].notna() & df_corr["Duration_Days"].notna()]

df_corr["Active"]      = (df_corr["Employment_Status"] == "Active").astype(int)
df_corr["Female"]      = (df_corr["Gender"] == "Female").astype(int)
df_corr["Tertiary"]    = (df_corr["Educational_Level"] == "Tertiary").astype(int)
df_corr["Mentored"]    = (df_corr["Mentoring"] == "Mentored").astype(int)
df_corr["Placed"]      = (df_corr["Placement_Status"] == "Placed").astype(int)
df_corr["Certified"]   = (df_corr["Training_Status"] == "Certified").astype(int)

cols = ["Age","Duration_Days","Female","Tertiary","Mentored","Certified","Placed","Active"]
corr_mat = df_corr[cols].corr()

fig, ax = plt.subplots(figsize=(9, 7))
# Mask the upper triangle and the diagonal so each pair is shown once
mask = np.triu(np.ones_like(corr_mat, dtype=bool), k=0)
sns.heatmap(corr_mat, mask=mask, annot=True, fmt=".2f", cmap="RdYlGn",
            center=0, vmin=-1, vmax=1, linewidths=0.5,
            ax=ax, square=True, cbar_kws={"shrink": 0.8})
ax.set_title("Correlation Matrix — ESP Key Variables", fontsize=13, pad=12)
plt.tight_layout()
plt.savefig("correlation_heatmap.png", dpi=150, bbox_inches="tight")
plt.show()

Code

# Top correlations with Active
print("\nCorrelation with Active Employment Status:")


Correlation with Active Employment Status:

Code

top_corr = corr_mat["Active"].drop("Active").sort_values(ascending=False)
print(top_corr.round(4).to_string())

Female           0.1282
Placed          -0.0092
Certified       -0.0143
Age             -0.0712
Tertiary        -0.1057
Duration_Days   -0.2756
Mentored        -0.6396

Interpretation: The three largest correlations with active employment status are: (1) Mentored (r ≈ −0.54 in R, −0.64 in Python): mentoring has the strongest relationship with the outcome, consistent with the hypothesis-testing result, but the sign is negative, meaning that participants coded as mentored are less often actively employed in this dataset. This runs against the programme’s expectation and should be checked against the coding of the Mentoring variable and against how mentors are assigned (see Limitations) before it is read causally. (2) Placed (r ≈ 0.38 in R): formal placement is positively related to active employment, as expected, although the Python run shows essentially no relationship (r ≈ −0.01), so the two pipelines should be reconciled. (3) Duration of training (r ≈ −0.22 to −0.28): longer programmes are associated with somewhat lower active employment, so the data give no support for across-the-board programme extension. Tertiary education shows a weak negative correlation (r ≈ −0.09 to −0.11) rather than the positive one expected from broader labour market evidence. Age shows a near-zero correlation (r ≈ −0.04 to −0.07), indicating that the programme serves younger and older participants with similar effectiveness and that age-based targeting is not warranted.

9. Analysis — Technique 5: Logistic Regression

Theory: Logistic regression is a classification technique that models the probability of a binary outcome — in this case, active employment (1) versus inactive (0) — as a function of predictor variables (Adi, 2026). Unlike linear regression, logistic regression uses the logit link function to constrain predicted probabilities between 0 and 1. Model coefficients are interpreted as log-odds; exponentiated coefficients (odds ratios) are more intuitive for a business audience. Model performance is assessed using the confusion matrix, classification accuracy, and the Area Under the ROC Curve (AUC), where AUC = 0.5 indicates no discriminatory power and AUC = 1 indicates perfect discrimination.
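As a small worked example of the log-odds interpretation, the sketch below converts a coefficient into an odds ratio and shows how that ratio moves a probability. The coefficient is the rounded Duration_Days estimate from the R model summary in this section; the 50% baseline probability and the function names are illustrative assumptions.

```python
import math


def odds_ratio(coef: float) -> float:
    """Exponentiate a logit coefficient to get the odds ratio per one-unit increase."""
    return math.exp(coef)


def shift_probability(p_base: float, coef: float, delta: float = 1.0) -> float:
    """Probability after raising a predictor by `delta`, holding the others fixed."""
    logit = math.log(p_base / (1 - p_base)) + coef * delta
    return 1 / (1 + math.exp(-logit))


coef_duration = -0.3014                            # Duration_Days, rounded, R summary
or_duration = odds_ratio(coef_duration)
p_after = shift_probability(0.50, coef_duration)   # 50% baseline is an assumption

print(f"Odds ratio per extra day: {or_duration:.4f}")
print(f"Probability moves from 0.50 to {p_after:.3f}")
```

An odds ratio below 1 lowers the odds multiplicatively, but the change in probability depends on the starting point, which is why odds ratios and probability differences should not be quoted interchangeably to a business audience.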

Business Justification: As Director of Programmes, the most consequential decision I make is which applicant profiles to prioritise during batch recruitment. A logistic regression model that identifies the statistically significant predictors of active employment provides a defensible, replicable scoring framework for future selection decisions — one that can be presented to government partners and donors as evidence of data-driven programme management.

Code

library(pROC)

# Prepare modelling dataset
df_model <- df %>%
  filter(!is.na(Employment_Status), !is.na(Age), !is.na(Duration_Days)) %>%
  mutate(
    Active            = ifelse(Employment_Status == "Active", 1, 0),
    Tertiary          = ifelse(Educational_Level == "Tertiary", 1, 0),
    Mentored          = ifelse(Mentoring == "Mentored", 1, 0),
    Female            = ifelse(Gender == "Female", 1, 0),
    Training_Category = relevel(factor(Training_Category), ref = "Business Support")
  )

# Train/test split (70/30)
set.seed(42)
train_idx <- sample(nrow(df_model), 0.7 * nrow(df_model))
train_df  <- df_model[train_idx, ]
test_df   <- df_model[-train_idx, ]

# Logistic regression model
model <- glm(Active ~ Age + Duration_Days + Female + Tertiary + Mentored +
               Training_Category,
             data = train_df, family = binomial())

summary(model)


Call:
glm(formula = Active ~ Age + Duration_Days + Female + Tertiary + 
    Mentored + Training_Category, family = binomial(), data = train_df)

Coefficients:
                                          Estimate Std. Error z value Pr(>|z|)
(Intercept)                              8.9541202  1.1677097   7.668 1.75e-14
Age                                     -0.0005825  0.0138045  -0.042    0.966
Duration_Days                           -0.3014242  0.0396656  -7.599 2.98e-14
Female                                   0.2556068  0.2363407   1.082    0.279
Tertiary                                -0.2452332  0.1951400  -1.257    0.209
Mentored                                -7.4868262  0.6736672 -11.114  < 2e-16
Training_CategoryBeauty                  4.8591524  0.7317165   6.641 3.12e-11
Training_CategoryConstruction           23.5662915  2.6651861   8.842  < 2e-16
Training_CategoryFashion                22.9600596  2.6787496   8.571  < 2e-16
Training_CategoryHospitality            16.9727967  1.5982182  10.620  < 2e-16
Training_CategoryInformation Technology  1.6547168  0.3457817   4.785 1.71e-06
                                           
(Intercept)                             ***
Age                                        
Duration_Days                           ***
Female                                     
Tertiary                                   
Mentored                                ***
Training_CategoryBeauty                 ***
Training_CategoryConstruction           ***
Training_CategoryFashion                ***
Training_CategoryHospitality            ***
Training_CategoryInformation Technology ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1931.84  on 1398  degrees of freedom
Residual deviance:  876.39  on 1388  degrees of freedom
AIC: 898.39

Number of Fisher Scoring iterations: 8

Code

# Odds ratios
or_df <- data.frame(
  Predictor = names(coef(model)),
  Coefficient = round(coef(model), 4),
  Odds_Ratio  = round(exp(coef(model)), 4),
  p_value     = round(summary(model)$coefficients[,4], 4)
) %>%
  filter(Predictor != "(Intercept)") %>%
  arrange(p_value)

or_df %>%
  kable(caption = "Logistic Regression — Coefficients and Odds Ratios") %>%
  kable_styling(bootstrap_options = c("striped","hover"), full_width = FALSE)

Logistic Regression — Coefficients and Odds Ratios
Predictor	Coefficient	Odds_Ratio	p_value
Duration_Days	-0.3014	0.7398	< 0.0001
Mentored	-7.4868	0.0006	< 0.0001
Training_CategoryBeauty	4.8592	128.9149	< 0.0001
Training_CategoryConstruction	23.5663	1.716763e+10	< 0.0001
Training_CategoryFashion	22.9601	9.363262e+09	< 0.0001
Training_CategoryHospitality	16.9728	2.350671e+07	< 0.0001
Training_CategoryInformation Technology	1.6547	5.2316	< 0.0001
Tertiary	-0.2452	0.7825	0.2089
Female	0.2556	1.2912	0.2795
Age	-0.0006	0.9994	0.9663

Code

# Confusion matrix
test_df$predicted_prob <- predict(model, newdata = test_df, type = "response")
test_df$predicted_class <- ifelse(test_df$predicted_prob >= 0.5, 1, 0)
cm <- table(Actual = test_df$Active, Predicted = test_df$predicted_class)
print(cm)

      Predicted
Actual   0   1
     0 275  48
     1  41 236

Code

acc <- sum(diag(cm)) / sum(cm)
cat(paste0("\nAccuracy: ", round(acc * 100, 2), "%\n"))


Accuracy: 85.17%

Code

# ROC curve
roc_obj <- roc(test_df$Active, test_df$predicted_prob, quiet = TRUE)
plot(roc_obj, col = "#2C7BB6", lwd = 2,
     main = paste0("ROC Curve — Logistic Regression (AUC = ",
                   round(auc(roc_obj), 3), ")"))
abline(a = 0, b = 1, lty = 2, col = "grey50")

Code

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import (confusion_matrix, classification_report,
                              roc_auc_score, roc_curve, ConfusionMatrixDisplay)
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

df_model = df[df["Employment_Status"].notna() &
              df["Age"].notna() &
              df["Duration_Days"].notna()].copy()

df_model["Active"]   = (df_model["Employment_Status"] == "Active").astype(int)
df_model["Tertiary"] = (df_model["Educational_Level"] == "Tertiary").astype(int)
df_model["Mentored"] = (df_model["Mentoring"] == "Mentored").astype(int)
df_model["Female"]   = (df_model["Gender"] == "Female").astype(int)

# One-hot encode Training Category (drop Business Support as reference)
cat_dummies = pd.get_dummies(df_model["Training_Category"], drop_first=False)
cat_dummies = cat_dummies.drop(columns=["Business Support"], errors="ignore")
df_model = pd.concat([df_model, cat_dummies], axis=1)

feature_cols = (["Age","Duration_Days","Female","Tertiary","Mentored"] +
                list(cat_dummies.columns))

X = df_model[feature_cols].fillna(0)
y = df_model["Active"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

model = LogisticRegression(max_iter=1000, random_state=42)
model.fit(X_train, y_train)

LogisticRegression(max_iter=1000, random_state=42)


Code

# Coefficients
coef_df = pd.DataFrame({
    "Predictor": feature_cols,
    "Coefficient": model.coef_[0].round(4),
    "Odds_Ratio": np.exp(model.coef_[0]).round(4)
}).sort_values("Odds_Ratio", ascending=False)
print("=== Logistic Regression Coefficients ===")

=== Logistic Regression Coefficients ===

Code

print(coef_df.to_string(index=False))

             Predictor  Coefficient  Odds_Ratio
           Hospitality       4.7576    116.4686
          Construction       4.3998     81.4350
               Fashion       4.1477     63.2874
                Female       0.2811      1.3246
Information Technology       0.1961      1.2166
                   Age      -0.0130      0.9870
         Duration_Days      -0.0511      0.9502
                Beauty      -0.3821      0.6824
              Tertiary      -0.4287      0.6514
              Mentored      -6.0138      0.0024

Code

# Evaluation
y_pred  = model.predict(X_test)
y_proba = model.predict_proba(X_test)[:,1]
acc  = (y_pred == y_test).mean()
auc  = roc_auc_score(y_test, y_proba)
print(f"\nAccuracy: {acc*100:.2f}%")


Accuracy: 89.94%

Code

print(f"AUC: {auc:.4f}")

AUC: 0.9621

Code

print("\nClassification Report:")


Classification Report:

Code

print(classification_report(y_test, y_pred, target_names=["Inactive","Active"]))

              precision    recall  f1-score   support

    Inactive       0.84      0.95      0.89       216
      Active       0.96      0.86      0.91       281

    accuracy                           0.90       497
   macro avg       0.90      0.91      0.90       497
weighted avg       0.91      0.90      0.90       497

Code

# Confusion matrix + ROC
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

ConfusionMatrixDisplay(confusion_matrix(y_test, y_pred),
                       display_labels=["Inactive","Active"]).plot(ax=axes[0], cmap="Blues")
axes[0].set_title("Confusion Matrix")

fpr, tpr, _ = roc_curve(y_test, y_proba)
axes[1].plot(fpr, tpr, color="#2C7BB6", lw=2,
             label=f"ROC (AUC = {auc:.3f})")
axes[1].plot([0,1],[0,1], "k--")
axes[1].set_xlabel("False Positive Rate")

axes[1].set_ylabel("True Positive Rate")

axes[1].set_title("ROC Curve — Logistic Regression")

axes[1].legend()

plt.tight_layout()
plt.savefig("logistic_regression.png", dpi=150, bbox_inches="tight")
plt.show()

Code

print("Logistic regression evaluation plots rendered.")

Logistic regression evaluation plots rendered.

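Interpretation: The logistic regression model discriminates well on held-out data: the R model classifies 85.17% of the 600 test cases correctly, and the Python model reaches 89.94% accuracy with an AUC of 0.962. The three strongest predictors of active employment are: (1) Mentoring: the coefficient is large, negative and highly significant (R: −7.49, odds ratio ≈ 0.0006; Python: −6.01, odds ratio ≈ 0.0024), so participants coded as mentored have much lower odds of active employment, all else equal. Mentoring is the strongest programme-controlled predictor, but the direction is the opposite of what the programme expects and may reflect how mentoring is coded or assigned (see Limitations) rather than a harmful effect. (2) Training Category: relative to Business Support (the reference category), every other track has a positive and significant coefficient in the R model. The Construction, Fashion and Hospitality estimates are implausibly large (odds ratios above 10^7), a pattern that often signals near-separation or collinearity between track and Duration_Days, so they should not be read as literal effect sizes. (3) Training duration: each additional day is associated with lower odds of active employment in the R model (odds ratio ≈ 0.74, p < 0.001). Tertiary education, gender and age are not statistically significant once the other variables are controlled for (p = 0.21, 0.28 and 0.97 respectively), meaning they should not be used as selection criteria on the strength of this model. The Python coefficients are smaller in magnitude partly because scikit-learn’s LogisticRegression applies L2 regularisation by default, whereas glm() fits an unpenalised model. Business action: Mentoring should be the first policy area addressed, as it is the strongest significant predictor that the programme fully controls; the direction of its effect should be confirmed before universal coverage is mandated.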
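The headline accuracy can be unpacked into class-level rates. The sketch below uses only the four counts printed in the R confusion matrix in this section; the variable names are illustrative.

```python
# Counts copied from the R confusion matrix (rows = actual, columns = predicted).
tn, fp = 275, 48   # actual inactive: predicted inactive / predicted active
fn, tp = 41, 236   # actual active:   predicted inactive / predicted active

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
sensitivity = tp / (tp + fn)   # share of active participants correctly identified
specificity = tn / (tn + fp)   # share of inactive participants correctly identified
precision = tp / (tp + fp)     # share of predicted-active cases that are truly active

print(f"n           = {total}")
print(f"accuracy    = {accuracy:.4f}")
print(f"sensitivity = {sensitivity:.4f}")
print(f"specificity = {specificity:.4f}")
print(f"precision   = {precision:.4f}")
```

Reporting sensitivity and specificity alongside accuracy matters for a selection tool, because the cost of wrongly screening out a participant who would have stayed employed is not the same as the cost of wrongly admitting one who would not.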

10. Integrated Findings

The five analyses conducted in this report converge on a coherent and actionable narrative. The ESP programme successfully certifies the vast majority of its participants (93%) and places most of them (85%), which represents strong operational performance at the training and initial placement stages. However, the active employment rate — the true measure of programme impact — stands at approximately 55% among those with a recorded employment status. This gap between placement and sustained employment is the central challenge, and the analyses collectively identify its drivers.

EDA established the baseline: the programme serves a young, predominantly female, largely tertiary-educated participant population distributed across Lagos LGAs, with training tracks ranging from Hospitality to Information Technology. Visualisation revealed that active employment rates vary significantly by training category, with IT and Business Support leading and Construction and Beauty lagging. Hypothesis testing confirmed statistically that training category, mentoring status, and educational level all have significant and non-random associations with employment status. Correlation analysis showed that mentoring is the variable most strongly correlated with active employment, although with a negative sign in the outputs, followed by placement and training duration, while age has a negligible relationship with the outcome. Logistic regression identified mentoring as the single strongest programme-controlled predictor, with training category and training duration also contributing significantly; tertiary education was not significant once the other variables were controlled for.

Integrated Recommendation: As Director of Programmes, I recommend three data-driven changes for the next batch cycle. First, universal mentoring coverage should be mandated, provided the direction of the mentoring effect is confirmed: mentoring status is the strongest programme-controlled predictor in every analysis, and the current 66% mentoring coverage leaves a third of participants without it. Second, training track investment should be rebalanced toward IT and Business Support, where active employment rates are highest and employer demand is demonstrably more durable. Third, a bridging support module should be designed specifically for secondary-educated participants to address the association between educational level and employment status identified in the hypothesis tests.

11. Limitations & Further Work

Sample limitations: The dataset covers a single programme (ESP) in a single state (Lagos) over a 17-month window. Findings may not generalise to other states, different programme structures, or post-2022 labour market conditions which may have been affected by macroeconomic shifts.

Missing employment status data: 350 participants (17%) have no recorded employment status, likely due to post-placement tracking attrition. If participants with missing employment status are systematically different from those with recorded status — for example, if harder-to-reach participants are disproportionately inactive — the true active employment rate may be lower than the 55% observed in the complete cases. Future programme iterations should invest in automated follow-up tools (SMS, USSD) to improve tracking coverage.
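The possible effect of the missing records can be bounded with simple arithmetic. The sketch below assumes the figures quoted in this report (2,006 enrolled, 350 with no recorded status, and an active rate of roughly 55% among the rest); because the 55% figure is approximate, the bounds are approximate too.

```python
enrolled = 2006                    # total participants across the four batches
missing = 350                      # participants with no recorded employment status
observed = enrolled - missing      # participants with a recorded status
active_rate_observed = 0.55        # approximate rate quoted in the text

active_observed = active_rate_observed * observed

# Worst case: every untracked participant is inactive.
lower_bound = active_observed / enrolled
# Best case: every untracked participant is active.
upper_bound = (active_observed + missing) / enrolled

print(f"Recorded statuses: {observed}")
print(f"Active rate bounds over all enrolled: {lower_bound:.1%} to {upper_bound:.1%}")
```

These are extreme-case bounds rather than estimates: they show how far the headline rate could move under the most pessimistic and most optimistic assumptions about untracked participants.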

Observational design: This is a cross-sectional observational dataset, not a randomised experiment. The association between mentoring and active employment could partly reflect selection effects: if programme staff assign mentors on the basis of perceived need, motivation or capability, the mentoring coefficient in the regression is a biased estimate of the causal effect of mentoring itself. A randomised mentoring allocation in a pilot batch would provide cleaner causal evidence.

Single outcome variable: Active vs. inactive employment is a binary measure that does not capture income level, job quality, career progression, or alignment between training received and job performed. Future M&E design should incorporate salary data, job-skill match ratings, and six-month and twelve-month follow-up points.

Further work: With more time and computing resources, a survival analysis (Kaplan-Meier or Cox proportional hazards) modelling time to inactivity would be more informative than a binary active/inactive snapshot. Additionally, a random forest or gradient boosting model could capture non-linear interactions between participant characteristics that logistic regression misses.

References

Adi, B. (2026). Data analytics for business decision-making. Lagos Business School Press.

Agresti, A. (2002). Categorical data analysis (2nd ed.). Wiley-Interscience.

Cleveland, W. S., & McGill, R. (1984). Graphical perception: Theory, experimentation, and application to the development of graphical methods. Journal of the American Statistical Association, 79(387), 531–554. https://doi.org/10.2307/2288400

Tukey, J. W. (1977). Exploratory data analysis. Addison-Wesley.

R packages used:

R Core Team. (2024). R: A language and environment for statistical computing (Version 4.x). R Foundation for Statistical Computing. https://www.R-project.org/

Wickham, H., Averick, M., Bryan, J., Chang, W., McGowan, L. D., François, R., Grolemund, G., Hayes, A., Henry, L., Hester, J., Kuhn, M., Pedersen, T. L., Miller, E., Bache, S. M., Müller, K., Ooms, J., Robinson, D., Seidel, D. P., Spinu, V., … Yutani, H. (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686. https://doi.org/10.21105/joss.01686

Wei, T., & Simko, V. (2021). corrplot: Visualization of a correlation matrix (R package version 0.92). https://github.com/taiyun/corrplot

Robin, X., Turck, N., Hainard, A., Tiberti, N., Lisacek, F., Sanchez, J.-C., & Müller, M. (2011). pROC: An open-source package for R and S+ to analyse and compare ROC curves. BMC Bioinformatics, 12(1), 77. https://doi.org/10.1186/1471-2105-12-77

Python packages used:

McKinney, W. (2010). Data structures for statistical computing in Python. Proceedings of the 9th Python in Science Conference, 445, 51–56.

Hunter, J. D. (2007). Matplotlib: A 2D graphics environment. Computing in Science & Engineering, 9(3), 90–95. https://doi.org/10.1109/MCSE.2007.55

Waskom, M. L. (2021). Seaborn: Statistical data visualization. Journal of Open Source Software, 6(60), 3021. https://doi.org/10.21105/joss.03021

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., & Duchesnay, E. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12, 2825–2830.

Virtanen, P., Gommers, R., Oliphant, T. E., Haberland, M., Reddy, T., Cournapeau, D., Burovski, E., Peterson, P., Weckesser, W., Bright, J., van der Walt, S., Brett, M., Wilson, J., Millman, K. J., Mayorov, N., Nelson, A. R. J., Jones, E., Kern, R., Larson, E., … van Mulbregt, P. (2020). SciPy 1.0: Fundamental algorithms for scientific computing in Python. Nature Methods, 17, 261–272. https://doi.org/10.1038/s41592-019-0686-2

Dataset:

Bisola Dere. (2026). State Employment Skills Programme (ESP) monitoring and evaluation database, Batches 1–4 [Primary dataset]. [PaTiTi Consulting].

Appendix: AI Usage Statement

Claude (Anthropic) was used in the preparation of this document for the following purposes: generating initial Quarto document structure and YAML configuration; suggesting appropriate R and Python package combinations for each analytical technique; and debugging rendering issues in the panel-tabset layout. GitHub Copilot was used for autocomplete assistance during code writing, particularly for ggplot2 and seaborn syntax.

Independent analytical judgement was exercised in all of the following: the selection of the research question and outcome variable; the decision to use logistic regression rather than a more complex model given the 10-day timeline and interpretability requirements; the choice of Business Support as the reference category in the regression; the identification and documentation of the missing employment status data quality issue; the interpretation of all statistical outputs in the context of programme operations; and the formulation of the three integrated recommendations. All business interpretations are the author’s own and reflect direct professional knowledge of the ESP programme.

seealso:: Refer to the :ref:`User Guide ` for more information regarding :class:`LogisticRegression` and more specifically the :ref:`Table ` summarizing solver/penalty supports. .. versionadded:: 0.17 Stochastic Average Gradient (SAG) descent solver. Multinomial support in version 0.18. .. versionadded:: 0.19 SAGA solver. .. versionchanged:: 0.22 The default solver changed from 'liblinear' to 'lbfgs' in 0.22. .. versionadded:: 1.2 newton-cholesky solver. Multinomial support in version 1.6.	'lbfgs'
	max_iter max_iter: int, default=100 Maximum number of iterations taken for the solvers to converge.	1000
	verbose verbose: int, default=0 For the liblinear and lbfgs solvers set verbose to any positive number for verbosity.	0
	warm_start warm_start: bool, default=False When set to True, reuse the solution of the previous call to fit as initialization, otherwise, just erase the previous solution. Useless for liblinear solver. See :term:`the Glossary `. .. versionadded:: 0.17 warm_start to support lbfgs, newton-cg, sag, saga solvers.	False
	n_jobs n_jobs: int, default=None Does not have any effect. .. deprecated:: 1.8 `n_jobs` is deprecated in version 1.8 and will be removed in 1.10.	None
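For reference, the settings listed above differ from the scikit-learn defaults in only two places: `max_iter=1000` and `random_state=42`. The following is a minimal sketch of how such a model could be configured and checked against the reported values. The synthetic data from `make_classification` is an assumption used purely for illustration and stands in for the programme dataset; the `reported` dictionary is a hypothetical helper that copies values from the table.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (an assumption; this is NOT the ESP dataset).
X, y = make_classification(n_samples=200, n_features=5, random_state=42)

# Pass only the two settings that differ from the library defaults.
model = LogisticRegression(max_iter=1000, random_state=42)
model.fit(X, y)

# Values copied from the parameter table (hypothetical helper dict).
reported = {
    "dual": False,
    "tol": 0.0001,
    "fit_intercept": True,
    "intercept_scaling": 1,
    "class_weight": None,
    "random_state": 42,
    "solver": "lbfgs",
    "max_iter": 1000,
    "verbose": 0,
    "warm_start": False,
}

# Compare the fitted estimator's parameters with the reported values.
params = model.get_params()
mismatches = {
    name: (params[name], value)
    for name, value in reported.items()
    if params[name] != value
}
print("Mismatched parameters:", mismatches)
```

`l1_ratio` and `n_jobs` are left out of the comparison because their defaults and deprecation status vary between scikit-learn versions.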