Predicting Active Employment Outcomes in UpSkilling Programmes

Author

B. Dere

Published

May 20, 2026


1. Executive Summary

This study analyses data from various vocational training and apprenticeship initiatives that enrolled approximately 1,000 participants across multiple cohorts between 2021 and 2024. The central business question is: which participant profiles and training categories are most likely to result in active employment, and how can this knowledge make programme design, trainer allocation, and apprenticeship partnership decisions more data-driven?

Data were extracted from a pooled anonymised programme database covering 20 variables including demographic characteristics, training track, career coaching status, certification outcome, apprenticeship status, and final employment status. Six training categories were covered: Business Support Services (BSS), Information and Communications Technology (ICT), Beauty (BTY), Construction (CTN), Fashion (FSN), and Hospitality (HSP). Five analytical techniques were applied: Exploratory Data Analysis (EDA), Visualisation, Hypothesis Testing, Correlation Analysis, and Logistic Regression. Key findings reveal that training category and educational level are the strongest structural predictors of active employment, while career coaching assignment reflects a selection effect that warrants further causal investigation. Participants in ICT and BSS tracks and those with tertiary education consistently show higher rates of active employment. The integrated recommendation is that future cohorts should invest in ICT and BSS tracks and design a randomised career coaching pilot to establish true causal effect.


2. Professional Disclosure

Job Title: Director of Programmes
Organisation/Sector: PC LTD | Skills Development / Social Impact

Exploratory Data Analysis (EDA): As Director of Programmes, understanding who is enrolling in the programme before making design decisions is foundational. EDA allows me to monitor whether the programme is reaching its intended demographic — young, low-income residents within the state — and to detect shifts in participant profiles across cohorts that would require a programmatic response. Without this diagnostic layer, decisions on recruitment targeting and resource allocation are based on assumption rather than evidence.

Visualisation: Reporting to funders, government partners, and the board requires communicating programme performance clearly and efficiently. Visualisation of apprenticeship rates by training category, gender, and location gives me a dashboard-ready view that directly informs which tracks to scale, discontinue, or redesign in future cohorts. It also enables me to present evidence to partners in a form that drives decisions rather than simply informs them.

Hypothesis Testing: A recurring operational question I face is whether observed differences in outcomes — for example, whether coached participants achieve employment at higher rates than uncoached ones, or whether one training category outperforms another — are statistically real or simply sampling noise. Hypothesis testing provides the rigorous basis needed to act on those differences, justify resource reallocation, and defend programmatic changes to stakeholders.

Correlation Analysis: As Director, I allocate career coaching resources and decide which participant segments receive intensive support. Correlation analysis between age, educational level, training duration, and employment outcomes allows me to test whether those targeting assumptions are evidence-based or inherited from previous programme designs. Understanding which variables move together informs more precise targeting in future cohorts.

Logistic Regression: The most consequential decision I make each cycle is which applicant profiles to prioritise during recruitment. A logistic regression model identifying the strongest predictors of active employment provides a defensible, data-driven scoring framework for future selection decisions — one that can be presented to government partners and donors as evidence of systematic programme improvement.


3. Data Collection & Sampling

Source: UpSkilling Programmes pooled anonymised monitoring and evaluation (M&E) database, maintained by the Directorate of Programmes.

Collection Method: The dataset was extracted from a pooled anonymised database spanning multiple skills development programme cohorts delivered across a state within Nigeria. No third-party data was used.

Sampling Frame: Participants registered across multiple programme cohorts between 2021 and 2024. A stratified random sub-sample of 1,000 participants was drawn to preserve the Active/Inactive employment ratio of the full population while protecting the identity of any individual cohort or project.

Sample Size: 1,000 observations (participants) across 20 variables.

Time Period: January 2021 to December 2024 (approximately 4 years).

Variables: The dataset includes career coaching status, enrollment ID, qualification, educational level, age, age bracket, location code, marital status, gender, training category, training sub-category, programme start and end dates, training status, apprenticeship status, reason if declined, date of employment, employment type, employment status, and job formality classification.

Training Category Codes: BSS = Business Support Services; ICT = Information and Communications Technology; BTY = Beauty; CTN = Construction; FSN = Fashion; HSP = Hospitality.

Training Sub-Category Codes: ABG = Agent Banking; BC = Bakery and Confectioneries; FSN = Fashion; DGM = Digital Marketing; ELC = Electrical; PLM = Plumbing; CSM = Cosmetology; HDR = Hair Dressing; TLG = Tiling; CSR = Customer Service; MSN = Masonry; S&M = Sales and Marketing.

Ethical & Consent Notes: The dataset used in this analysis is a randomly drawn sub-sample from a pooled anonymised database spanning multiple skills development programme cohorts delivered across a Nigerian state between 2021 and 2024. All personally identifiable information — including names, national identification numbers, phone numbers, and employer details — has been removed prior to extraction. Enrollment IDs are synthetic identifiers assigned for analytical purposes only and do not correspond to any original programme record. Geographic identifiers have been replaced with randomised location codes (e.g. LGA 14, LGA 80, LGA 144) that do not correspond to any named administrative unit and do not reveal the state or region of programme delivery. The pooled dataset design means that no single project, cohort, or organisation can be identified from the analytical extract. This analysis has been conducted in accordance with the Nigerian Data Protection Act 2023 (NDPA) and the data governance principles of the originating organisation.

Data Quality Issues Identified: 1. Missing Employment Status: Participants with no recorded employment status were excluded from the analytical sample. These are predominantly participants still within the programme cycle at the time of data extraction or those who became unreachable during follow-up. 2. Age Outlier: One participant is recorded as age 1, which is clearly a data entry error. This record is excluded from age-related analyses.


4. Data Description

Code
# Install any missing packages silently
pkgs <- c("tidyverse","skimr","knitr","kableExtra","lubridate",
          "ggplot2","scales","vcd","corrplot","pROC")
new_pkgs <- pkgs[!pkgs %in% installed.packages()[,"Package"]]
if (length(new_pkgs)) install.packages(new_pkgs, quiet = TRUE)

# Load libraries
library(tidyverse)
library(skimr)
library(knitr)
library(kableExtra)
library(lubridate)

# Load data
df <- read.csv("Data_R_fixed.csv", stringsAsFactors = FALSE)

# Fix: replace blank Employment_Status with NA
df$Employment_Status[df$Employment_Status == ""] <- NA

# Parse dates
df$Start_Date    <- as.Date(parse_date_time(df$Start_Date,
                    orders = c("ymd", "dmy", "mdy", "ymd HMS")))
df$End_Date      <- as.Date(parse_date_time(df$End_Date,
                    orders = c("ymd", "dmy", "mdy", "ymd HMS")))
df$Duration_Days <- as.numeric(df$End_Date - df$Start_Date)

# Remove obvious age error
df <- df %>% filter(is.na(Age) | Age > 5)

# Summary table of key numeric variables
df %>%
  select(Age, Duration_Days) %>%
  summary() %>%
  kable(caption = "Summary Statistics — Numeric Variables") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Summary Statistics — Numeric Variables
Age Duration_Days
Min. :18.00 Min. : 18.00
1st Qu.:25.00 1st Qu.: 28.00
Median :30.00 Median : 56.00
Mean :30.14 Mean : 53.98
3rd Qu.:35.00 3rd Qu.: 84.00
Max. :66.00 Max. :109.00
NA's :1 NA
Code
# Categorical summaries
cat_summary <- data.frame(
  Variable = c("Gender", "Educational Level", "Training Status",
               "Apprenticeship Status", "Employment Status", "Career Coaching"),
  Categories = c(
    paste(names(table(df$Gender)),                 collapse = " | "),
    paste(names(table(df$Educational_Level)),       collapse = " | "),
    paste(names(table(df$Training_Status)),         collapse = " | "),
    paste(names(table(df$Apprenticeship_Status)),   collapse = " | "),
    paste(names(table(df$Employment_Status)),       collapse = " | "),
    paste(names(table(df$Career_Coaching)),         collapse = " | ")
  ),
  N_Valid = c(
    sum(!is.na(df$Gender)),
    sum(!is.na(df$Educational_Level)),
    sum(!is.na(df$Training_Status)),
    sum(!is.na(df$Apprenticeship_Status)),
    sum(!is.na(df$Employment_Status)),
    sum(!is.na(df$Career_Coaching))
  )
)

cat_summary %>%
  kable(caption = "Categorical Variable Overview") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Categorical Variable Overview
Variable Categories N_Valid
Gender Female &#124; Male 999
Educational Level Secondary &#124; Tertiary 999
Training Status Certified &#124; Not Certified 999
Apprenticeship Status Matched &#124; Self-Directed &#124; Unmatched 999
Employment Status Active &#124; Inactive 999
Career Coaching Coached &#124; Not Coached 999
Code
import pandas as pd
import numpy as np

df = pd.read_csv("Data_R_fixed.csv", encoding="latin1")

# Fix blank Employment_Status
df["Employment_Status"] = df["Employment_Status"].replace("", float("nan"))

# Parse dates
df["Start_Date"]    = pd.to_datetime(df["Start_Date"], errors="coerce")
df["End_Date"]      = pd.to_datetime(df["End_Date"],   errors="coerce")
df["Duration_Days"] = (df["End_Date"] - df["Start_Date"]).dt.days

# Remove age outlier
df = df[(df["Age"].isna()) | (df["Age"] > 5)]

# Numeric summary
numeric_summary = df[["Age","Duration_Days"]].describe().round(2)
print("=== Numeric Variable Summary ===")
=== Numeric Variable Summary ===
Code
print(numeric_summary.to_string())
          Age  Duration_Days
count  998.00         876.00
mean    30.14          52.83
std      6.90          28.36
min     18.00          18.00
25%     25.00          27.00
50%     30.00          56.00
75%     35.00          85.00
max     66.00         109.00
Code
# Categorical summary
print("\n=== Categorical Variable Counts ===")

=== Categorical Variable Counts ===
Code
for col in ["Gender","Educational_Level","Training_Status","Apprenticeship_Status",
            "Employment_Status","Career_Coaching","Training_Category","Training_SubCategory"]:
    print(f"\n{col}:")
    print(df[col].value_counts(dropna=False).to_string())

Gender:
Gender
Female    606
Male      393

Educational_Level:
Educational_Level
Tertiary     653
Secondary    346

Training_Status:
Training_Status
Certified        987
Not Certified     12

Apprenticeship_Status:
Apprenticeship_Status
Matched          996
Self-Directed      2
Unmatched          1

Employment_Status:
Employment_Status
Active      558
Inactive    441

Career_Coaching:
Career_Coaching
Coached        648
Not Coached    351

Training_Category:
Training_Category
BSS    231
CTN    180
FSN    179
HSP    179
ICT    138
BTY     92

Training_SubCategory:
Training_SubCategory
ABG    201
FSN    179
BC     179
DGM    138
ELC     83
PLM     49
CSM     48
HDR     44
TLG     34
CSR     18
MSN     14
S&M     12

Interpretation: The dataset covers approximately 1,000 participants with a mean age of approximately 30 years, reflecting a young workforce target population. Female participants (approximately 61%) outnumber male participants (39%). The majority hold tertiary qualifications and over 90% achieved certification. Approximately 56% of participants in the sample are actively employed at the time of data extraction, pointing to a significant post-apprenticeship retention or engagement challenge that this analysis seeks to understand. The six training categories — BSS, ICT, BTY, CTN, FSN, and HSP — span a broad range of vocational skills, with BSS being the largest track by enrolment.


5. Analysis — Technique 1: Exploratory Data Analysis

Theory: Exploratory Data Analysis (EDA) is the practice of summarising a dataset’s main characteristics, often using statistical and visual methods, before formal modelling. It is grounded in the tradition established by Tukey (1977) and is used to detect patterns, anomalies, and relationships that guide subsequent analysis. EDA is particularly important for administrative datasets where data quality issues are common and assumptions about the population may not hold.

Business Justification: As Director of Programmes, EDA provides the foundation for all downstream decisions. Before allocating resources, adjusting training tracks, or redesigning recruitment, I need to know who is actually in the programme, how they are distributed across demographic categories, and where the data has gaps. EDA surfaces these facts systematically rather than relying on anecdotal reports from field staff.

Code
library(ggplot2)
library(scales)

# Age distribution
p1 <- ggplot(df %>% filter(!is.na(Age)), aes(x = Age)) +
  geom_histogram(binwidth = 5, fill = "#2C7BB6", colour = "white", alpha = 0.85) +
  geom_vline(aes(xintercept = mean(Age, na.rm = TRUE)),
             colour = "#D7191C", linetype = "dashed", linewidth = 1) +
  labs(title = "Age Distribution of USP Participants",
       subtitle = "Dashed line = mean age (≈30 years)",
       x = "Age (years)", y = "Count") +
  theme_minimal(base_size = 13)
print(p1)

Code
# Apprenticeship status distribution
app_counts <- df %>%
  count(Apprenticeship_Status) %>%
  mutate(pct = n / sum(n) * 100)

p2 <- ggplot(app_counts,
             aes(x = reorder(Apprenticeship_Status, -n), y = pct,
                 fill = Apprenticeship_Status)) +
  geom_col(show.legend = FALSE) +
  geom_text(aes(label = paste0(round(pct, 1), "%")), vjust = -0.5, size = 4) +
  labs(title = "Apprenticeship Status of USP Participants",
       x = "Apprenticeship Status", y = "Percentage (%)") +
  theme_minimal(base_size = 13)
print(p2)

Code
# Employment status breakdown
emp_counts <- df %>%
  filter(!is.na(Employment_Status)) %>%
  count(Employment_Status) %>%
  mutate(pct = n / sum(n) * 100)

p3 <- ggplot(emp_counts, aes(x = Employment_Status, y = pct, fill = Employment_Status)) +
  geom_col(show.legend = FALSE) +
  scale_fill_manual(values = c("Active" = "#1A9641", "Inactive" = "#D7191C")) +
  geom_text(aes(label = paste0(round(pct, 1), "%")), vjust = -0.5, size = 4) +
  labs(title = "Employment Status Among USP Participants",
       subtitle = "Participants with no recorded employment status excluded",
       x = "Employment Status", y = "Percentage (%)") +
  theme_minimal(base_size = 13)
print(p3)

Code
# Training category distribution
cat_counts <- df %>%
  count(Training_Category) %>%
  mutate(pct = n / sum(n) * 100)

p4 <- ggplot(cat_counts, aes(x = reorder(Training_Category, pct), y = pct, fill = pct)) +
  geom_col(show.legend = FALSE) +
  geom_text(aes(label = paste0(round(pct, 1), "%")), hjust = -0.1, size = 4) +
  coord_flip() +
  scale_fill_gradient(low = "#FDAE61", high = "#2C7BB6") +
  labs(title = "Participant Distribution by Training Category",
       subtitle = "BSS=Business Support Services, ICT=Information & Comms Tech,\nBTY=Beauty, CTN=Construction, FSN=Fashion, HSP=Hospitality",
       x = "Training Category", y = "Percentage (%)") +
  theme_minimal(base_size = 13) +
  ylim(0, 35)
print(p4)

Code
import matplotlib.pyplot as plt

fig, axes = plt.subplots(2, 2, figsize=(14, 10))
fig.suptitle("EDA — USP Programme Overview", fontsize=15, fontweight="bold", y=1.01)

# Age distribution
ax = axes[0, 0]
age_clean = df["Age"].dropna()
ax.hist(age_clean, bins=15, color="#2C7BB6", edgecolor="white", alpha=0.85)
ax.axvline(age_clean.mean(), color="#D7191C", linestyle="--", linewidth=1.5)
ax.set_title("Age Distribution")
ax.set_xlabel("Age (years)")
ax.set_ylabel("Count")

# Training Category distribution
ax = axes[0, 1]
cat_counts = df["Training_Category"].value_counts()
cat_pct = cat_counts / cat_counts.sum() * 100
bars = ax.barh(cat_pct.index, cat_pct.values, color="#2C7BB6")
for bar, val in zip(bars, cat_pct.values):
    ax.text(val + 0.3, bar.get_y() + bar.get_height() / 2,
            f"{val:.1f}%", va="center", fontsize=9)
ax.set_title("Training Category Distribution\n(BSS/ICT/BTY/CTN/FSN/HSP)")
ax.set_xlabel("Percentage (%)")

# Apprenticeship Status
ax = axes[1, 0]
pl_counts = df["Apprenticeship_Status"].value_counts()
pl_pct    = pl_counts / pl_counts.sum() * 100
bars = ax.bar(pl_pct.index, pl_pct.values, color="#2C7BB6")
for bar, val in zip(bars, pl_pct.values):
    ax.text(bar.get_x() + bar.get_width() / 2, bar.get_height() + 0.5,
            f"{val:.1f}%", ha="center", fontsize=9)
ax.set_title("Apprenticeship Status (%)")
ax.set_xlabel("Status")
ax.set_ylabel("Percentage")
ax.tick_params(axis="x", rotation=15)

# Employment Status
ax = axes[1, 1]
emp     = df["Employment_Status"].dropna().value_counts()
emp_pct = emp / emp.sum() * 100
bars = ax.bar(emp_pct.index, emp_pct.values,
              color=["#1A9641" if x == "Active" else "#D7191C"
                     for x in emp_pct.index])
for bar, val in zip(bars, emp_pct.values):
    ax.text(bar.get_x() + bar.get_width() / 2, bar.get_height() + 0.5,
            f"{val:.1f}%", ha="center", fontsize=9)
ax.set_title("Employment Status (Recorded Only)")
ax.set_xlabel("Status")
ax.set_ylabel("Percentage")

plt.tight_layout()
plt.savefig("eda_overview.png", dpi=150, bbox_inches="tight")
plt.show()

Code
print("EDA plots rendered.")
EDA plots rendered.

Interpretation: The EDA reveals four critical programme facts. First, the participant population is young (mean age ≈ 30 years), consistent with the programme’s youth employment mandate, though ages range up to 66, suggesting some adult re-skilling participation. Second, certification rates are high (above 90%), indicating that training delivery is effective at getting participants to completion. Third, the apprenticeship infrastructure is functioning well — the more pressing issue is retention: among those with a recorded employment status, approximately 56% are active and 44% are inactive. This active employment gap is the central problem this analysis seeks to explain. BSS is the largest training track by enrolment, followed by CTN and FSN.


6. Analysis — Technique 2: Visualisation

Theory: Data visualisation translates analytical findings into perceptual representations that allow patterns to be understood more rapidly than through tables alone. Effective visualisation for programme analytics follows the principle of encoding the most important comparison as the primary visual channel (Cleveland & McGill, 1984). For categorical outcome data such as employment status, bar charts with percentage encoding are the most reliable form.

Business Justification: As Director of Programmes, visualisation is the primary tool for communicating performance to funders, government partners, and the board. The plots below are designed to answer the most frequently asked questions in programme review meetings: which training track performs best, does gender affect outcomes, does education level matter, and which locations are underperforming?

Code
# Active employment rate by Training Category
active_cat <- df %>%
  filter(!is.na(Employment_Status)) %>%
  group_by(Training_Category) %>%
  summarise(
    Total  = n(),
    Active = sum(Employment_Status == "Active"),
    Rate   = Active / Total * 100,
    .groups = "drop"
  ) %>%
  arrange(desc(Rate))

p1 <- ggplot(active_cat,
             aes(x = reorder(Training_Category, Rate), y = Rate, fill = Rate)) +
  geom_col() +
  geom_text(aes(label = paste0(round(Rate, 1), "%")), hjust = -0.1, size = 4) +
  coord_flip() +
  scale_fill_gradient(low = "#FDAE61", high = "#1A9641") +
  labs(title = "Active Employment Rate by Training Category",
       subtitle = "BSS=Business Support Services  ICT=Info & Comms Tech\nBTY=Beauty  CTN=Construction  FSN=Fashion  HSP=Hospitality",
       x = "Training Category", y = "Active Employment Rate (%)",
       fill = "Rate (%)") +
  theme_minimal(base_size = 12) +
  ylim(0, 80)
print(p1)

Code
# Active employment rate by Training Sub-Category
active_subcat <- df %>%
  filter(!is.na(Employment_Status)) %>%
  group_by(Training_SubCategory) %>%
  summarise(
    Total  = n(),
    Active = sum(Employment_Status == "Active"),
    Rate   = Active / Total * 100,
    .groups = "drop"
  ) %>%
  arrange(desc(Rate))

p2 <- ggplot(active_subcat,
             aes(x = reorder(Training_SubCategory, Rate), y = Rate, fill = Rate)) +
  geom_col() +
  geom_text(aes(label = paste0(round(Rate, 1), "%")), hjust = -0.1, size = 3.5) +
  coord_flip() +
  scale_fill_gradient(low = "#FDAE61", high = "#1A9641") +
  labs(title = "Active Employment Rate by Training Sub-Category",
       x = "Sub-Category", y = "Active Employment Rate (%)", fill = "Rate (%)") +
  theme_minimal(base_size = 12) +
  ylim(0, 100)
print(p2)

Code
# Active employment rate by Gender
active_gender <- df %>%
  filter(!is.na(Employment_Status)) %>%
  group_by(Gender) %>%
  summarise(Rate = mean(Employment_Status == "Active") * 100, .groups = "drop")

p3 <- ggplot(active_gender, aes(x = Gender, y = Rate, fill = Gender)) +
  geom_col(show.legend = FALSE, width = 0.5) +
  geom_text(aes(label = paste0(round(Rate, 1), "%")), vjust = -0.5, size = 4.5) +
  scale_fill_manual(values = c("Female" = "#E66101", "Male" = "#2C7BB6")) +
  labs(title = "Active Employment Rate by Gender",
       x = "Gender", y = "Active Employment Rate (%)") +
  theme_minimal(base_size = 12) +
  ylim(0, 75)
print(p3)

Code
# Active employment rate by Educational Level
active_edu <- df %>%
  filter(!is.na(Employment_Status)) %>%
  group_by(Educational_Level) %>%
  summarise(Rate = mean(Employment_Status == "Active") * 100, .groups = "drop")

p4 <- ggplot(active_edu, aes(x = Educational_Level, y = Rate, fill = Educational_Level)) +
  geom_col(show.legend = FALSE, width = 0.5) +
  geom_text(aes(label = paste0(round(Rate, 1), "%")), vjust = -0.5, size = 4.5) +
  scale_fill_manual(values = c("Secondary" = "#FDAE61", "Tertiary" = "#1A9641")) +
  labs(title = "Active Employment Rate by Educational Level",
       x = "Educational Level", y = "Active Employment Rate (%)") +
  theme_minimal(base_size = 12) +
  ylim(0, 75)
print(p4)

Code
# Top 10 locations by active employment rate
active_lga <- df %>%
  filter(!is.na(Employment_Status)) %>%
  group_by(LGA) %>%
  summarise(
    Total = n(),
    Rate  = mean(Employment_Status == "Active") * 100,
    .groups = "drop"
  ) %>%
  filter(Total >= 15) %>%
  arrange(desc(Rate)) %>%
  slice_head(n = 10)

p5 <- ggplot(active_lga, aes(x = reorder(LGA, Rate), y = Rate, fill = Rate)) +
  geom_col() +
  geom_text(aes(label = paste0(round(Rate, 1), "%")), hjust = -0.1, size = 3.5) +
  coord_flip() +
  scale_fill_gradient(low = "#FDAE61", high = "#1A9641") +
  labs(title = "Top 10 Locations by Active Employment Rate",
       subtitle = "Locations with ≥15 participants shown",
       x = "Location Code", y = "Active Employment Rate (%)", fill = "Rate (%)") +
  theme_minimal(base_size = 12) +
  ylim(0, 80)
print(p5)

Code
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import numpy as np

df_emp = df[df["Employment_Status"].notna()].copy()
df_emp["Active_bin"] = (df_emp["Employment_Status"] == "Active").astype(int)

fig, axes = plt.subplots(3, 2, figsize=(14, 18))
fig.suptitle("Visualisation — Employment Outcomes by Key Dimensions",
             fontsize=15, fontweight="bold")

# 1. By Training Category
ax = axes[0, 0]
cat_rate = df_emp.groupby("Training_Category")["Active_bin"].mean().sort_values() * 100
colors = cm.RdYlGn(np.linspace(0.2, 0.85, len(cat_rate)))
bars = ax.barh(cat_rate.index, cat_rate.values, color=colors)
for bar, val in zip(bars, cat_rate.values):
    ax.text(val + 0.5, bar.get_y() + bar.get_height() / 2,
            f"{val:.1f}%", va="center", fontsize=9)
ax.set_title("Active Employment Rate\nby Training Category")
ax.set_xlabel("Active Employment Rate (%)")
ax.set_xlim(0, 80)
(0.0, 80.0)
Code
# 2. By Training Sub-Category
ax = axes[0, 1]
sub_rate = df_emp.groupby("Training_SubCategory")["Active_bin"].mean().sort_values() * 100
colors2 = cm.RdYlGn(np.linspace(0.2, 0.85, len(sub_rate)))
bars = ax.barh(sub_rate.index, sub_rate.values, color=colors2)
for bar, val in zip(bars, sub_rate.values):
    ax.text(val + 0.5, bar.get_y() + bar.get_height() / 2,
            f"{val:.1f}%", va="center", fontsize=8)
ax.set_title("Active Employment Rate\nby Training Sub-Category")
ax.set_xlabel("Active Employment Rate (%)")
ax.set_xlim(0, 100)
(0.0, 100.0)
Code
# 3. By Gender
ax = axes[1, 0]
gen_rate = df_emp.groupby("Gender")["Active_bin"].mean() * 100
bars = ax.bar(gen_rate.index, gen_rate.values,
              color=["#E66101", "#2C7BB6"], width=0.4)
for bar, val in zip(bars, gen_rate.values):
    ax.text(bar.get_x() + bar.get_width() / 2, bar.get_height() + 0.5,
            f"{val:.1f}%", ha="center", fontsize=10)
ax.set_title("Active Employment Rate\nby Gender")
ax.set_ylabel("Rate (%)")
ax.set_ylim(0, 75)
(0.0, 75.0)
Code
# 4. By Educational Level
ax = axes[1, 1]
edu_rate = df_emp.groupby("Educational_Level")["Active_bin"].mean() * 100
bars = ax.bar(edu_rate.index, edu_rate.values,
              color=["#FDAE61", "#1A9641"], width=0.4)
for bar, val in zip(bars, edu_rate.values):
    ax.text(bar.get_x() + bar.get_width() / 2, bar.get_height() + 0.5,
            f"{val:.1f}%", ha="center", fontsize=10)
ax.set_title("Active Employment Rate\nby Educational Level")
ax.set_ylabel("Rate (%)")
ax.set_ylim(0, 75)
(0.0, 75.0)
Code
# 5. Top Locations
ax = axes[2, 0]
lga_rate = (df_emp.groupby("LGA")
            .agg(Total=("Active_bin", "count"), Rate=("Active_bin", "mean"))
            .query("Total >= 15"))
lga_rate["Rate"] = lga_rate["Rate"] * 100
lga_top = lga_rate.sort_values("Rate", ascending=True).tail(10)
colors3 = cm.RdYlGn(np.linspace(0.2, 0.85, len(lga_top)))
bars = ax.barh(lga_top.index, lga_top["Rate"], color=colors3)
for bar, val in zip(bars, lga_top["Rate"]):
    ax.text(val + 0.5, bar.get_y() + bar.get_height() / 2,
            f"{val:.1f}%", va="center", fontsize=8)
ax.set_title("Top 10 Locations by Active\nEmployment Rate (n≥15)")
ax.set_xlabel("Rate (%)")
ax.set_xlim(0, 80)
(0.0, 80.0)
Code
# 6. By Career Coaching
ax = axes[2, 1]
coach_rate = df_emp.groupby("Career_Coaching")["Active_bin"].mean() * 100
bars = ax.bar(coach_rate.index, coach_rate.values,
              color=["#1A9641", "#D7191C"], width=0.4)
for bar, val in zip(bars, coach_rate.values):
    ax.text(bar.get_x() + bar.get_width() / 2, bar.get_height() + 0.5,
            f"{val:.1f}%", ha="center", fontsize=10)
ax.set_title("Active Employment Rate\nby Career Coaching Status")
ax.set_ylabel("Rate (%)")
ax.set_ylim(0, 75)
(0.0, 75.0)
Code
plt.tight_layout()
plt.savefig("visualisation.png", dpi=150, bbox_inches="tight")
plt.show()

Code
print("Visualisation plots rendered.")
Visualisation plots rendered.

Interpretation: Four patterns emerge with direct programmatic implications. First, ICT and BSS participants achieve the highest active employment rates, while CTN and BTY trail significantly — suggesting a structural mismatch between apprenticeships in those sectors and sustained employer demand. Second, gender differences in employment rates are present but modest, indicating that the programme broadly serves both groups without major disparity. Third, tertiary-educated participants achieve materially higher active employment rates than secondary-educated ones — a gap that additional bridging support could address. Fourth, location-level variation is substantial, pointing to geographic factors (transport, network access, employer concentration) that programme design should account for.


7. Analysis — Technique 3: Hypothesis Testing

Theory: Hypothesis testing provides a formal statistical framework for determining whether observed differences between groups are likely to reflect true population differences or could plausibly arise from random sampling variation. For categorical variables (such as employment status vs. training category or career coaching status), the Chi-square test of independence is appropriate. It tests whether two categorical variables are statistically independent (Agresti, 2002). Where the Chi-square assumption of expected cell counts ≥ 5 is met, the p-value indicates the probability of observing the data if the null hypothesis (independence) were true. Effect size is measured using Cramér’s V, which ranges from 0 (no association) to 1 (perfect association).

Business Justification: Observed differences in apprenticeship and employment rates across training categories, gender, and career coaching status could simply be due to cohort composition. As Director of Programmes, I need statistical confirmation before reallocating training budgets or restructuring the career coaching programme. Hypothesis testing provides that confirmation.

Code
library(vcd)

df_test <- df %>% filter(!is.na(Employment_Status))

# --- Hypothesis 1: Training Category vs Employment Status ---
cat("=== HYPOTHESIS 1 ===\n")
=== HYPOTHESIS 1 ===
Code
cat("H0: Training category and employment status are independent\n")
H0: Training category and employment status are independent
Code
cat("H1: Training category significantly affects employment status\n\n")
H1: Training category significantly affects employment status
Code
tbl1 <- table(df_test$Training_Category, df_test$Employment_Status)
chi1 <- chisq.test(tbl1)
print(chi1)

    Pearson's Chi-squared test

data:  tbl1
X-squared = 119.97, df = 5, p-value < 2.2e-16
Code
v1 <- sqrt(chi1$statistic / (sum(tbl1) * (min(dim(tbl1)) - 1)))
cat(paste0("Cramer's V = ", round(v1, 4), "\n"))
Cramer's V = 0.3465
Code
# --- Hypothesis 2: Career Coaching vs Employment Status ---
cat("\n=== HYPOTHESIS 2 ===\n")

=== HYPOTHESIS 2 ===
Code
cat("H0: Career coaching status and employment status are independent\n")
H0: Career coaching status and employment status are independent
Code
cat("H1: Coached participants have significantly different employment rates\n\n")
H1: Coached participants have significantly different employment rates
Code
tbl2 <- table(df_test$Career_Coaching, df_test$Employment_Status)
chi2 <- chisq.test(tbl2)
print(chi2)

    Pearson's Chi-squared test with Yates' continuity correction

data:  tbl2
X-squared = 403.18, df = 1, p-value < 2.2e-16
Code
v2 <- sqrt(chi2$statistic / (sum(tbl2) * (min(dim(tbl2)) - 1)))
cat(paste0("Cramer's V = ", round(v2, 4), "\n"))
Cramer's V = 0.6353
Code
# --- Hypothesis 3: Educational Level vs Employment Status ---
cat("\n=== HYPOTHESIS 3 ===\n")

=== HYPOTHESIS 3 ===
Code
cat("H0: Educational level and employment status are independent\n")
H0: Educational level and employment status are independent
Code
cat("H1: Educational level significantly affects employment status\n\n")
H1: Educational level significantly affects employment status
Code
tbl3 <- table(df_test$Educational_Level, df_test$Employment_Status)
chi3 <- chisq.test(tbl3)
print(chi3)

    Pearson's Chi-squared test with Yates' continuity correction

data:  tbl3
X-squared = 15.33, df = 1, p-value = 9.026e-05
Code
v3 <- sqrt(chi3$statistic / (sum(tbl3) * (min(dim(tbl3)) - 1)))
cat(paste0("Cramer's V = ", round(v3, 4), "\n"))
Cramer's V = 0.1239
Code
# --- Summary Table ---
results_tbl <- data.frame(
  Hypothesis = c("Training Category vs Employment Status",
                 "Career Coaching vs Employment Status",
                 "Educational Level vs Employment Status"),
  Chi_sq     = round(c(chi1$statistic, chi2$statistic, chi3$statistic), 3),
  df         = c(chi1$parameter, chi2$parameter, chi3$parameter),
  p_value    = round(c(chi1$p.value, chi2$p.value, chi3$p.value), 4),
  Cramers_V  = round(c(v1, v2, v3), 4),
  Decision   = ifelse(c(chi1$p.value, chi2$p.value, chi3$p.value) < 0.05,
                      "Reject H0", "Fail to Reject H0")
)

results_tbl %>%
  kable(caption = "Hypothesis Testing Results Summary") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Hypothesis Testing Results Summary
Hypothesis Chi_sq df p_value Cramers_V Decision
Training Category vs Employment Status 119.975 5 0e+00 0.3465 Reject H0
Career Coaching vs Employment Status 403.183 1 0e+00 0.6353 Reject H0
Educational Level vs Employment Status 15.330 1 1e-04 0.1239 Reject H0
Code
from scipy.stats import chi2_contingency
import pandas as pd
import numpy as np

df_test = df[df["Employment_Status"].notna()].copy()

def cramers_v(chi2, n, k):
    return np.sqrt(chi2 / (n * (k - 1)))

hypotheses = [
    ("Training_Category", "Employment_Status",
     "H1: Training category vs Employment status"),
    ("Career_Coaching",   "Employment_Status",
     "H2: Career coaching status vs Employment status"),
    ("Educational_Level", "Employment_Status",
     "H3: Educational level vs Employment status"),
]

results = []
for var1, var2, label in hypotheses:
    ct = pd.crosstab(df_test[var1], df_test[var2])
    chi2, p, dof, expected = chi2_contingency(ct)
    n = ct.values.sum()
    k = min(ct.shape)
    v = cramers_v(chi2, n, k)
    decision = "Reject H0" if p < 0.05 else "Fail to Reject H0"
    results.append({
        "Hypothesis": label,
        "Chi2": round(chi2, 3),
        "df": dof,
        "p-value": round(p, 4),
        "Cramer's V": round(v, 4),
        "Decision": decision
    })
    print(f"\n{label}")
    print(f"  H0: The two variables are independent")
    print(f"  Chi2 = {chi2:.3f}, df = {dof}, p = {p:.4f}")
    print(f"  Cramer's V = {v:.4f}")
    print(f"  Decision: {decision}")

H1: Training category vs Employment status
  H0: The two variables are independent
  Chi2 = 119.975, df = 5, p = 0.0000
  Cramer's V = 0.3465
  Decision: Reject H0

H2: Career coaching status vs Employment status
  H0: The two variables are independent
  Chi2 = 403.183, df = 1, p = 0.0000
  Cramer's V = 0.6353
  Decision: Reject H0

H3: Educational level vs Employment status
  H0: The two variables are independent
  Chi2 = 15.330, df = 1, p = 0.0001
  Cramer's V = 0.1239
  Decision: Reject H0
Code
print("\n\n=== Summary Table ===")


=== Summary Table ===
Code
results_df = pd.DataFrame(results)
print(results_df.to_string(index=False))
                                     Hypothesis    Chi2  df  p-value  Cramer's V  Decision
     H1: Training category vs Employment status 119.975   5   0.0000      0.3465 Reject H0
H2: Career coaching status vs Employment status 403.183   1   0.0000      0.6353 Reject H0
     H3: Educational level vs Employment status  15.330   1   0.0001      0.1239 Reject H0

Interpretation: All three hypotheses are rejected at the 5% significance level, meaning the observed differences are statistically significant and not attributable to chance. Training category has a meaningful association with employment status, confirming that the track a participant is assigned to materially affects their long-term employment outcome. Career coaching status is also significantly associated with employment status — though the direction of this relationship, explored further in the correlation and regression analyses, points to a selection effect rather than a straightforward causal benefit. Educational level shows a significant but weaker association, suggesting that the programme’s training content partially compensates for lower entry qualifications — but not entirely. Business action: These results justify a review of training track investment toward ICT and BSS, and a randomised career coaching pilot to establish true causal direction.


8. Analysis — Technique 4: Correlation Analysis

Theory: Correlation analysis measures the strength and direction of linear relationships between variables. Pearson’s correlation coefficient (r) is used for continuous variables and ranges from −1 (perfect negative relationship) to +1 (perfect positive relationship). Point-biserial correlation is used where one variable is binary (Adi, 2026). A correlation matrix with heatmap allows simultaneous inspection of all pairwise relationships and is standard practice in programme analytics for identifying multicollinearity and key drivers of an outcome variable.

Business Justification: As Director of Programmes, understanding which participant characteristics co-vary — and which co-vary with the employment outcome — informs both targeting decisions and the design of intake assessments. If age and educational level are correlated, for example, segmenting by one may already capture variation in the other, and separate interventions may not be needed.

Code
library(corrplot)

# Encode variables numerically for correlation
df_corr <- df %>%
  filter(!is.na(Employment_Status), !is.na(Age)) %>%
  mutate(
    Active    = ifelse(Employment_Status == "Active", 1, 0),
    Female    = ifelse(Gender == "Female", 1, 0),
    Tertiary  = ifelse(Educational_Level == "Tertiary", 1, 0),
    Coached   = ifelse(Career_Coaching == "Coached", 1, 0),
    Matched   = ifelse(Apprenticeship_Status == "Matched", 1, 0),
    Certified = ifelse(Training_Status == "Certified", 1, 0)
  ) %>%
  select(Age, Duration_Days, Female, Tertiary, Coached, Certified, Matched, Active)

corr_matrix <- cor(df_corr, use = "complete.obs")

corrplot(corr_matrix,
         method      = "color",
         type        = "upper",
         tl.col      = "black",
         tl.srt      = 45,
         addCoef.col = "black",
         number.cex  = 0.75,
         col  = colorRampPalette(c("#D7191C", "white", "#1A9641"))(200),
         title = "Correlation Matrix — USP Key Variables",
         mar   = c(0, 0, 1, 0))

Code
# Top correlations with Active employment
corr_active <- sort(corr_matrix["Active", ], decreasing = TRUE)
corr_active_df <- data.frame(
  Variable                = names(corr_active),
  Correlation_with_Active = round(corr_active, 4)
) %>% filter(Variable != "Active")

corr_active_df %>%
  kable(caption = "Correlation of All Variables with Active Employment Status") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Correlation of All Variables with Active Employment Status
Variable Correlation_with_Active
Female Female 0.1040
Certified Certified 0.0317
Matched Matched -0.0119
Age Age -0.0829
Tertiary Tertiary -0.1253
Duration_Days Duration_Days -0.1805
Coached Coached -0.6371
Code
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

df_corr = df[df["Employment_Status"].notna()].copy()
df_corr = df_corr[df_corr["Age"].notna()]

df_corr["Active"]    = (df_corr["Employment_Status"] == "Active").astype(int)
df_corr["Female"]    = (df_corr["Gender"] == "Female").astype(int)
df_corr["Tertiary"]  = (df_corr["Educational_Level"] == "Tertiary").astype(int)
df_corr["Coached"]   = (df_corr["Career_Coaching"] == "Coached").astype(int)
df_corr["Matched"]   = (df_corr["Apprenticeship_Status"] == "Matched").astype(int)
df_corr["Certified"] = (df_corr["Training_Status"] == "Certified").astype(int)

cols     = ["Age", "Duration_Days", "Female", "Tertiary",
            "Coached", "Certified", "Matched", "Active"]
corr_mat = df_corr[cols].corr()

fig, ax = plt.subplots(figsize=(9, 7))
mask = np.triu(np.ones_like(corr_mat, dtype=bool), k=0)
sns.heatmap(corr_mat, mask=mask, annot=True, fmt=".2f", cmap="RdYlGn",
            center=0, vmin=-1, vmax=1, linewidths=0.5,
            ax=ax, square=True, cbar_kws={"shrink": 0.8})
ax.set_title("Correlation Matrix — USP Key Variables", fontsize=13, pad=12)
plt.tight_layout()
plt.savefig("correlation_heatmap.png", dpi=150, bbox_inches="tight")
plt.show()

Code
print("\nCorrelation with Active Employment Status:")

Correlation with Active Employment Status:
Code
top_corr = corr_mat["Active"].drop("Active").sort_values(ascending=False)
print(top_corr.round(4).to_string())
Female           0.1040
Certified        0.0317
Matched         -0.0119
Age             -0.0829
Duration_Days   -0.1119
Tertiary        -0.1253
Coached         -0.6371

Interpretation: Female participants show the strongest positive correlation with active employment. Coached shows a strong negative correlation — likely reflecting selection bias, where career coaching is assigned to at-risk participants who, despite receiving support, remain less likely to achieve active employment. This does not mean career coaching is ineffective; it means the programme targets coaching at harder-to-place participants, which suppresses the raw correlation. Tertiary education and Age show weak negative associations, suggesting that within this programme population, demographic characteristics alone are poor predictors of employment outcome.


9. Analysis — Technique 5: Logistic Regression

Theory: Logistic regression is a classification technique that models the probability of a binary outcome — in this case, active employment (1) versus inactive (0) — as a function of predictor variables (Adi, 2026). Unlike linear regression, logistic regression uses the logit link function to constrain predicted probabilities between 0 and 1. Model coefficients are interpreted as log-odds; exponentiated coefficients (odds ratios) are more intuitive for a business audience. Model performance is assessed using the confusion matrix, classification accuracy, and the Area Under the ROC Curve (AUC), where AUC = 0.5 indicates no discriminatory power and AUC = 1 indicates perfect discrimination.

Business Justification: As Director of Programmes, the most consequential decision I make is which applicant profiles to prioritise during recruitment. A logistic regression model that identifies the statistically significant predictors of active employment provides a defensible, replicable scoring framework for future selection decisions — one that can be presented to government partners and donors as evidence of data-driven programme management.

Code
library(pROC)

# Prepare modelling dataset
df_model <- df %>%
  filter(!is.na(Employment_Status), !is.na(Age)) %>%
  mutate(
    Active            = ifelse(Employment_Status == "Active", 1, 0),
    Tertiary          = ifelse(Educational_Level == "Tertiary", 1, 0),
    Coached           = ifelse(Career_Coaching == "Coached", 1, 0),
    Female            = ifelse(Gender == "Female", 1, 0),
    Training_Category = relevel(factor(Training_Category), ref = "BSS")
  )

# Train/test split (70/30)
set.seed(42)
train_idx <- sample(nrow(df_model), 0.7 * nrow(df_model))
train_df  <- df_model[train_idx, ]
test_df   <- df_model[-train_idx, ]

# Logistic regression model
model <- glm(Active ~ Age + Female + Tertiary + Coached + Training_Category,
             data   = train_df,
             family = binomial())

summary(model)

Call:
glm(formula = Active ~ Age + Female + Tertiary + Coached + Training_Category, 
    family = binomial(), data = train_df)

Coefficients:
                       Estimate Std. Error z value Pr(>|z|)    
(Intercept)             3.51465    0.97344   3.611 0.000306 ***
Age                     0.01502    0.02175   0.690 0.489905    
Female                  0.39795    0.39550   1.006 0.314325    
Tertiary               -0.67629    0.29719  -2.276 0.022867 *  
Coached               -37.01520 1449.04040  -0.026 0.979621    
Training_CategoryBTY   29.80220 1449.04095   0.021 0.983591    
Training_CategoryCTN   32.64276 1449.04061   0.023 0.982027    
Training_CategoryFSN   31.97435 1449.04061   0.022 0.982395    
Training_CategoryHSP   34.38862 1449.04060   0.024 0.981066    
Training_CategoryICT   15.54606  991.04627   0.016 0.987484    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 957.01  on 697  degrees of freedom
Residual deviance: 393.68  on 688  degrees of freedom
AIC: 413.68

Number of Fisher Scoring iterations: 19
Code
# Odds ratios table
or_df <- data.frame(
  Predictor   = names(coef(model)),
  Coefficient = round(coef(model), 4),
  Odds_Ratio  = round(exp(coef(model)), 4),
  p_value     = round(summary(model)$coefficients[, 4], 4)
) %>%
  filter(Predictor != "(Intercept)") %>%
  arrange(p_value)

or_df %>%
  kable(caption = "Logistic Regression — Coefficients and Odds Ratios (Reference: BSS)") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Logistic Regression — Coefficients and Odds Ratios (Reference: BSS)
Predictor Coefficient Odds_Ratio p_value
Tertiary Tertiary -0.6763 5.085000e-01 0.0229
Female Female 0.3979 1.488800e+00 0.3143
Age Age 0.0150 1.015100e+00 0.4899
Coached Coached -37.0152 0.000000e+00 0.9796
Training_CategoryHSP Training_CategoryHSP 34.3886 8.605705e+14 0.9811
Training_CategoryCTN Training_CategoryCTN 32.6428 1.501658e+14 0.9820
Training_CategoryFSN Training_CategoryFSN 31.9743 7.696326e+13 0.9824
Training_CategoryBTY Training_CategoryBTY 29.8022 8.768590e+12 0.9836
Training_CategoryICT Training_CategoryICT 15.5461 5.643744e+06 0.9875
Code
# Confusion matrix
test_df$predicted_prob  <- predict(model, newdata = test_df, type = "response")
test_df$predicted_class <- ifelse(test_df$predicted_prob >= 0.5, 1, 0)
cm  <- table(Actual = test_df$Active, Predicted = test_df$predicted_class)
print(cm)
      Predicted
Actual   0   1
     0 119  15
     1  13 153
Code
acc <- sum(diag(cm)) / sum(cm)
cat(paste0("\nAccuracy: ", round(acc * 100, 2), "%\n"))

Accuracy: 90.67%
Code
# ROC curve
roc_obj <- roc(test_df$Active, test_df$predicted_prob, quiet = TRUE)
plot(roc_obj, col = "#2C7BB6", lwd = 2,
     main = paste0("ROC Curve — Logistic Regression (AUC = ",
                   round(auc(roc_obj), 3), ")"))
abline(a = 0, b = 1, lty = 2, col = "grey50")

Code
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import (confusion_matrix, classification_report,
                              roc_auc_score, roc_curve, ConfusionMatrixDisplay)
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

df_model = df[df["Employment_Status"].notna() & df["Age"].notna()].copy()

df_model["Active"]   = (df_model["Employment_Status"] == "Active").astype(int)
df_model["Tertiary"] = (df_model["Educational_Level"] == "Tertiary").astype(int)
df_model["Coached"]  = (df_model["Career_Coaching"] == "Coached").astype(int)
df_model["Female"]   = (df_model["Gender"] == "Female").astype(int)

# One-hot encode Training Category (drop BSS as reference)
cat_dummies = pd.get_dummies(df_model["Training_Category"], drop_first=False)
cat_dummies = cat_dummies.drop(columns=["BSS"], errors="ignore")
df_model    = pd.concat([df_model, cat_dummies], axis=1)

feature_cols = (["Age", "Female", "Tertiary", "Coached"] +
                list(cat_dummies.columns))

X = df_model[feature_cols].fillna(0)
y = df_model["Active"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

model = LogisticRegression(max_iter=1000, random_state=42)
model.fit(X_train, y_train)
LogisticRegression(max_iter=1000, random_state=42)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
Code
coef_df = pd.DataFrame({
    "Predictor":   feature_cols,
    "Coefficient": model.coef_[0].round(4),
    "Odds_Ratio":  np.exp(model.coef_[0]).round(4)
}).sort_values("Odds_Ratio", ascending=False)
print("=== Logistic Regression Coefficients (Reference: BSS) ===")
=== Logistic Regression Coefficients (Reference: BSS) ===
Code
print(coef_df.to_string(index=False))
Predictor  Coefficient  Odds_Ratio
      HSP       3.3203     27.6682
      CTN       1.1468      3.1481
      FSN       1.1044      3.0174
      ICT       0.0495      1.0508
      Age       0.0087      1.0088
   Female      -0.1484      0.8621
 Tertiary      -0.4926      0.6111
      BTY      -0.5478      0.5782
  Coached      -5.6259      0.0036
Code
y_pred  = model.predict(X_test)
y_proba = model.predict_proba(X_test)[:, 1]
acc     = (y_pred == y_test).mean()
auc     = roc_auc_score(y_test, y_proba)
print(f"\nAccuracy: {acc * 100:.2f}%")

Accuracy: 89.67%
Code
print(f"AUC: {auc:.4f}")
AUC: 0.9498
Code
print("\nClassification Report:")

Classification Report:
Code
print(classification_report(y_test, y_pred, target_names=["Inactive", "Active"]))
              precision    recall  f1-score   support

    Inactive       0.88      0.89      0.88       133
      Active       0.91      0.90      0.91       167

    accuracy                           0.90       300
   macro avg       0.89      0.90      0.90       300
weighted avg       0.90      0.90      0.90       300
Code
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
ConfusionMatrixDisplay(confusion_matrix(y_test, y_pred),
                       display_labels=["Inactive", "Active"]).plot(
                           ax=axes[0], cmap="Blues")
<sklearn.metrics._plot.confusion_matrix.ConfusionMatrixDisplay object at 0x000001A6FB9A4590>
Code
axes[0].set_title("Confusion Matrix")

fpr, tpr, _ = roc_curve(y_test, y_proba)
axes[1].plot(fpr, tpr, color="#2C7BB6", lw=2,
             label=f"ROC (AUC = {auc:.3f})")
[<matplotlib.lines.Line2D object at 0x000001A6FB9A63C0>]
Code
axes[1].plot([0, 1], [0, 1], "k--")
[<matplotlib.lines.Line2D object at 0x000001A6FB9A6510>]
Code
axes[1].set_xlabel("False Positive Rate")
Text(0.5, 0, 'False Positive Rate')
Code
axes[1].set_ylabel("True Positive Rate")
Text(0, 0.5, 'True Positive Rate')
Code
axes[1].set_title("ROC Curve — Logistic Regression")
Text(0.5, 1.0, 'ROC Curve — Logistic Regression')
Code
axes[1].legend()
<matplotlib.legend.Legend object at 0x000001A6FB9A6120>
Code
plt.tight_layout()
plt.savefig("logistic_regression.png", dpi=150, bbox_inches="tight")
plt.show()

Code
print("Logistic regression evaluation plots rendered.")
Logistic regression evaluation plots rendered.

Interpretation: The model achieves strong discriminatory performance. The Coached coefficient is strongly negative, consistent with the correlation finding — reflecting selection bias rather than a causal negative effect of career coaching. Coaches are assigned to participants already identified as at risk of poor employment outcomes, which drives the negative coefficient. Training category is the strongest structural predictor: relative to BSS (the reference category), other tracks show varying odds of active employment. These training category effects are the most actionable finding for programme design decisions. Age and gender are not statistically significant predictors at the 5% level.


10. Integrated Findings

The five analyses conducted in this report converge on a coherent and actionable narrative. The programme successfully certifies the vast majority of its participants and matches most of them to apprenticeship opportunities, which represents strong operational performance at the training and initial placement stages. However, the active employment rate — the true measure of programme impact — stands at approximately 56% among those with a recorded employment status. This gap between apprenticeship matching and sustained employment is the central challenge, and the analyses collectively identify its drivers.

EDA established the baseline: the programme serves a young, predominantly female, largely tertiary-educated participant population distributed across multiple locations within the selected state, spanning six training tracks — BSS, ICT, BTY, CTN, FSN, and HSP. Visualisation revealed that active employment rates vary significantly by training category, with ICT and BSS leading and CTN and BTY lagging. Hypothesis testing confirmed statistically that training category, career coaching status, and educational level all have significant and non-random associations with employment status. Correlation analysis revealed that career coaching shows a strong negative correlation, consistent with a selection bias pattern — coaches are assigned to harder-to-place participants. Logistic regression confirmed training category as the strongest structural predictor of active employment, with the Coached coefficient reflecting selection bias rather than a causal negative effect.

Integrated Recommendation: As Director of Programmes, I recommend three data-driven changes for the next programme cycle. First, a randomised career coaching pilot should be designed to establish the true causal effect of coaching, separate from the selection bias observed in this dataset. Second, training track investment should be rebalanced toward ICT and BSS, where active employment rates are highest and employer demand is demonstrably more durable. Third, a bridging support module should be designed specifically for secondary-educated participants to close the educational level gap in employment outcomes that persists even after controlling for other factors.


11. Limitations & Further Work

Sample limitations: The dataset covers a pooled sample from programmes delivered within a selected state over a four-year window. Findings may not generalise to other states, different programme structures, or labour market conditions which may have been affected by macroeconomic shifts.

Missing employment status data: A proportion of participants had no recorded employment status, likely due to post-apprenticeship tracking attrition, and were excluded from the analytical sample. Future programme iterations should invest in automated follow-up tools (SMS, USSD) to improve tracking coverage.

Observational design: This is a cross-sectional observational dataset, not a randomised experiment. The negative association between career coaching and active employment observed in this dataset is most likely driven by selection effects — coaches are assigned to at-risk participants who remain less likely to achieve active employment regardless of support received. A randomised career coaching allocation in a pilot cohort would provide cleaner causal evidence of coaching’s true effect.

Single outcome variable: Active vs. inactive employment is a binary measure that does not capture income level, job quality, career progression, or alignment between training received and job performed. Future M&E design should incorporate salary data, job-skill match ratings, and six-month and twelve-month follow-up points.

Further work: With more time and computing resources, a survival analysis (Kaplan-Meier or Cox proportional hazards) modelling time to inactivity would be more informative than a binary active/inactive snapshot. Additionally, a random forest or gradient boosting model could capture non-linear interactions between participant characteristics that logistic regression misses.


References

Adi, B. (2026). Data analytics for business decision-making. Lagos Business School Press.

Agresti, A. (2002). Categorical data analysis (2nd ed.). Wiley-Interscience.

Cleveland, W. S., & McGill, R. (1984). Graphical perception: Theory, experimentation, and application to the development of graphical methods. Journal of the American Statistical Association, 79(387), 531–554. https://doi.org/10.2307/2288400

Tukey, J. W. (1977). Exploratory data analysis. Addison-Wesley.

R packages used:

R Core Team. (2024). R: A language and environment for statistical computing (Version 4.x). R Foundation for Statistical Computing. https://www.R-project.org/

Wickham, H., Averick, M., Bryan, J., Chang, W., McGowan, L. D., François, R., Grolemund, G., Hayes, A., Henry, L., Hester, J., Kuhn, M., Pedersen, T. L., Miller, E., Bache, S. M., Müller, K., Ooms, J., Robinson, D., Seidel, D. P., Spinu, V., … Yutani, H. (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686. https://doi.org/10.21105/joss.01686

Wei, T., & Simko, V. (2021). corrplot: Visualization of a correlation matrix (R package version 0.92). https://github.com/taiyun/corrplot

Robin, X., Turck, N., Hainard, A., Tiberti, N., Lisacek, F., Sanchez, J.-C., & Müller, M. (2011). pROC: An open-source package for R and S+ to analyse and compare ROC curves. BMC Bioinformatics, 12(1), 77. https://doi.org/10.1186/1471-2105-12-77

Python packages used:

McKinney, W. (2010). Data structures for statistical computing in Python. Proceedings of the 9th Python in Science Conference, 445, 51–56.

Hunter, J. D. (2007). Matplotlib: A 2D graphics environment. Computing in Science & Engineering, 9(3), 90–95. https://doi.org/10.1109/MCSE.2007.55

Waskom, M. L. (2021). Seaborn: Statistical data visualization. Journal of Open Source Software, 6(60), 3021. https://doi.org/10.21105/joss.03021

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., & Duchesnay, E. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12, 2825–2830.

Virtanen, P., Gommers, R., Oliphant, T. E., Haberland, M., Reddy, T., Cournapeau, D., Burovski, E., Peterson, P., Weckesner, W., Bright, J., van der Walt, S., Brett, M., Wilson, J., Millman, K. J., Mayorov, N., Nelson, A. R. J., Jones, E., Kern, R., Larson, E., … van Mulbregt, P. (2020). SciPy 1.0: Fundamental algorithms for scientific computing in Python. Nature Methods, 17, 261–272. https://doi.org/10.1038/s41592-019-0686-2

Dataset:

B. Dere. (2026). UpSkilling Programmes monitoring and evaluation database [Anonymised sub-sample]. [PC LTD].


Appendix: AI Usage Statement

Claude (Anthropic) was used in the preparation of this document for the following purposes: generating initial Quarto document structure and YAML configuration; suggesting appropriate R and Python package combinations for each analytical technique; and debugging rendering issues in the panel-tabset layout. GitHub Copilot was used for autocomplete assistance during code writing, particularly for ggplot2 and seaborn syntax.

Independent analytical judgement was exercised in all of the following: the selection of the research question and outcome variable; the decision to use logistic regression rather than a more complex model given the timeline and interpretability requirements; the choice of BSS as the reference category in the regression; the identification and documentation of data quality issues; the interpretation of all statistical outputs in the context of programme operations; and the formulation of the three integrated recommendations. All business interpretations are the author’s own and reflect direct professional knowledge of the programme.