Predicting Active Employment Outcomes in UpSkilling Programmes

1. Executive Summary

This study analyses data from various vocational training and apprenticeship initiatives that enrolled approximately 1,000 participants across multiple cohorts between 2021 and 2024. The central business question is: which participant profiles and training categories are most likely to result in active employment, and how can this knowledge make programme design, trainer allocation, and apprenticeship partnership decisions more data-driven?

Data were extracted from a pooled anonymised programme database covering 20 variables including demographic characteristics, training track, career coaching status, certification outcome, apprenticeship status, and final employment status. Six training categories were covered: Business Support Services (BSS), Information and Communications Technology (ICT), Beauty (BTY), Construction (CTN), Fashion (FSN), and Hospitality (HSP). Five analytical techniques were applied: Exploratory Data Analysis (EDA), Visualisation, Hypothesis Testing, Correlation Analysis, and Logistic Regression. Key findings reveal that training category and educational level are the strongest structural predictors of active employment, while career coaching assignment reflects a selection effect that warrants further causal investigation. Participants in ICT and BSS tracks and those with tertiary education consistently show higher rates of active employment. The integrated recommendation is that future cohorts should invest in ICT and BSS tracks and design a randomised career coaching pilot to establish true causal effect.

2. Professional Disclosure

Job Title: Director of Programmes
Organisation/Sector: PC LTD | Skills Development / Social Impact

Exploratory Data Analysis (EDA): As Director of Programmes, understanding who is enrolling in the programme before making design decisions is foundational. EDA allows me to monitor whether the programme is reaching its intended demographic — young, low-income residents within the state — and to detect shifts in participant profiles across cohorts that would require a programmatic response. Without this diagnostic layer, decisions on recruitment targeting and resource allocation are based on assumption rather than evidence.

Visualisation: Reporting to funders, government partners, and the board requires communicating programme performance clearly and efficiently. Visualisation of apprenticeship rates by training category, gender, and location gives me a dashboard-ready view that directly informs which tracks to scale, discontinue, or redesign in future cohorts. It also enables me to present evidence to partners in a form that drives decisions rather than simply informs them.

Hypothesis Testing: A recurring operational question I face is whether observed differences in outcomes — for example, whether coached participants achieve employment at higher rates than uncoached ones, or whether one training category outperforms another — are statistically real or simply sampling noise. Hypothesis testing provides the rigorous basis needed to act on those differences, justify resource reallocation, and defend programmatic changes to stakeholders.

Correlation Analysis: As Director, I allocate career coaching resources and decide which participant segments receive intensive support. Correlation analysis between age, educational level, training duration, and employment outcomes allows me to test whether those targeting assumptions are evidence-based or inherited from previous programme designs. Understanding which variables move together informs more precise targeting in future cohorts.

Logistic Regression: The most consequential decision I make each cycle is which applicant profiles to prioritise during recruitment. A logistic regression model identifying the strongest predictors of active employment provides a defensible, data-driven scoring framework for future selection decisions — one that can be presented to government partners and donors as evidence of systematic programme improvement.

3. Data Collection & Sampling

Source: UpSkilling Programmes pooled anonymised monitoring and evaluation (M&E) database, maintained by the Directorate of Programmes.

Collection Method: The dataset was extracted from a pooled anonymised database spanning multiple skills development programme cohorts delivered across a state within Nigeria. No third-party data was used.

Sampling Frame: Participants registered across multiple programme cohorts between 2021 and 2024. A stratified random sub-sample of 1,000 participants was drawn to preserve the Active/Inactive employment ratio of the full population while protecting the identity of any individual cohort or project.

Sample Size: 1,000 observations (participants) across 20 variables.

Time Period: January 2021 to December 2024 (approximately 4 years).

Variables: The dataset includes career coaching status, enrollment ID, qualification, educational level, age, age bracket, location code, marital status, gender, training category, training sub-category, programme start and end dates, training status, apprenticeship status, reason if declined, date of employment, employment type, employment status, and job formality classification.

Training Category Codes: BSS = Business Support Services; ICT = Information and Communications Technology; BTY = Beauty; CTN = Construction; FSN = Fashion; HSP = Hospitality.

Training Sub-Category Codes: ABG = Agent Banking; BC = Bakery and Confectioneries; FSN = Fashion; DGM = Digital Marketing; ELC = Electrical; PLM = Plumbing; CSM = Cosmetology; HDR = Hair Dressing; TLG = Tiling; CSR = Customer Service; MSN = Masonry; S&M = Sales and Marketing.

Ethical & Consent Notes: The dataset used in this analysis is a randomly drawn sub-sample from a pooled anonymised database spanning multiple skills development programme cohorts delivered across a Nigerian state between 2021 and 2024. All personally identifiable information — including names, national identification numbers, phone numbers, and employer details — has been removed prior to extraction. Enrollment IDs are synthetic identifiers assigned for analytical purposes only and do not correspond to any original programme record. Geographic identifiers have been replaced with randomised location codes (e.g. LGA 14, LGA 80, LGA 144) that do not correspond to any named administrative unit and do not reveal the state or region of programme delivery. The pooled dataset design means that no single project, cohort, or organisation can be identified from the analytical extract. This analysis has been conducted in accordance with the Nigerian Data Protection Act 2023 (NDPA) and the data governance principles of the originating organisation.

Data Quality Issues Identified: 1. Missing Employment Status: Participants with no recorded employment status were excluded from the analytical sample. These are predominantly participants still within the programme cycle at the time of data extraction or those who became unreachable during follow-up. 2. Age Outlier: One participant is recorded as age 1, which is clearly a data entry error. This record is excluded from age-related analyses.

4. Data Description

Code

# Install any missing packages silently
pkgs <- c("tidyverse","skimr","knitr","kableExtra","lubridate",
          "ggplot2","scales","vcd","corrplot","pROC")
new_pkgs <- pkgs[!pkgs %in% installed.packages()[,"Package"]]
if (length(new_pkgs)) install.packages(new_pkgs, quiet = TRUE)

# Load libraries
library(tidyverse)
library(skimr)
library(knitr)
library(kableExtra)
library(lubridate)

# Load data
df <- read.csv("Data_R_fixed.csv", stringsAsFactors = FALSE)

# Fix: replace blank Employment_Status with NA
df$Employment_Status[df$Employment_Status == ""] <- NA

# Parse dates
df$Start_Date    <- as.Date(parse_date_time(df$Start_Date,
                    orders = c("ymd", "dmy", "mdy", "ymd HMS")))
df$End_Date      <- as.Date(parse_date_time(df$End_Date,
                    orders = c("ymd", "dmy", "mdy", "ymd HMS")))
df$Duration_Days <- as.numeric(df$End_Date - df$Start_Date)

# Remove obvious age error
df <- df %>% filter(is.na(Age) | Age > 5)

# Summary table of key numeric variables
df %>%
  select(Age, Duration_Days) %>%
  summary() %>%
  kable(caption = "Summary Statistics — Numeric Variables") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Summary Statistics — Numeric Variables
	Age	Duration_Days
	Min. :18.00	Min. : 18.00
	1st Qu.:25.00	1st Qu.: 28.00
	Median :30.00	Median : 56.00
	Mean :30.14	Mean : 53.98
	3rd Qu.:35.00	3rd Qu.: 84.00
	Max. :66.00	Max. :109.00
	NA's :1	NA

Code

# Categorical summaries
cat_summary <- data.frame(
  Variable = c("Gender", "Educational Level", "Training Status",
               "Apprenticeship Status", "Employment Status", "Career Coaching"),
  Categories = c(
    paste(names(table(df$Gender)),                 collapse = " | "),
    paste(names(table(df$Educational_Level)),       collapse = " | "),
    paste(names(table(df$Training_Status)),         collapse = " | "),
    paste(names(table(df$Apprenticeship_Status)),   collapse = " | "),
    paste(names(table(df$Employment_Status)),       collapse = " | "),
    paste(names(table(df$Career_Coaching)),         collapse = " | ")
  ),
  N_Valid = c(
    sum(!is.na(df$Gender)),
    sum(!is.na(df$Educational_Level)),
    sum(!is.na(df$Training_Status)),
    sum(!is.na(df$Apprenticeship_Status)),
    sum(!is.na(df$Employment_Status)),
    sum(!is.na(df$Career_Coaching))
  )
)

cat_summary %>%
  kable(caption = "Categorical Variable Overview") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Categorical Variable Overview
Variable	Categories	N_Valid
Gender	Female \| Male	999
Educational Level	Secondary \| Tertiary	999
Training Status	Certified \| Not Certified	999
Apprenticeship Status	Matched \| Self-Directed \| Unmatched	999
Employment Status	Active \| Inactive	999
Career Coaching	Coached \| Not Coached	999

Code

import pandas as pd
import numpy as np

df = pd.read_csv("Data_R_fixed.csv", encoding="latin1")

# Fix blank Employment_Status
df["Employment_Status"] = df["Employment_Status"].replace("", float("nan"))

# Parse dates
df["Start_Date"]    = pd.to_datetime(df["Start_Date"], errors="coerce")
df["End_Date"]      = pd.to_datetime(df["End_Date"],   errors="coerce")
df["Duration_Days"] = (df["End_Date"] - df["Start_Date"]).dt.days

# Remove age outlier
df = df[(df["Age"].isna()) | (df["Age"] > 5)]

# Numeric summary
numeric_summary = df[["Age","Duration_Days"]].describe().round(2)
print("=== Numeric Variable Summary ===")

=== Numeric Variable Summary ===

Code

print(numeric_summary.to_string())

          Age  Duration_Days
count  998.00         876.00
mean    30.14          52.83
std      6.90          28.36
min     18.00          18.00
25%     25.00          27.00
50%     30.00          56.00
75%     35.00          85.00
max     66.00         109.00

Code

# Categorical summary
print("\n=== Categorical Variable Counts ===")


=== Categorical Variable Counts ===

Code

for col in ["Gender","Educational_Level","Training_Status","Apprenticeship_Status",
            "Employment_Status","Career_Coaching","Training_Category","Training_SubCategory"]:
    print(f"\n{col}:")
    print(df[col].value_counts(dropna=False).to_string())


Gender:
Gender
Female    606
Male      393

Educational_Level:
Educational_Level
Tertiary     653
Secondary    346

Training_Status:
Training_Status
Certified        987
Not Certified     12

Apprenticeship_Status:
Apprenticeship_Status
Matched          996
Self-Directed      2
Unmatched          1

Employment_Status:
Employment_Status
Active      558
Inactive    441

Career_Coaching:
Career_Coaching
Coached        648
Not Coached    351

Training_Category:
Training_Category
BSS    231
CTN    180
FSN    179
HSP    179
ICT    138
BTY     92

Training_SubCategory:
Training_SubCategory
ABG    201
FSN    179
BC     179
DGM    138
ELC     83
PLM     49
CSM     48
HDR     44
TLG     34
CSR     18
MSN     14
S&M     12

Interpretation: The dataset covers approximately 1,000 participants with a mean age of approximately 30 years, reflecting a young workforce target population. Female participants (approximately 61%) outnumber male participants (39%). The majority hold tertiary qualifications and over 90% achieved certification. Approximately 56% of participants in the sample are actively employed at the time of data extraction, pointing to a significant post-apprenticeship retention or engagement challenge that this analysis seeks to understand. The six training categories — BSS, ICT, BTY, CTN, FSN, and HSP — span a broad range of vocational skills, with BSS being the largest track by enrolment.

5. Analysis — Technique 1: Exploratory Data Analysis

Theory: Exploratory Data Analysis (EDA) is the practice of summarising a dataset’s main characteristics, often using statistical and visual methods, before formal modelling. It is grounded in the tradition established by Tukey (1977) and is used to detect patterns, anomalies, and relationships that guide subsequent analysis. EDA is particularly important for administrative datasets where data quality issues are common and assumptions about the population may not hold.

Business Justification: As Director of Programmes, EDA provides the foundation for all downstream decisions. Before allocating resources, adjusting training tracks, or redesigning recruitment, I need to know who is actually in the programme, how they are distributed across demographic categories, and where the data has gaps. EDA surfaces these facts systematically rather than relying on anecdotal reports from field staff.

Code

library(ggplot2)
library(scales)

# Age distribution
p1 <- ggplot(df %>% filter(!is.na(Age)), aes(x = Age)) +
  geom_histogram(binwidth = 5, fill = "#2C7BB6", colour = "white", alpha = 0.85) +
  geom_vline(aes(xintercept = mean(Age, na.rm = TRUE)),
             colour = "#D7191C", linetype = "dashed", linewidth = 1) +
  labs(title = "Age Distribution of USP Participants",
       subtitle = "Dashed line = mean age (≈30 years)",
       x = "Age (years)", y = "Count") +
  theme_minimal(base_size = 13)
print(p1)

Code

# Apprenticeship status distribution
app_counts <- df %>%
  count(Apprenticeship_Status) %>%
  mutate(pct = n / sum(n) * 100)

p2 <- ggplot(app_counts,
             aes(x = reorder(Apprenticeship_Status, -n), y = pct,
                 fill = Apprenticeship_Status)) +
  geom_col(show.legend = FALSE) +
  geom_text(aes(label = paste0(round(pct, 1), "%")), vjust = -0.5, size = 4) +
  labs(title = "Apprenticeship Status of USP Participants",
       x = "Apprenticeship Status", y = "Percentage (%)") +
  theme_minimal(base_size = 13)
print(p2)

Code

# Employment status breakdown
emp_counts <- df %>%
  filter(!is.na(Employment_Status)) %>%
  count(Employment_Status) %>%
  mutate(pct = n / sum(n) * 100)

p3 <- ggplot(emp_counts, aes(x = Employment_Status, y = pct, fill = Employment_Status)) +
  geom_col(show.legend = FALSE) +
  scale_fill_manual(values = c("Active" = "#1A9641", "Inactive" = "#D7191C")) +
  geom_text(aes(label = paste0(round(pct, 1), "%")), vjust = -0.5, size = 4) +
  labs(title = "Employment Status Among USP Participants",
       subtitle = "Participants with no recorded employment status excluded",
       x = "Employment Status", y = "Percentage (%)") +
  theme_minimal(base_size = 13)
print(p3)

Code

# Training category distribution
cat_counts <- df %>%
  count(Training_Category) %>%
  mutate(pct = n / sum(n) * 100)

p4 <- ggplot(cat_counts, aes(x = reorder(Training_Category, pct), y = pct, fill = pct)) +
  geom_col(show.legend = FALSE) +
  geom_text(aes(label = paste0(round(pct, 1), "%")), hjust = -0.1, size = 4) +
  coord_flip() +
  scale_fill_gradient(low = "#FDAE61", high = "#2C7BB6") +
  labs(title = "Participant Distribution by Training Category",
       subtitle = "BSS=Business Support Services, ICT=Information & Comms Tech,\nBTY=Beauty, CTN=Construction, FSN=Fashion, HSP=Hospitality",
       x = "Training Category", y = "Percentage (%)") +
  theme_minimal(base_size = 13) +
  ylim(0, 35)
print(p4)

Code

import matplotlib.pyplot as plt

fig, axes = plt.subplots(2, 2, figsize=(14, 10))
fig.suptitle("EDA — USP Programme Overview", fontsize=15, fontweight="bold", y=1.01)

# Age distribution
ax = axes[0, 0]
age_clean = df["Age"].dropna()
ax.hist(age_clean, bins=15, color="#2C7BB6", edgecolor="white", alpha=0.85)
ax.axvline(age_clean.mean(), color="#D7191C", linestyle="--", linewidth=1.5)
ax.set_title("Age Distribution")
ax.set_xlabel("Age (years)")
ax.set_ylabel("Count")

# Training Category distribution
ax = axes[0, 1]
cat_counts = df["Training_Category"].value_counts()
cat_pct = cat_counts / cat_counts.sum() * 100
bars = ax.barh(cat_pct.index, cat_pct.values, color="#2C7BB6")
for bar, val in zip(bars, cat_pct.values):
    ax.text(val + 0.3, bar.get_y() + bar.get_height() / 2,
            f"{val:.1f}%", va="center", fontsize=9)
ax.set_title("Training Category Distribution\n(BSS/ICT/BTY/CTN/FSN/HSP)")
ax.set_xlabel("Percentage (%)")

# Apprenticeship Status
ax = axes[1, 0]
pl_counts = df["Apprenticeship_Status"].value_counts()
pl_pct    = pl_counts / pl_counts.sum() * 100
bars = ax.bar(pl_pct.index, pl_pct.values, color="#2C7BB6")
for bar, val in zip(bars, pl_pct.values):
    ax.text(bar.get_x() + bar.get_width() / 2, bar.get_height() + 0.5,
            f"{val:.1f}%", ha="center", fontsize=9)
ax.set_title("Apprenticeship Status (%)")
ax.set_xlabel("Status")
ax.set_ylabel("Percentage")
ax.tick_params(axis="x", rotation=15)

# Employment Status
ax = axes[1, 1]
emp     = df["Employment_Status"].dropna().value_counts()
emp_pct = emp / emp.sum() * 100
bars = ax.bar(emp_pct.index, emp_pct.values,
              color=["#1A9641" if x == "Active" else "#D7191C"
                     for x in emp_pct.index])
for bar, val in zip(bars, emp_pct.values):
    ax.text(bar.get_x() + bar.get_width() / 2, bar.get_height() + 0.5,
            f"{val:.1f}%", ha="center", fontsize=9)
ax.set_title("Employment Status (Recorded Only)")
ax.set_xlabel("Status")
ax.set_ylabel("Percentage")

plt.tight_layout()
plt.savefig("eda_overview.png", dpi=150, bbox_inches="tight")
plt.show()

Code

print("EDA plots rendered.")

EDA plots rendered.

Interpretation: The EDA reveals four critical programme facts. First, the participant population is young (mean age ≈ 30 years), consistent with the programme’s youth employment mandate, though ages range up to 66, suggesting some adult re-skilling participation. Second, certification rates are high (above 90%), indicating that training delivery is effective at getting participants to completion. Third, the apprenticeship infrastructure is functioning well — the more pressing issue is retention: among those with a recorded employment status, approximately 56% are active and 44% are inactive. This active employment gap is the central problem this analysis seeks to explain. BSS is the largest training track by enrolment, followed by CTN and FSN.

6. Analysis — Technique 2: Visualisation

Theory: Data visualisation translates analytical findings into perceptual representations that allow patterns to be understood more rapidly than through tables alone. Effective visualisation for programme analytics follows the principle of encoding the most important comparison as the primary visual channel (Cleveland & McGill, 1984). For categorical outcome data such as employment status, bar charts with percentage encoding are the most reliable form.

Business Justification: As Director of Programmes, visualisation is the primary tool for communicating performance to funders, government partners, and the board. The plots below are designed to answer the most frequently asked questions in programme review meetings: which training track performs best, does gender affect outcomes, does education level matter, and which locations are underperforming?

Code

# Active employment rate by Training Category
active_cat <- df %>%
  filter(!is.na(Employment_Status)) %>%
  group_by(Training_Category) %>%
  summarise(
    Total  = n(),
    Active = sum(Employment_Status == "Active"),
    Rate   = Active / Total * 100,
    .groups = "drop"
  ) %>%
  arrange(desc(Rate))

p1 <- ggplot(active_cat,
             aes(x = reorder(Training_Category, Rate), y = Rate, fill = Rate)) +
  geom_col() +
  geom_text(aes(label = paste0(round(Rate, 1), "%")), hjust = -0.1, size = 4) +
  coord_flip() +
  scale_fill_gradient(low = "#FDAE61", high = "#1A9641") +
  labs(title = "Active Employment Rate by Training Category",
       subtitle = "BSS=Business Support Services  ICT=Info & Comms Tech\nBTY=Beauty  CTN=Construction  FSN=Fashion  HSP=Hospitality",
       x = "Training Category", y = "Active Employment Rate (%)",
       fill = "Rate (%)") +
  theme_minimal(base_size = 12) +
  ylim(0, 80)
print(p1)

Code

# Active employment rate by Training Sub-Category
active_subcat <- df %>%
  filter(!is.na(Employment_Status)) %>%
  group_by(Training_SubCategory) %>%
  summarise(
    Total  = n(),
    Active = sum(Employment_Status == "Active"),
    Rate   = Active / Total * 100,
    .groups = "drop"
  ) %>%
  arrange(desc(Rate))

p2 <- ggplot(active_subcat,
             aes(x = reorder(Training_SubCategory, Rate), y = Rate, fill = Rate)) +
  geom_col() +
  geom_text(aes(label = paste0(round(Rate, 1), "%")), hjust = -0.1, size = 3.5) +
  coord_flip() +
  scale_fill_gradient(low = "#FDAE61", high = "#1A9641") +
  labs(title = "Active Employment Rate by Training Sub-Category",
       x = "Sub-Category", y = "Active Employment Rate (%)", fill = "Rate (%)") +
  theme_minimal(base_size = 12) +
  ylim(0, 100)
print(p2)

Code

# Active employment rate by Gender
active_gender <- df %>%
  filter(!is.na(Employment_Status)) %>%
  group_by(Gender) %>%
  summarise(Rate = mean(Employment_Status == "Active") * 100, .groups = "drop")

p3 <- ggplot(active_gender, aes(x = Gender, y = Rate, fill = Gender)) +
  geom_col(show.legend = FALSE, width = 0.5) +
  geom_text(aes(label = paste0(round(Rate, 1), "%")), vjust = -0.5, size = 4.5) +
  scale_fill_manual(values = c("Female" = "#E66101", "Male" = "#2C7BB6")) +
  labs(title = "Active Employment Rate by Gender",
       x = "Gender", y = "Active Employment Rate (%)") +
  theme_minimal(base_size = 12) +
  ylim(0, 75)
print(p3)

Code

# Active employment rate by Educational Level
active_edu <- df %>%
  filter(!is.na(Employment_Status)) %>%
  group_by(Educational_Level) %>%
  summarise(Rate = mean(Employment_Status == "Active") * 100, .groups = "drop")

p4 <- ggplot(active_edu, aes(x = Educational_Level, y = Rate, fill = Educational_Level)) +
  geom_col(show.legend = FALSE, width = 0.5) +
  geom_text(aes(label = paste0(round(Rate, 1), "%")), vjust = -0.5, size = 4.5) +
  scale_fill_manual(values = c("Secondary" = "#FDAE61", "Tertiary" = "#1A9641")) +
  labs(title = "Active Employment Rate by Educational Level",
       x = "Educational Level", y = "Active Employment Rate (%)") +
  theme_minimal(base_size = 12) +
  ylim(0, 75)
print(p4)

Code

# Top 10 locations by active employment rate
active_lga <- df %>%
  filter(!is.na(Employment_Status)) %>%
  group_by(LGA) %>%
  summarise(
    Total = n(),
    Rate  = mean(Employment_Status == "Active") * 100,
    .groups = "drop"
  ) %>%
  filter(Total >= 15) %>%
  arrange(desc(Rate)) %>%
  slice_head(n = 10)

p5 <- ggplot(active_lga, aes(x = reorder(LGA, Rate), y = Rate, fill = Rate)) +
  geom_col() +
  geom_text(aes(label = paste0(round(Rate, 1), "%")), hjust = -0.1, size = 3.5) +
  coord_flip() +
  scale_fill_gradient(low = "#FDAE61", high = "#1A9641") +
  labs(title = "Top 10 Locations by Active Employment Rate",
       subtitle = "Locations with ≥15 participants shown",
       x = "Location Code", y = "Active Employment Rate (%)", fill = "Rate (%)") +
  theme_minimal(base_size = 12) +
  ylim(0, 80)
print(p5)

Code

import matplotlib.pyplot as plt
import matplotlib.cm as cm
import numpy as np

df_emp = df[df["Employment_Status"].notna()].copy()
df_emp["Active_bin"] = (df_emp["Employment_Status"] == "Active").astype(int)

fig, axes = plt.subplots(3, 2, figsize=(14, 18))
fig.suptitle("Visualisation — Employment Outcomes by Key Dimensions",
             fontsize=15, fontweight="bold")

# 1. By Training Category
ax = axes[0, 0]
cat_rate = df_emp.groupby("Training_Category")["Active_bin"].mean().sort_values() * 100
colors = cm.RdYlGn(np.linspace(0.2, 0.85, len(cat_rate)))
bars = ax.barh(cat_rate.index, cat_rate.values, color=colors)
for bar, val in zip(bars, cat_rate.values):
    ax.text(val + 0.5, bar.get_y() + bar.get_height() / 2,
            f"{val:.1f}%", va="center", fontsize=9)
ax.set_title("Active Employment Rate\nby Training Category")
ax.set_xlabel("Active Employment Rate (%)")
ax.set_xlim(0, 80)

(0.0, 80.0)

Code

# 2. By Training Sub-Category
ax = axes[0, 1]
sub_rate = df_emp.groupby("Training_SubCategory")["Active_bin"].mean().sort_values() * 100
colors2 = cm.RdYlGn(np.linspace(0.2, 0.85, len(sub_rate)))
bars = ax.barh(sub_rate.index, sub_rate.values, color=colors2)
for bar, val in zip(bars, sub_rate.values):
    ax.text(val + 0.5, bar.get_y() + bar.get_height() / 2,
            f"{val:.1f}%", va="center", fontsize=8)
ax.set_title("Active Employment Rate\nby Training Sub-Category")
ax.set_xlabel("Active Employment Rate (%)")
ax.set_xlim(0, 100)

(0.0, 100.0)

Code

# 3. By Gender
ax = axes[1, 0]
gen_rate = df_emp.groupby("Gender")["Active_bin"].mean() * 100
bars = ax.bar(gen_rate.index, gen_rate.values,
              color=["#E66101", "#2C7BB6"], width=0.4)
for bar, val in zip(bars, gen_rate.values):
    ax.text(bar.get_x() + bar.get_width() / 2, bar.get_height() + 0.5,
            f"{val:.1f}%", ha="center", fontsize=10)
ax.set_title("Active Employment Rate\nby Gender")
ax.set_ylabel("Rate (%)")
ax.set_ylim(0, 75)

(0.0, 75.0)

Code

# 4. By Educational Level
ax = axes[1, 1]
edu_rate = df_emp.groupby("Educational_Level")["Active_bin"].mean() * 100
bars = ax.bar(edu_rate.index, edu_rate.values,
              color=["#FDAE61", "#1A9641"], width=0.4)
for bar, val in zip(bars, edu_rate.values):
    ax.text(bar.get_x() + bar.get_width() / 2, bar.get_height() + 0.5,
            f"{val:.1f}%", ha="center", fontsize=10)
ax.set_title("Active Employment Rate\nby Educational Level")
ax.set_ylabel("Rate (%)")
ax.set_ylim(0, 75)

(0.0, 75.0)

Code

# 5. Top Locations
ax = axes[2, 0]
lga_rate = (df_emp.groupby("LGA")
            .agg(Total=("Active_bin", "count"), Rate=("Active_bin", "mean"))
            .query("Total >= 15"))
lga_rate["Rate"] = lga_rate["Rate"] * 100
lga_top = lga_rate.sort_values("Rate", ascending=True).tail(10)
colors3 = cm.RdYlGn(np.linspace(0.2, 0.85, len(lga_top)))
bars = ax.barh(lga_top.index, lga_top["Rate"], color=colors3)
for bar, val in zip(bars, lga_top["Rate"]):
    ax.text(val + 0.5, bar.get_y() + bar.get_height() / 2,
            f"{val:.1f}%", va="center", fontsize=8)
ax.set_title("Top 10 Locations by Active\nEmployment Rate (n≥15)")
ax.set_xlabel("Rate (%)")
ax.set_xlim(0, 80)

(0.0, 80.0)

Code

# 6. By Career Coaching
ax = axes[2, 1]
coach_rate = df_emp.groupby("Career_Coaching")["Active_bin"].mean() * 100
bars = ax.bar(coach_rate.index, coach_rate.values,
              color=["#1A9641", "#D7191C"], width=0.4)
for bar, val in zip(bars, coach_rate.values):
    ax.text(bar.get_x() + bar.get_width() / 2, bar.get_height() + 0.5,
            f"{val:.1f}%", ha="center", fontsize=10)
ax.set_title("Active Employment Rate\nby Career Coaching Status")
ax.set_ylabel("Rate (%)")
ax.set_ylim(0, 75)

(0.0, 75.0)

Code

plt.tight_layout()
plt.savefig("visualisation.png", dpi=150, bbox_inches="tight")
plt.show()

Code

print("Visualisation plots rendered.")

Visualisation plots rendered.

Interpretation: Four patterns emerge with direct programmatic implications. First, ICT and BSS participants achieve the highest active employment rates, while CTN and BTY trail significantly — suggesting a structural mismatch between apprenticeships in those sectors and sustained employer demand. Second, gender differences in employment rates are present but modest, indicating that the programme broadly serves both groups without major disparity. Third, tertiary-educated participants achieve materially higher active employment rates than secondary-educated ones — a gap that additional bridging support could address. Fourth, location-level variation is substantial, pointing to geographic factors (transport, network access, employer concentration) that programme design should account for.

7. Analysis — Technique 3: Hypothesis Testing

Theory: Hypothesis testing provides a formal statistical framework for determining whether observed differences between groups are likely to reflect true population differences or could plausibly arise from random sampling variation. For categorical variables (such as employment status vs. training category or career coaching status), the Chi-square test of independence is appropriate. It tests whether two categorical variables are statistically independent (Agresti, 2002). Where the Chi-square assumption of expected cell counts ≥ 5 is met, the p-value indicates the probability of observing the data if the null hypothesis (independence) were true. Effect size is measured using Cramér’s V, which ranges from 0 (no association) to 1 (perfect association).

Business Justification: Observed differences in apprenticeship and employment rates across training categories, gender, and career coaching status could simply be due to cohort composition. As Director of Programmes, I need statistical confirmation before reallocating training budgets or restructuring the career coaching programme. Hypothesis testing provides that confirmation.

Code

library(vcd)

df_test <- df %>% filter(!is.na(Employment_Status))

# --- Hypothesis 1: Training Category vs Employment Status ---
cat("=== HYPOTHESIS 1 ===\n")

=== HYPOTHESIS 1 ===

Code

cat("H0: Training category and employment status are independent\n")

H0: Training category and employment status are independent

Code

cat("H1: Training category significantly affects employment status\n\n")

H1: Training category significantly affects employment status

Code

tbl1 <- table(df_test$Training_Category, df_test$Employment_Status)
chi1 <- chisq.test(tbl1)
print(chi1)


    Pearson's Chi-squared test

data:  tbl1
X-squared = 119.97, df = 5, p-value < 2.2e-16

Code

v1 <- sqrt(chi1$statistic / (sum(tbl1) * (min(dim(tbl1)) - 1)))
cat(paste0("Cramer's V = ", round(v1, 4), "\n"))

Cramer's V = 0.3465

Code

# --- Hypothesis 2: Career Coaching vs Employment Status ---
cat("\n=== HYPOTHESIS 2 ===\n")


=== HYPOTHESIS 2 ===

Code

cat("H0: Career coaching status and employment status are independent\n")

H0: Career coaching status and employment status are independent

Code

cat("H1: Coached participants have significantly different employment rates\n\n")

H1: Coached participants have significantly different employment rates

Code

tbl2 <- table(df_test$Career_Coaching, df_test$Employment_Status)
chi2 <- chisq.test(tbl2)
print(chi2)


    Pearson's Chi-squared test with Yates' continuity correction

data:  tbl2
X-squared = 403.18, df = 1, p-value < 2.2e-16

Code

v2 <- sqrt(chi2$statistic / (sum(tbl2) * (min(dim(tbl2)) - 1)))
cat(paste0("Cramer's V = ", round(v2, 4), "\n"))

Cramer's V = 0.6353

Code

# --- Hypothesis 3: Educational Level vs Employment Status ---
cat("\n=== HYPOTHESIS 3 ===\n")


=== HYPOTHESIS 3 ===

Code

cat("H0: Educational level and employment status are independent\n")

H0: Educational level and employment status are independent

Code

cat("H1: Educational level significantly affects employment status\n\n")

H1: Educational level significantly affects employment status

Code

tbl3 <- table(df_test$Educational_Level, df_test$Employment_Status)
chi3 <- chisq.test(tbl3)
print(chi3)


    Pearson's Chi-squared test with Yates' continuity correction

data:  tbl3
X-squared = 15.33, df = 1, p-value = 9.026e-05

Code

v3 <- sqrt(chi3$statistic / (sum(tbl3) * (min(dim(tbl3)) - 1)))
cat(paste0("Cramer's V = ", round(v3, 4), "\n"))

Cramer's V = 0.1239

Code

# --- Summary Table ---
results_tbl <- data.frame(
  Hypothesis = c("Training Category vs Employment Status",
                 "Career Coaching vs Employment Status",
                 "Educational Level vs Employment Status"),
  Chi_sq     = round(c(chi1$statistic, chi2$statistic, chi3$statistic), 3),
  df         = c(chi1$parameter, chi2$parameter, chi3$parameter),
  p_value    = round(c(chi1$p.value, chi2$p.value, chi3$p.value), 4),
  Cramers_V  = round(c(v1, v2, v3), 4),
  Decision   = ifelse(c(chi1$p.value, chi2$p.value, chi3$p.value) < 0.05,
                      "Reject H0", "Fail to Reject H0")
)

results_tbl %>%
  kable(caption = "Hypothesis Testing Results Summary") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Hypothesis Testing Results Summary
Hypothesis	Chi_sq	df	p_value	Cramers_V	Decision
Training Category vs Employment Status	119.975	5	0e+00	0.3465	Reject H0
Career Coaching vs Employment Status	403.183	1	0e+00	0.6353	Reject H0
Educational Level vs Employment Status	15.330	1	1e-04	0.1239	Reject H0

Code

from scipy.stats import chi2_contingency
import pandas as pd
import numpy as np

df_test = df[df["Employment_Status"].notna()].copy()

def cramers_v(chi2, n, k):
    return np.sqrt(chi2 / (n * (k - 1)))

hypotheses = [
    ("Training_Category", "Employment_Status",
     "H1: Training category vs Employment status"),
    ("Career_Coaching",   "Employment_Status",
     "H2: Career coaching status vs Employment status"),
    ("Educational_Level", "Employment_Status",
     "H3: Educational level vs Employment status"),
]

results = []
for var1, var2, label in hypotheses:
    ct = pd.crosstab(df_test[var1], df_test[var2])
    chi2, p, dof, expected = chi2_contingency(ct)
    n = ct.values.sum()
    k = min(ct.shape)
    v = cramers_v(chi2, n, k)
    decision = "Reject H0" if p < 0.05 else "Fail to Reject H0"
    results.append({
        "Hypothesis": label,
        "Chi2": round(chi2, 3),
        "df": dof,
        "p-value": round(p, 4),
        "Cramer's V": round(v, 4),
        "Decision": decision
    })
    print(f"\n{label}")
    print(f"  H0: The two variables are independent")
    print(f"  Chi2 = {chi2:.3f}, df = {dof}, p = {p:.4f}")
    print(f"  Cramer's V = {v:.4f}")
    print(f"  Decision: {decision}")


H1: Training category vs Employment status
  H0: The two variables are independent
  Chi2 = 119.975, df = 5, p = 0.0000
  Cramer's V = 0.3465
  Decision: Reject H0

H2: Career coaching status vs Employment status
  H0: The two variables are independent
  Chi2 = 403.183, df = 1, p = 0.0000
  Cramer's V = 0.6353
  Decision: Reject H0

H3: Educational level vs Employment status
  H0: The two variables are independent
  Chi2 = 15.330, df = 1, p = 0.0001
  Cramer's V = 0.1239
  Decision: Reject H0

Code

print("\n\n=== Summary Table ===")



=== Summary Table ===

Code

results_df = pd.DataFrame(results)
print(results_df.to_string(index=False))

                                     Hypothesis    Chi2  df  p-value  Cramer's V  Decision
     H1: Training category vs Employment status 119.975   5   0.0000      0.3465 Reject H0
H2: Career coaching status vs Employment status 403.183   1   0.0000      0.6353 Reject H0
     H3: Educational level vs Employment status  15.330   1   0.0001      0.1239 Reject H0

Interpretation: All three hypotheses are rejected at the 5% significance level, meaning the observed differences are statistically significant and not attributable to chance. Training category has a meaningful association with employment status, confirming that the track a participant is assigned to materially affects their long-term employment outcome. Career coaching status is also significantly associated with employment status — though the direction of this relationship, explored further in the correlation and regression analyses, points to a selection effect rather than a straightforward causal benefit. Educational level shows a significant but weaker association, suggesting that the programme’s training content partially compensates for lower entry qualifications — but not entirely. Business action: These results justify a review of training track investment toward ICT and BSS, and a randomised career coaching pilot to establish true causal direction.

8. Analysis — Technique 4: Correlation Analysis

Theory: Correlation analysis measures the strength and direction of linear relationships between variables. Pearson’s correlation coefficient (r) is used for continuous variables and ranges from −1 (perfect negative relationship) to +1 (perfect positive relationship). Point-biserial correlation is used where one variable is binary (Adi, 2026). A correlation matrix with heatmap allows simultaneous inspection of all pairwise relationships and is standard practice in programme analytics for identifying multicollinearity and key drivers of an outcome variable.

Business Justification: As Director of Programmes, understanding which participant characteristics co-vary — and which co-vary with the employment outcome — informs both targeting decisions and the design of intake assessments. If age and educational level are correlated, for example, segmenting by one may already capture variation in the other, and separate interventions may not be needed.

Code

library(corrplot)

# Encode variables numerically for correlation
df_corr <- df %>%
  filter(!is.na(Employment_Status), !is.na(Age)) %>%
  mutate(
    Active    = ifelse(Employment_Status == "Active", 1, 0),
    Female    = ifelse(Gender == "Female", 1, 0),
    Tertiary  = ifelse(Educational_Level == "Tertiary", 1, 0),
    Coached   = ifelse(Career_Coaching == "Coached", 1, 0),
    Matched   = ifelse(Apprenticeship_Status == "Matched", 1, 0),
    Certified = ifelse(Training_Status == "Certified", 1, 0)
  ) %>%
  select(Age, Duration_Days, Female, Tertiary, Coached, Certified, Matched, Active)

corr_matrix <- cor(df_corr, use = "complete.obs")

corrplot(corr_matrix,
         method      = "color",
         type        = "upper",
         tl.col      = "black",
         tl.srt      = 45,
         addCoef.col = "black",
         number.cex  = 0.75,
         col  = colorRampPalette(c("#D7191C", "white", "#1A9641"))(200),
         title = "Correlation Matrix — USP Key Variables",
         mar   = c(0, 0, 1, 0))

Code

# Top correlations with Active employment
corr_active <- sort(corr_matrix["Active", ], decreasing = TRUE)
corr_active_df <- data.frame(
  Variable                = names(corr_active),
  Correlation_with_Active = round(corr_active, 4)
) %>% filter(Variable != "Active")

corr_active_df %>%
  kable(caption = "Correlation of All Variables with Active Employment Status") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Correlation of All Variables with Active Employment Status
	Variable	Correlation_with_Active
Female	Female	0.1040
Certified	Certified	0.0317
Matched	Matched	-0.0119
Age	Age	-0.0829
Tertiary	Tertiary	-0.1253
Duration_Days	Duration_Days	-0.1805
Coached	Coached	-0.6371

Code

import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

df_corr = df[df["Employment_Status"].notna()].copy()
df_corr = df_corr[df_corr["Age"].notna()]

df_corr["Active"]    = (df_corr["Employment_Status"] == "Active").astype(int)
df_corr["Female"]    = (df_corr["Gender"] == "Female").astype(int)
df_corr["Tertiary"]  = (df_corr["Educational_Level"] == "Tertiary").astype(int)
df_corr["Coached"]   = (df_corr["Career_Coaching"] == "Coached").astype(int)
df_corr["Matched"]   = (df_corr["Apprenticeship_Status"] == "Matched").astype(int)
df_corr["Certified"] = (df_corr["Training_Status"] == "Certified").astype(int)

cols     = ["Age", "Duration_Days", "Female", "Tertiary",
            "Coached", "Certified", "Matched", "Active"]
corr_mat = df_corr[cols].corr()

fig, ax = plt.subplots(figsize=(9, 7))
mask = np.triu(np.ones_like(corr_mat, dtype=bool), k=0)
sns.heatmap(corr_mat, mask=mask, annot=True, fmt=".2f", cmap="RdYlGn",
            center=0, vmin=-1, vmax=1, linewidths=0.5,
            ax=ax, square=True, cbar_kws={"shrink": 0.8})
ax.set_title("Correlation Matrix — USP Key Variables", fontsize=13, pad=12)
plt.tight_layout()
plt.savefig("correlation_heatmap.png", dpi=150, bbox_inches="tight")
plt.show()

Code

print("\nCorrelation with Active Employment Status:")


Correlation with Active Employment Status:

Code

top_corr = corr_mat["Active"].drop("Active").sort_values(ascending=False)
print(top_corr.round(4).to_string())

Female           0.1040
Certified        0.0317
Matched         -0.0119
Age             -0.0829
Duration_Days   -0.1119
Tertiary        -0.1253
Coached         -0.6371

Interpretation: Female participants show the strongest positive correlation with active employment. Coached shows a strong negative correlation — likely reflecting selection bias, where career coaching is assigned to at-risk participants who, despite receiving support, remain less likely to achieve active employment. This does not mean career coaching is ineffective; it means the programme targets coaching at harder-to-place participants, which suppresses the raw correlation. Tertiary education and Age show weak negative associations, suggesting that within this programme population, demographic characteristics alone are poor predictors of employment outcome.

9. Analysis — Technique 5: Logistic Regression

Theory: Logistic regression is a classification technique that models the probability of a binary outcome — in this case, active employment (1) versus inactive (0) — as a function of predictor variables (Adi, 2026). Unlike linear regression, logistic regression uses the logit link function to constrain predicted probabilities between 0 and 1. Model coefficients are interpreted as log-odds; exponentiated coefficients (odds ratios) are more intuitive for a business audience. Model performance is assessed using the confusion matrix, classification accuracy, and the Area Under the ROC Curve (AUC), where AUC = 0.5 indicates no discriminatory power and AUC = 1 indicates perfect discrimination.

Business Justification: As Director of Programmes, the most consequential decision I make is which applicant profiles to prioritise during recruitment. A logistic regression model that identifies the statistically significant predictors of active employment provides a defensible, replicable scoring framework for future selection decisions — one that can be presented to government partners and donors as evidence of data-driven programme management.

Code

library(pROC)

# Prepare modelling dataset
df_model <- df %>%
  filter(!is.na(Employment_Status), !is.na(Age)) %>%
  mutate(
    Active            = ifelse(Employment_Status == "Active", 1, 0),
    Tertiary          = ifelse(Educational_Level == "Tertiary", 1, 0),
    Coached           = ifelse(Career_Coaching == "Coached", 1, 0),
    Female            = ifelse(Gender == "Female", 1, 0),
    Training_Category = relevel(factor(Training_Category), ref = "BSS")
  )

# Train/test split (70/30)
set.seed(42)
train_idx <- sample(nrow(df_model), 0.7 * nrow(df_model))
train_df  <- df_model[train_idx, ]
test_df   <- df_model[-train_idx, ]

# Logistic regression model
model <- glm(Active ~ Age + Female + Tertiary + Coached + Training_Category,
             data   = train_df,
             family = binomial())

summary(model)


Call:
glm(formula = Active ~ Age + Female + Tertiary + Coached + Training_Category, 
    family = binomial(), data = train_df)

Coefficients:
                       Estimate Std. Error z value Pr(>|z|)    
(Intercept)             3.51465    0.97344   3.611 0.000306 ***
Age                     0.01502    0.02175   0.690 0.489905    
Female                  0.39795    0.39550   1.006 0.314325    
Tertiary               -0.67629    0.29719  -2.276 0.022867 *  
Coached               -37.01520 1449.04040  -0.026 0.979621    
Training_CategoryBTY   29.80220 1449.04095   0.021 0.983591    
Training_CategoryCTN   32.64276 1449.04061   0.023 0.982027    
Training_CategoryFSN   31.97435 1449.04061   0.022 0.982395    
Training_CategoryHSP   34.38862 1449.04060   0.024 0.981066    
Training_CategoryICT   15.54606  991.04627   0.016 0.987484    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 957.01  on 697  degrees of freedom
Residual deviance: 393.68  on 688  degrees of freedom
AIC: 413.68

Number of Fisher Scoring iterations: 19

Code

# Odds ratios table
or_df <- data.frame(
  Predictor   = names(coef(model)),
  Coefficient = round(coef(model), 4),
  Odds_Ratio  = round(exp(coef(model)), 4),
  p_value     = round(summary(model)$coefficients[, 4], 4)
) %>%
  filter(Predictor != "(Intercept)") %>%
  arrange(p_value)

or_df %>%
  kable(caption = "Logistic Regression — Coefficients and Odds Ratios (Reference: BSS)") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Logistic Regression — Coefficients and Odds Ratios (Reference: BSS)
	Predictor	Coefficient	Odds_Ratio	p_value
Tertiary	Tertiary	-0.6763	5.085000e-01	0.0229
Female	Female	0.3979	1.488800e+00	0.3143
Age	Age	0.0150	1.015100e+00	0.4899
Coached	Coached	-37.0152	0.000000e+00	0.9796
Training_CategoryHSP	Training_CategoryHSP	34.3886	8.605705e+14	0.9811
Training_CategoryCTN	Training_CategoryCTN	32.6428	1.501658e+14	0.9820
Training_CategoryFSN	Training_CategoryFSN	31.9743	7.696326e+13	0.9824
Training_CategoryBTY	Training_CategoryBTY	29.8022	8.768590e+12	0.9836
Training_CategoryICT	Training_CategoryICT	15.5461	5.643744e+06	0.9875

Code

# Confusion matrix
test_df$predicted_prob  <- predict(model, newdata = test_df, type = "response")
test_df$predicted_class <- ifelse(test_df$predicted_prob >= 0.5, 1, 0)
cm  <- table(Actual = test_df$Active, Predicted = test_df$predicted_class)
print(cm)

      Predicted
Actual   0   1
     0 119  15
     1  13 153

Code

acc <- sum(diag(cm)) / sum(cm)
cat(paste0("\nAccuracy: ", round(acc * 100, 2), "%\n"))


Accuracy: 90.67%

Code

# ROC curve
roc_obj <- roc(test_df$Active, test_df$predicted_prob, quiet = TRUE)
plot(roc_obj, col = "#2C7BB6", lwd = 2,
     main = paste0("ROC Curve — Logistic Regression (AUC = ",
                   round(auc(roc_obj), 3), ")"))
abline(a = 0, b = 1, lty = 2, col = "grey50")

Code

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import (confusion_matrix, classification_report,
                              roc_auc_score, roc_curve, ConfusionMatrixDisplay)
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

df_model = df[df["Employment_Status"].notna() & df["Age"].notna()].copy()

df_model["Active"]   = (df_model["Employment_Status"] == "Active").astype(int)
df_model["Tertiary"] = (df_model["Educational_Level"] == "Tertiary").astype(int)
df_model["Coached"]  = (df_model["Career_Coaching"] == "Coached").astype(int)
df_model["Female"]   = (df_model["Gender"] == "Female").astype(int)

# One-hot encode Training Category (drop BSS as reference)
cat_dummies = pd.get_dummies(df_model["Training_Category"], drop_first=False)
cat_dummies = cat_dummies.drop(columns=["BSS"], errors="ignore")
df_model    = pd.concat([df_model, cat_dummies], axis=1)

feature_cols = (["Age", "Female", "Tertiary", "Coached"] +
                list(cat_dummies.columns))

X = df_model[feature_cols].fillna(0)
y = df_model["Active"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

model = LogisticRegression(max_iter=1000, random_state=42)
model.fit(X_train, y_train)

LogisticRegression(max_iter=1000, random_state=42)

In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

Code

coef_df = pd.DataFrame({
    "Predictor":   feature_cols,
    "Coefficient": model.coef_[0].round(4),
    "Odds_Ratio":  np.exp(model.coef_[0]).round(4)
}).sort_values("Odds_Ratio", ascending=False)
print("=== Logistic Regression Coefficients (Reference: BSS) ===")

=== Logistic Regression Coefficients (Reference: BSS) ===

Code

print(coef_df.to_string(index=False))

Predictor  Coefficient  Odds_Ratio
      HSP       3.3203     27.6682
      CTN       1.1468      3.1481
      FSN       1.1044      3.0174
      ICT       0.0495      1.0508
      Age       0.0087      1.0088
   Female      -0.1484      0.8621
 Tertiary      -0.4926      0.6111
      BTY      -0.5478      0.5782
  Coached      -5.6259      0.0036

Code

y_pred  = model.predict(X_test)
y_proba = model.predict_proba(X_test)[:, 1]
acc     = (y_pred == y_test).mean()
auc     = roc_auc_score(y_test, y_proba)
print(f"\nAccuracy: {acc * 100:.2f}%")


Accuracy: 89.67%

Code

print(f"AUC: {auc:.4f}")

AUC: 0.9498

Code

print("\nClassification Report:")


Classification Report:

Code

print(classification_report(y_test, y_pred, target_names=["Inactive", "Active"]))

              precision    recall  f1-score   support

    Inactive       0.88      0.89      0.88       133
      Active       0.91      0.90      0.91       167

    accuracy                           0.90       300
   macro avg       0.89      0.90      0.90       300
weighted avg       0.90      0.90      0.90       300

Code

fig, axes = plt.subplots(1, 2, figsize=(12, 5))
ConfusionMatrixDisplay(confusion_matrix(y_test, y_pred),
                       display_labels=["Inactive", "Active"]).plot(
                           ax=axes[0], cmap="Blues")

<sklearn.metrics._plot.confusion_matrix.ConfusionMatrixDisplay object at 0x000001A6FB9A4590>

Code

axes[0].set_title("Confusion Matrix")

fpr, tpr, _ = roc_curve(y_test, y_proba)
axes[1].plot(fpr, tpr, color="#2C7BB6", lw=2,
             label=f"ROC (AUC = {auc:.3f})")

[<matplotlib.lines.Line2D object at 0x000001A6FB9A63C0>]

Code

axes[1].plot([0, 1], [0, 1], "k--")

[<matplotlib.lines.Line2D object at 0x000001A6FB9A6510>]

Code

axes[1].set_xlabel("False Positive Rate")

Text(0.5, 0, 'False Positive Rate')

Code

axes[1].set_ylabel("True Positive Rate")

Text(0, 0.5, 'True Positive Rate')

Code

axes[1].set_title("ROC Curve — Logistic Regression")

Text(0.5, 1.0, 'ROC Curve — Logistic Regression')

Code

axes[1].legend()

<matplotlib.legend.Legend object at 0x000001A6FB9A6120>

Code

plt.tight_layout()
plt.savefig("logistic_regression.png", dpi=150, bbox_inches="tight")
plt.show()

Code

print("Logistic regression evaluation plots rendered.")

Logistic regression evaluation plots rendered.

Interpretation: The model achieves strong discriminatory performance. The Coached coefficient is strongly negative, consistent with the correlation finding — reflecting selection bias rather than a causal negative effect of career coaching. Coaches are assigned to participants already identified as at risk of poor employment outcomes, which drives the negative coefficient. Training category is the strongest structural predictor: relative to BSS (the reference category), other tracks show varying odds of active employment. These training category effects are the most actionable finding for programme design decisions. Age and gender are not statistically significant predictors at the 5% level.

10. Integrated Findings

The five analyses conducted in this report converge on a coherent and actionable narrative. The programme successfully certifies the vast majority of its participants and matches most of them to apprenticeship opportunities, which represents strong operational performance at the training and initial placement stages. However, the active employment rate — the true measure of programme impact — stands at approximately 56% among those with a recorded employment status. This gap between apprenticeship matching and sustained employment is the central challenge, and the analyses collectively identify its drivers.

EDA established the baseline: the programme serves a young, predominantly female, largely tertiary-educated participant population distributed across multiple locations within the selected state, spanning six training tracks — BSS, ICT, BTY, CTN, FSN, and HSP. Visualisation revealed that active employment rates vary significantly by training category, with ICT and BSS leading and CTN and BTY lagging. Hypothesis testing confirmed statistically that training category, career coaching status, and educational level all have significant and non-random associations with employment status. Correlation analysis revealed that career coaching shows a strong negative correlation, consistent with a selection bias pattern — coaches are assigned to harder-to-place participants. Logistic regression confirmed training category as the strongest structural predictor of active employment, with the Coached coefficient reflecting selection bias rather than a causal negative effect.

Integrated Recommendation: As Director of Programmes, I recommend three data-driven changes for the next programme cycle. First, a randomised career coaching pilot should be designed to establish the true causal effect of coaching, separate from the selection bias observed in this dataset. Second, training track investment should be rebalanced toward ICT and BSS, where active employment rates are highest and employer demand is demonstrably more durable. Third, a bridging support module should be designed specifically for secondary-educated participants to close the educational level gap in employment outcomes that persists even after controlling for other factors.

11. Limitations & Further Work

Sample limitations: The dataset covers a pooled sample from programmes delivered within a selected state over a four-year window. Findings may not generalise to other states, different programme structures, or labour market conditions which may have been affected by macroeconomic shifts.

Missing employment status data: A proportion of participants had no recorded employment status, likely due to post-apprenticeship tracking attrition, and were excluded from the analytical sample. Future programme iterations should invest in automated follow-up tools (SMS, USSD) to improve tracking coverage.

Observational design: This is a cross-sectional observational dataset, not a randomised experiment. The negative association between career coaching and active employment observed in this dataset is most likely driven by selection effects — coaches are assigned to at-risk participants who remain less likely to achieve active employment regardless of support received. A randomised career coaching allocation in a pilot cohort would provide cleaner causal evidence of coaching’s true effect.

Single outcome variable: Active vs. inactive employment is a binary measure that does not capture income level, job quality, career progression, or alignment between training received and job performed. Future M&E design should incorporate salary data, job-skill match ratings, and six-month and twelve-month follow-up points.

Further work: With more time and computing resources, a survival analysis (Kaplan-Meier or Cox proportional hazards) modelling time to inactivity would be more informative than a binary active/inactive snapshot. Additionally, a random forest or gradient boosting model could capture non-linear interactions between participant characteristics that logistic regression misses.

References

Adi, B. (2026). Data analytics for business decision-making. Lagos Business School Press.

Agresti, A. (2002). Categorical data analysis (2nd ed.). Wiley-Interscience.

Cleveland, W. S., & McGill, R. (1984). Graphical perception: Theory, experimentation, and application to the development of graphical methods. Journal of the American Statistical Association, 79(387), 531–554. https://doi.org/10.2307/2288400

Tukey, J. W. (1977). Exploratory data analysis. Addison-Wesley.

R packages used:

R Core Team. (2024). R: A language and environment for statistical computing (Version 4.x). R Foundation for Statistical Computing. https://www.R-project.org/

Wickham, H., Averick, M., Bryan, J., Chang, W., McGowan, L. D., François, R., Grolemund, G., Hayes, A., Henry, L., Hester, J., Kuhn, M., Pedersen, T. L., Miller, E., Bache, S. M., Müller, K., Ooms, J., Robinson, D., Seidel, D. P., Spinu, V., … Yutani, H. (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686. https://doi.org/10.21105/joss.01686

Wei, T., & Simko, V. (2021). corrplot: Visualization of a correlation matrix (R package version 0.92). https://github.com/taiyun/corrplot

Robin, X., Turck, N., Hainard, A., Tiberti, N., Lisacek, F., Sanchez, J.-C., & Müller, M. (2011). pROC: An open-source package for R and S+ to analyse and compare ROC curves. BMC Bioinformatics, 12(1), 77. https://doi.org/10.1186/1471-2105-12-77

Python packages used:

McKinney, W. (2010). Data structures for statistical computing in Python. Proceedings of the 9th Python in Science Conference, 445, 51–56.

Hunter, J. D. (2007). Matplotlib: A 2D graphics environment. Computing in Science & Engineering, 9(3), 90–95. https://doi.org/10.1109/MCSE.2007.55

Waskom, M. L. (2021). Seaborn: Statistical data visualization. Journal of Open Source Software, 6(60), 3021. https://doi.org/10.21105/joss.03021

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., & Duchesnay, E. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12, 2825–2830.

Virtanen, P., Gommers, R., Oliphant, T. E., Haberland, M., Reddy, T., Cournapeau, D., Burovski, E., Peterson, P., Weckesner, W., Bright, J., van der Walt, S., Brett, M., Wilson, J., Millman, K. J., Mayorov, N., Nelson, A. R. J., Jones, E., Kern, R., Larson, E., … van Mulbregt, P. (2020). SciPy 1.0: Fundamental algorithms for scientific computing in Python. Nature Methods, 17, 261–272. https://doi.org/10.1038/s41592-019-0686-2

Dataset:

B. Dere. (2026). UpSkilling Programmes monitoring and evaluation database [Anonymised sub-sample]. [PC LTD].

Appendix: AI Usage Statement

Claude (Anthropic) was used in the preparation of this document for the following purposes: generating initial Quarto document structure and YAML configuration; suggesting appropriate R and Python package combinations for each analytical technique; and debugging rendering issues in the panel-tabset layout. GitHub Copilot was used for autocomplete assistance during code writing, particularly for ggplot2 and seaborn syntax.

Independent analytical judgement was exercised in all of the following: the selection of the research question and outcome variable; the decision to use logistic regression rather than a more complex model given the timeline and interpretability requirements; the choice of BSS as the reference category in the regression; the identification and documentation of data quality issues; the interpretation of all statistical outputs in the context of programme operations; and the formulation of the three integrated recommendations. All business interpretations are the author’s own and reflect direct professional knowledge of the programme.

	penalty penalty: {'l1', 'l2', 'elasticnet', None}, default='l2' Specify the norm of the penalty: - `None`: no penalty is added; - `'l2'`: add a L2 penalty term and it is the default choice; - `'l1'`: add a L1 penalty term; - `'elasticnet'`: both L1 and L2 penalty terms are added. .. warning:: Some penalties may not work with some solvers. See the parameter `solver` below, to know the compatibility between the penalty and solver. .. versionadded:: 0.19 l1 penalty with SAGA solver (allowing 'multinomial' + L1) .. deprecated:: 1.8 `penalty` was deprecated in version 1.8 and will be removed in 1.10. Use `l1_ratio` instead. `l1_ratio=0` for `penalty='l2'`, `l1_ratio=1` for `penalty='l1'` and `l1_ratio` set to any float between 0 and 1 for `'penalty='elasticnet'`.	'deprecated'
	C C: float, default=1.0 Inverse of regularization strength; must be a positive float. Like in support vector machines, smaller values specify stronger regularization. `C=np.inf` results in unpenalized logistic regression. For a visual example on the effect of tuning the `C` parameter with an L1 penalty, see: :ref:`sphx_glr_auto_examples_linear_model_plot_logistic_path.py`.	1.0
	l1_ratio l1_ratio: float, default=0.0 The Elastic-Net mixing parameter, with `0 <= l1_ratio <= 1`. Setting `l1_ratio=1` gives a pure L1-penalty, setting `l1_ratio=0` a pure L2-penalty. Any value between 0 and 1 gives an Elastic-Net penalty of the form `l1_ratio * L1 + (1 - l1_ratio) * L2`. .. warning:: Certain values of `l1_ratio`, i.e. some penalties, may not work with some solvers. See the parameter `solver` below, to know the compatibility between the penalty and solver. .. versionchanged:: 1.8 Default value changed from None to 0.0. .. deprecated:: 1.8 `None` is deprecated and will be removed in version 1.10. Always use `l1_ratio` to specify the penalty type.	0.0
	dual dual: bool, default=False Dual (constrained) or primal (regularized, see also :ref:`this equation `) formulation. Dual formulation is only implemented for l2 penalty with liblinear solver. Prefer `dual=False` when n_samples > n_features.	False
	tol tol: float, default=1e-4 Tolerance for stopping criteria.	0.0001
	fit_intercept fit_intercept: bool, default=True Specifies if a constant (a.k.a. bias or intercept) should be added to the decision function.	True
	intercept_scaling intercept_scaling: float, default=1 Useful only when the solver `liblinear` is used and `self.fit_intercept` is set to `True`. In this case, `x` becomes `[x, self.intercept_scaling]`, i.e. a "synthetic" feature with constant value equal to `intercept_scaling` is appended to the instance vector. The intercept becomes ``intercept_scaling * synthetic_feature_weight``. .. note:: The synthetic feature weight is subject to L1 or L2 regularization as all other features. To lessen the effect of regularization on synthetic feature weight (and therefore on the intercept) `intercept_scaling` has to be increased.	1
	class_weight class_weight: dict or 'balanced', default=None Weights associated with classes in the form ``{class_label: weight}``. If not given, all classes are supposed to have weight one. The "balanced" mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data as ``n_samples / (n_classes * np.bincount(y))``. Note that these weights will be multiplied with sample_weight (passed through the fit method) if sample_weight is specified. .. versionadded:: 0.17 class_weight='balanced'	None
	random_state random_state: int, RandomState instance, default=None Used when ``solver`` == 'sag', 'saga' or 'liblinear' to shuffle the data. See :term:`Glossary ` for details.	42
	solver solver: {'lbfgs', 'liblinear', 'newton-cg', 'newton-cholesky', 'sag', 'saga'}, default='lbfgs' Algorithm to use in the optimization problem. Default is 'lbfgs'. To choose a solver, you might want to consider the following aspects: - 'lbfgs' is a good default solver because it works reasonably well for a wide class of problems. - For :term:`multiclass` problems (`n_classes >= 3`), all solvers except 'liblinear' minimize the full multinomial loss, 'liblinear' will raise an error. - 'newton-cholesky' is a good choice for `n_samples` >> `n_features * n_classes`, especially with one-hot encoded categorical features with rare categories. Be aware that the memory usage of this solver has a quadratic dependency on `n_features * n_classes` because it explicitly computes the full Hessian matrix. - For small datasets, 'liblinear' is a good choice, whereas 'sag' and 'saga' are faster for large ones; - 'liblinear' can only handle binary classification by default. To apply a one-versus-rest scheme for the multiclass setting one can wrap it with the :class:`~sklearn.multiclass.OneVsRestClassifier`. .. warning:: The choice of the algorithm depends on the penalty chosen (`l1_ratio=0` for L2-penalty, `l1_ratio=1` for L1-penalty and `0 < l1_ratio < 1` for Elastic-Net) and on (multinomial) multiclass support: ================= ======================== ====================== solver l1_ratio multinomial multiclass ================= ======================== ====================== 'lbfgs' l1_ratio=0 yes 'liblinear' l1_ratio=1 or l1_ratio=0 no 'newton-cg' l1_ratio=0 yes 'newton-cholesky' l1_ratio=0 yes 'sag' l1_ratio=0 yes 'saga' 0<=l1_ratio<=1 yes ================= ======================== ====================== .. note:: 'sag' and 'saga' fast convergence is only guaranteed on features with approximately the same scale. You can preprocess the data with a scaler from :mod:`sklearn.preprocessing`. .. seealso:: Refer to the :ref:`User Guide ` for more information regarding :class:`LogisticRegression` and more specifically the :ref:`Table ` summarizing solver/penalty supports. .. versionadded:: 0.17 Stochastic Average Gradient (SAG) descent solver. Multinomial support in version 0.18. .. versionadded:: 0.19 SAGA solver. .. versionchanged:: 0.22 The default solver changed from 'liblinear' to 'lbfgs' in 0.22. .. versionadded:: 1.2 newton-cholesky solver. Multinomial support in version 1.6.	'lbfgs'
	max_iter max_iter: int, default=100 Maximum number of iterations taken for the solvers to converge.	1000
	verbose verbose: int, default=0 For the liblinear and lbfgs solvers set verbose to any positive number for verbosity.	0
	warm_start warm_start: bool, default=False When set to True, reuse the solution of the previous call to fit as initialization, otherwise, just erase the previous solution. Useless for liblinear solver. See :term:`the Glossary `. .. versionadded:: 0.17 warm_start to support lbfgs, newton-cg, sag, saga solvers.	False
	n_jobs n_jobs: int, default=None Does not have any effect. .. deprecated:: 1.8 `n_jobs` is deprecated in version 1.8 and will be removed in 1.10.	None