Predicting Active Employment Outcomes in the Employment Skills Programme (ESP)
Author: Bisola Dere
Published: May 11, 2026
1. Executive Summary
This study analyses data from the Employment Skills Programme (ESP) of a Nigerian state, a vocational training and job placement initiative that enrolled 2,006 participants across four batches between June 2021 and November 2022. The central business question is: which participant profiles and training categories are most likely to result in active employment, and how can this knowledge make batch design, trainer allocation, and placement partnership decisions more data-driven?
Data were extracted from the programme’s internal monitoring and evaluation database, covering 25 variables including demographic characteristics, training track, mentoring status, certification outcome, placement status, and final employment status. Five analytical techniques were applied: Exploratory Data Analysis (EDA), Visualisation, Hypothesis Testing, Correlation Analysis, and Logistic Regression. Key findings reveal that training category, educational level, and mentoring status are statistically significant predictors of active employment. Participants in Information Technology and Business Support tracks, those with tertiary education, and those who received mentoring consistently show higher rates of active employment. The integrated recommendation is that future batches should prioritise mentoring coverage for all participants and increase investment in IT and Business Support training tracks, while targeted support should be designed for secondary-school-educated participants to close the employment gap.
2. Professional Disclosure
Job Title: Director of Programmes
Organisation/Sector: PaTiTi Consulting | Skills Development / Social Impact
Exploratory Data Analysis (EDA): As Director of Programmes, understanding who is enrolling in the programme before making design decisions is foundational. EDA allows me to monitor whether the programme is reaching its intended demographic — young, low-income Lagos residents — and to detect shifts in participant profiles across batches that would require a programmatic response. Without this diagnostic layer, decisions on recruitment targeting and resource allocation are based on assumption rather than evidence.
Visualisation: Reporting to funders, government partners, and the board requires communicating programme performance clearly and efficiently. Visualisation of placement rates by training category, gender, and LGA gives me a dashboard-ready view that directly informs which tracks to scale, discontinue, or redesign in future batches. It also enables me to present evidence to partners in a form that drives decisions rather than simply informs them.
Hypothesis Testing: A recurring operational question I face is whether observed differences in outcomes — for example, whether mentored participants place at higher rates than unmentored ones, or whether one training category outperforms another — are statistically real or simply sampling noise. Hypothesis testing provides the rigorous basis needed to act on those differences, justify resource reallocation, and defend programmatic changes to stakeholders.
Correlation Analysis: As Director, I allocate mentoring resources and decide which participant segments receive intensive support. Correlation analysis between age, educational level, training duration, and employment outcomes allows me to test whether those targeting assumptions are evidence-based or inherited from previous programme designs. Understanding which variables move together informs more precise targeting in future batches.
Logistic Regression: The most consequential decision I make each cycle is which applicant profiles to prioritise during batch recruitment. A logistic regression model identifying the strongest predictors of active employment provides a defensible, data-driven scoring framework for future selection decisions — one that can be presented to government partners and donors as evidence of systematic programme improvement.
3. Data Collection & Sampling
Source: Employment Skills Programme (ESP) internal monitoring and evaluation (M&E) database, maintained by the Directorate of Programmes.
Collection Method: The dataset was extracted directly from the programme’s management information system (MIS) by the Director of Programmes. No third-party data was used. The extract covers all enrolled participants across Batch 1 through Batch 4, including a sub-cohort designated Batch 4.2.
Sampling Frame: The full population of ESP participants registered between June 2021 and November 2022. This is a census extract, not a sample — every enrolled participant is represented, making it a complete administrative dataset.
Sample Size: 2,006 observations (participants) across 25 variables.
Time Period: June 2021 to November 2022 (approximately 17 months), covering four programme batches.
Variables: The dataset includes batch identifier, disability status (PWD), mentoring status, enrollment ID, qualification, educational level, age, age bracket, returnee status, LGA of residence, marital status, gender, training category and sub-category, programme start and end dates, training status, placement status, company placed with, reason if declined, date of employment, employment type, employment status, and job formality classification.
Ethical & Consent Notes: All participant data was collected with informed consent as part of the programme enrolment process. Personally identifiable information (names, national ID numbers, phone numbers) has been removed from this analytical dataset. Enrollment IDs serve as anonymised participant identifiers. The dataset is stored on an encrypted organisational drive and is not shared beyond the immediate analytical team. This analysis has been approved for academic purposes under the programme’s data governance policy.
Data Quality Issues Identified:
1. Missing Employment Status (350 rows): Approximately 17% of participants have no recorded employment status. These are predominantly participants still within the programme cycle at the time of data extraction or those who became unreachable during follow-up. These records are excluded from the regression and hypothesis testing analyses but retained in EDA counts, with the exclusion documented.
2. Age Outlier: One participant is recorded as age 1, which is clearly a data entry error. This record is excluded from age-related analyses.
for col in ["Gender", "Educational_Level", "Training_Status", "Placement_Status", "Employment_Status", "Mentoring"]:
    print(f"\n{col}:")
    print(df[col].value_counts(dropna=False).to_string())
Gender:
Gender
Female 1233
Male 771
Educational_Level:
Educational_Level
Tertiary 1322
Secondary 682
Training_Status:
Training_Status
Certified 1869
Not Certified 121
Dropped 14
Placement_Status:
Placement_Status
Placed 1708
Not Available 139
Available 99
Self Employed 58
Employment_Status:
Employment_Status
Active 925
Inactive 729
NaN 350
Mentoring:
Mentoring
Mentored 1318
Not Mentored 686
Interpretation: The dataset covers 2,006 participants with a mean age of approximately 30 years, reflecting a young workforce target population. Training programmes ran for an average of 57 days (range: 17–109 days), indicating variability across training tracks. Female participants (62%) outnumber male participants (38%). The majority hold tertiary qualifications (66%) and 93% achieved certification. However, only about 56% of those with a recorded employment status (46% of all participants) are actively employed at the time of data extraction, pointing to a significant post-placement retention or engagement challenge that this analysis seeks to understand.
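The headline shares in this interpretation can be recomputed directly from the frequency tables printed above. The following is a minimal sketch in plain Python: the counts are copied from that output, while the helper name `share` is an illustrative assumption and not part of the original notebook.

```python
# Counts copied from the value_counts() output printed above.
employment = {"Active": 925, "Inactive": 729, "Missing": 350}
gender = {"Female": 1233, "Male": 771}

def share(part, whole):
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)

total = sum(employment.values())            # all rows in the printed tables
recorded = total - employment["Missing"]    # rows with a recorded status

# The same count of active participants gives two different rates,
# depending on which denominator is used.
active_of_recorded = share(employment["Active"], recorded)
active_of_all = share(employment["Active"], total)
female_share = share(gender["Female"], sum(gender.values()))

print(f"Rows in printed tables: {total}")
print(f"Active, among recorded statuses: {active_of_recorded}%")
print(f"Active, among all participants:  {active_of_all}%")
print(f"Female share: {female_share}%")
```

The choice of denominator matters: the active rate is noticeably higher among participants with a recorded status than among all participants. The printed tables also sum to 2,004 rows, two fewer than the 2,006 participants cited in the text; the source does not explain the difference.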
5. Analysis — Technique 1: Exploratory Data Analysis
Theory: Exploratory Data Analysis (EDA) is the practice of summarising a dataset’s main characteristics, often using statistical and visual methods, before formal modelling. It is grounded in the tradition established by Tukey (1977) and is used to detect patterns, anomalies, and relationships that guide subsequent analysis. EDA is particularly important for administrative datasets where data quality issues are common and assumptions about the population may not hold.
Business Justification: As Director of Programmes, EDA provides the foundation for all downstream decisions. Before allocating resources, adjusting training tracks, or redesigning batch recruitment, I need to know who is actually in the programme, how they are distributed across demographic categories, and where the data has gaps. EDA surfaces these facts systematically rather than relying on anecdotal reports from field staff.
library(ggplot2)
library(scales)

# Age distribution
p1 <- ggplot(df %>% filter(!is.na(Age)), aes(x = Age)) +
  geom_histogram(binwidth = 5, fill = "#2C7BB6", colour = "white", alpha = 0.85) +
  geom_vline(aes(xintercept = mean(Age, na.rm = TRUE)),
             colour = "#D7191C", linetype = "dashed", linewidth = 1) +
  labs(title = "Age Distribution of ESP Participants",
       subtitle = "Dashed line = mean age (≈30 years)",
       x = "Age (years)", y = "Count") +
  theme_minimal(base_size = 13)
print(p1)

# Certification rate by batch
cert_batch <- df %>%
  group_by(Batch, Training_Status) %>%
  summarise(n = n(), .groups = "drop") %>%
  group_by(Batch) %>%
  mutate(pct = n / sum(n) * 100)

p2 <- ggplot(cert_batch, aes(x = Batch, y = pct, fill = Training_Status)) +
  geom_col(position = "stack") +
  scale_fill_manual(values = c("Certified" = "#1A9641",
                               "Not Certified" = "#FDAE61",
                               "Dropped" = "#D7191C")) +
  labs(title = "Training Status by Batch",
       x = "Batch", y = "Percentage (%)", fill = "Status") +
  theme_minimal(base_size = 13)
print(p2)

# Placement status distribution
placement_counts <- df %>%
  count(Placement_Status) %>%
  mutate(pct = n / sum(n) * 100)

p3 <- ggplot(placement_counts,
             aes(x = reorder(Placement_Status, -n), y = pct, fill = Placement_Status)) +
  geom_col(show.legend = FALSE) +
  scale_fill_manual(values = c("Placed" = "#1A9641",
                               "Not Available" = "#D7191C",
                               "Available" = "#FDAE61",
                               "Self Employed" = "#2C7BB6")) +
  geom_text(aes(label = paste0(round(pct, 1), "%")), vjust = -0.5, size = 4) +
  labs(title = "Placement Status of ESP Participants",
       x = "Placement Status", y = "Percentage (%)") +
  theme_minimal(base_size = 13)
print(p3)

# Employment status breakdown
emp_counts <- df %>%
  filter(!is.na(Employment_Status)) %>%
  count(Employment_Status) %>%
  mutate(pct = n / sum(n) * 100)

p4 <- ggplot(emp_counts, aes(x = Employment_Status, y = pct, fill = Employment_Status)) +
  geom_col(show.legend = FALSE) +
  scale_fill_manual(values = c("Active" = "#1A9641", "Inactive" = "#D7191C")) +
  geom_text(aes(label = paste0(round(pct, 1), "%")), vjust = -0.5, size = 4) +
  labs(title = "Employment Status Among Placed Participants",
       subtitle = "350 participants with no recorded employment status excluded",
       x = "Employment Status", y = "Percentage (%)") +
  theme_minimal(base_size = 13)
print(p4)
import matplotlib.pyplot as plt
import matplotlib.ticker as mtick

fig, axes = plt.subplots(2, 2, figsize=(14, 10))
fig.suptitle("EDA — ESP Programme Overview", fontsize=15, fontweight="bold", y=1.01)

# Age distribution
ax = axes[0, 0]
age_clean = df["Age"].dropna()
ax.hist(age_clean, bins=15, color="#2C7BB6", edgecolor="white", alpha=0.85)
ax.axvline(age_clean.mean(), color="#D7191C", linestyle="--", linewidth=1.5)
ax.set_title("Age Distribution")
ax.set_xlabel("Age (years)")
ax.set_ylabel("Count")

# Certification by Batch
ax = axes[0, 1]
cert_pivot = pd.crosstab(df["Batch"], df["Training_Status"], normalize="index") * 100
cert_pivot[["Certified", "Not Certified", "Dropped"]].plot(
    kind="bar", stacked=True, ax=ax, color=["#1A9641", "#FDAE61", "#D7191C"])
ax.set_title("Training Status by Batch (%)")
ax.set_xlabel("Batch")
ax.set_ylabel("Percentage")
ax.legend(loc="lower right", fontsize=8)
ax.tick_params(axis="x", rotation=30)

# Placement Status
ax = axes[1, 0]
pl_counts = df["Placement_Status"].value_counts()
pl_pct = pl_counts / pl_counts.sum() * 100
colors = {"Placed": "#1A9641", "Not Available": "#D7191C",
          "Available": "#FDAE61", "Self Employed": "#2C7BB6"}
bars = ax.bar(pl_pct.index, pl_pct.values,
              color=[colors.get(x, "grey") for x in pl_pct.index])
for bar, val in zip(bars, pl_pct.values):
    ax.text(bar.get_x() + bar.get_width() / 2, bar.get_height() + 0.5,
            f"{val:.1f}%", ha="center", fontsize=9)
ax.set_title("Placement Status (%)")
ax.set_xlabel("Status")
ax.set_ylabel("Percentage")
ax.tick_params(axis="x", rotation=15)

# Employment Status
ax = axes[1, 1]
emp = df["Employment_Status"].dropna().value_counts()
emp_pct = emp / emp.sum() * 100
bars = ax.bar(emp_pct.index, emp_pct.values,
              color=["#1A9641" if x == "Active" else "#D7191C" for x in emp_pct.index])
for bar, val in zip(bars, emp_pct.values):
    ax.text(bar.get_x() + bar.get_width() / 2, bar.get_height() + 0.5,
            f"{val:.1f}%", ha="center", fontsize=9)
ax.set_title("Employment Status (Recorded Only)")
ax.set_xlabel("Status")
ax.set_ylabel("Percentage")

plt.tight_layout()
plt.savefig("eda_overview.png", dpi=150, bbox_inches="tight")
plt.show()
print("EDA plots rendered.")
EDA plots rendered.
Interpretation: The EDA reveals four critical programme facts. First, the participant population is young (mean age ≈ 30 years), consistent with the programme’s youth employment mandate, though ages range up to 66, suggesting some adult re-skilling participation. Second, certification rates are high across all batches (above 90%), indicating that training delivery is effective at getting participants to completion. Third, about 85% of participants are placed and a further 3% are self-employed, leaving roughly 12% unplaced (7% not available and 5% available but not yet placed), so the placement infrastructure is functioning. Fourth, the more pressing issue is retention: among those with a recorded employment status, about 56% are active and 44% are inactive. This active employment gap is the central problem this analysis seeks to explain.
6. Analysis — Technique 2: Visualisation
Theory: Data visualisation translates analytical findings into perceptual representations that allow patterns to be understood more rapidly than through tables alone. Effective visualisation for programme analytics follows the principle of encoding the most important comparison as the primary visual channel (Cleveland & McGill, 1984). For categorical outcome data such as employment status, bar charts with percentage encoding are the most reliable form.
Business Justification: As Director of Programmes, visualisation is the primary tool for communicating performance to funders, government partners, and the board. The five plots below are designed to answer the five most frequently asked questions in programme review meetings: which training track performs best, does gender affect outcomes, does education level matter, which LGAs are underperforming, and has performance improved across batches?
Interpretation: Five patterns emerge with direct programmatic implications. First, Information Technology and Business Support participants achieve the highest active employment rates, while Construction and Beauty trail significantly — suggesting a structural mismatch between placements in those sectors and sustained employer demand. Second, gender differences in employment rates are present but modest, indicating that the programme broadly serves both groups without major disparity. Third, tertiary-educated participants achieve materially higher active employment rates than secondary-educated ones — a gap that mentoring or additional bridging support could address. Fourth, LGA-level variation is substantial, pointing to geographic factors (transport, network access, employer concentration) that batch design should account for. Fifth, the batch-level trend shows whether programme improvements over time are translating into better outcomes — a key metric for any programme review.
7. Analysis — Technique 3: Hypothesis Testing
Theory: Hypothesis testing provides a formal statistical framework for determining whether observed differences between groups are likely to reflect true population differences or could plausibly arise from random sampling variation. For categorical variables (such as employment status vs. gender or training category), the Chi-square test of independence is appropriate. It tests whether two categorical variables are statistically independent (Agresti, 2002). Provided the Chi-square assumption of expected cell counts ≥ 5 is met, the p-value gives the probability of observing an association at least as strong as the one in the data if the null hypothesis (independence) were true. Effect size is measured using Cramér’s V, which ranges from 0 (no association) to 1 (perfect association).
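To make the mechanics of the test concrete, the sketch below works through a Chi-square test of independence by hand on a small 2×2 table. The counts are hypothetical and are not drawn from the ESP data; the variable names are illustrative assumptions. The closed-form p-value used here holds only for one degree of freedom.

```python
import math

# Hypothetical 2x2 table (NOT the ESP data): rows = group, columns = outcome.
observed = [[60, 40],   # group A: active, inactive
            [30, 70]]   # group B: active, inactive

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected count under independence: row total * column total / n.
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Pearson chi-square statistic, without continuity correction.
chi2 = sum((o - e) ** 2 / e
           for obs_row, exp_row in zip(observed, expected)
           for o, e in zip(obs_row, exp_row))

dof = (len(observed) - 1) * (len(observed[0]) - 1)

# With 1 degree of freedom, the upper-tail p-value has a closed form.
p_value = math.erfc(math.sqrt(chi2 / 2))

# Cramér's V, with k = min(number of rows, number of columns).
k = min(len(observed), len(observed[0]))
cramers_v = math.sqrt(chi2 / (n * (k - 1)))

print(f"chi2 = {chi2:.3f}, df = {dof}, p = {p_value:.2e}, V = {cramers_v:.3f}")
```

One design point worth noting: `scipy.stats.chi2_contingency` applies Yates' continuity correction by default when the table has one degree of freedom, so for a 2×2 table it will report a slightly smaller statistic than this uncorrected calculation.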
Business Justification: Observed differences in placement and employment rates across training categories, gender, and mentoring status could simply be due to batch composition. As Director of Programmes, I need statistical confirmation before reallocating training budgets or restructuring the mentoring programme. Hypothesis testing provides that confirmation.
library(vcd)  # for Cramer's V

df_test <- df %>% filter(!is.na(Employment_Status))

# --- Hypothesis 1: Training Category vs Employment Status ---
cat("=== HYPOTHESIS 1 ===\n")
=== HYPOTHESIS 1 ===
cat("H0: Training category and employment status are independent\n")
H0: Training category and employment status are independent
cat("H1: Training category significantly affects employment status\n\n")
H1: Training category significantly affects employment status
from scipy.stats import chi2_contingency
import pandas as pd
import numpy as np

df_test = df[df["Employment_Status"].notna()].copy()

def cramers_v(chi2, n, k):
    return np.sqrt(chi2 / (n * (k - 1)))

hypotheses = [
    ("Training_Category", "Employment_Status", "H1: Training category vs Employment status"),
    ("Mentoring", "Employment_Status", "H2: Mentoring status vs Employment status"),
    ("Educational_Level", "Employment_Status", "H3: Educational level vs Employment status"),
]

results = []
for var1, var2, label in hypotheses:
    ct = pd.crosstab(df_test[var1], df_test[var2])
    chi2, p, dof, expected = chi2_contingency(ct)
    n = ct.values.sum()
    k = min(ct.shape)
    v = cramers_v(chi2, n, k)
    decision = "Reject H0" if p < 0.05 else "Fail to Reject H0"
    results.append({
        "Hypothesis": label,
        "Chi2": round(chi2, 3),
        "df": dof,
        "p-value": round(p, 4),
        "Cramer's V": round(v, 4),
        "Decision": decision,
    })
    print(f"\n{label}")
    print(f" H0: The two variables are independent")
    print(f" Chi2 = {chi2:.3f}, df = {dof}, p = {p:.4f}")
    print(f" Cramer's V = {v:.4f}")
    print(f" Decision: {decision}")
H1: Training category vs Employment status
H0: The two variables are independent
Chi2 = 187.759, df = 5, p = 0.0000
Cramer's V = 0.3369
Decision: Reject H0
H2: Mentoring status vs Employment status
H0: The two variables are independent
Chi2 = 673.990, df = 1, p = 0.0000
Cramer's V = 0.6384
Decision: Reject H0
H3: Educational level vs Employment status
H0: The two variables are independent
Chi2 = 18.045, df = 1, p = 0.0000
Cramer's V = 0.1045
Decision: Reject H0
Hypothesis                                    Chi2      df   p-value   Cramer's V   Decision
H1: Training category vs Employment status    187.759   5    0.0       0.3369       Reject H0
H2: Mentoring status vs Employment status     673.990   1    0.0       0.6384       Reject H0
H3: Educational level vs Employment status    18.045    1    0.0       0.1045       Reject H0
Interpretation: The null hypothesis of independence is rejected at the 5% significance level in all three tests (p < 0.001 in each case), so the observed differences are unlikely to be due to chance alone. Mentoring status has the strongest association with employment status (Cramér’s V ≈ 0.64), providing statistical support for expanding the mentoring programme rather than treating it as optional. Training category shows a moderate association (V ≈ 0.34), confirming that the track a participant is assigned to is materially related to their long-term employment outcome, not just their placement rate. Educational level shows a significant but much weaker association (V ≈ 0.10), suggesting that the programme’s training content partially compensates for lower entry qualifications, but not entirely. Business action: These results justify a formal policy of universal mentoring coverage and a review of Construction and Beauty track curricula and employer partnerships, based on statistical evidence rather than field observation.
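The effect sizes in the printed output can be reproduced from the reported Chi-square statistics alone, which is a useful check on how the three associations rank. The sketch below assumes n = 1,654 (the 925 active plus 729 inactive participants with a recorded status) and k = 2, since employment status has two levels; the dictionary names are illustrative.

```python
import math

n = 925 + 729   # participants with a recorded employment status
k = 2           # employment status has two levels, so min(rows, cols) = 2

# Chi-square statistics copied from the printed test output.
reported_chi2 = {
    "Training category": 187.759,
    "Mentoring status": 673.990,
    "Educational level": 18.045,
}

# Cramér's V = sqrt(chi2 / (n * (k - 1)))
cramers_v = {name: math.sqrt(chi2 / (n * (k - 1)))
             for name, chi2 in reported_chi2.items()}

# Rank the three associations from strongest to weakest.
ranked = sorted(cramers_v, key=cramers_v.get, reverse=True)
for name in ranked:
    print(f"{name}: V = {cramers_v[name]:.3f}")
```

Under these assumptions, the ranking places mentoring status first, training category second, and educational level third, in line with the Cramér's V values in the printed output.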
8. Analysis — Technique 4: Correlation Analysis
Theory: Correlation analysis measures the strength and direction of linear relationships between variables. Pearson’s correlation coefficient (r) is used for continuous variables and ranges from −1 (perfect negative relationship) to +1 (perfect positive relationship). Point-biserial correlation is used where one variable is binary (Adi, 2026). A correlation matrix with heatmap allows simultaneous inspection of all pairwise relationships and is standard practice in programme analytics for identifying multicollinearity and key drivers of an outcome variable.
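The link between the two coefficients can be shown in a few lines: the point-biserial correlation is numerically the same as Pearson's r when the binary variable is coded 0/1. The sketch below uses only the standard library and a small made-up sample (not ESP data); the function names are illustrative assumptions.

```python
import math

# Hypothetical illustration data (NOT from the ESP dataset).
active = [1, 0, 1, 1, 0, 0, 1, 0]           # binary outcome coded 0/1
age = [24, 31, 27, 22, 35, 29, 26, 33]      # continuous variable

def pearson_r(x, y):
    """Pearson correlation computed from deviations about the means."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def point_biserial(binary, values):
    """Point-biserial r from group means and the population SD of values."""
    n = len(values)
    g1 = [v for b, v in zip(binary, values) if b == 1]
    g0 = [v for b, v in zip(binary, values) if b == 0]
    mean_all = sum(values) / n
    sd = math.sqrt(sum((v - mean_all) ** 2 for v in values) / n)
    m1, m0 = sum(g1) / len(g1), sum(g0) / len(g0)
    return (m1 - m0) / sd * math.sqrt(len(g1) * len(g0) / n ** 2)

r = pearson_r(active, age)
r_pb = point_biserial(active, age)
print(f"Pearson r = {r:.4f}, point-biserial r = {r_pb:.4f}")
```

Because the two agree, a single correlation matrix over 0/1 dummy variables and continuous variables can be read with one consistent interpretation.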
Business Justification: As Director of Programmes, understanding which participant characteristics co-vary — and which co-vary with the employment outcome — informs both targeting decisions and the design of intake assessments. If age and educational level are correlated, for example, segmenting by one may already capture variation in the other, and separate interventions may not be needed.
# Print top correlations with Active employment
corr_active <- sort(corr_matrix["Active", ], decreasing = TRUE)
corr_active_df <- data.frame(
  Variable = names(corr_active),
  Correlation_with_Active = round(corr_active, 4)
) %>%
  filter(Variable != "Active")

corr_active_df %>%
  kable(caption = "Correlation of All Variables with Active Employment Status") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Correlation of All Variables with Active Employment Status
Interpretation: The three most meaningful correlations with active employment status are: (1) Mentored (positive) — mentored participants are more likely to be actively employed, reinforcing the hypothesis testing result; (2) Tertiary education (positive) — higher educational attainment correlates with sustained employment; and (3) Placed (positive) — being formally placed rather than self-employed or unavailable is a prerequisite for active employment, as expected. Age shows a near-zero correlation with active employment, indicating that the programme serves both younger and older participants with similar effectiveness and that age-based targeting is not warranted. Duration of training shows a weak positive correlation, suggesting that longer programmes may lead to marginally better outcomes — though the effect is small and should not be the basis for across-the-board programme extension without further analysis.
9. Analysis — Technique 5: Logistic Regression
Theory: Logistic regression is a classification technique that models the probability of a binary outcome — in this case, active employment (1) versus inactive (0) — as a function of predictor variables (Adi, 2026). Unlike linear regression, logistic regression uses the logit link function to constrain predicted probabilities between 0 and 1. Model coefficients are interpreted as log-odds; exponentiated coefficients (odds ratios) are more intuitive for a business audience. Model performance is assessed using the confusion matrix, classification accuracy, and the Area Under the ROC Curve (AUC), where AUC = 0.5 indicates no discriminatory power and AUC = 1 indicates perfect discrimination.
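The three ideas in this paragraph (the logit link, odds ratios, and AUC) can be illustrated without fitting a model. The sketch below is a hedged, standard-library example: the intercept and coefficients are hypothetical values chosen for illustration, not estimates from the ESP data, and the helper names `sigmoid`, `predict`, and `auc` are assumptions.

```python
import math

def sigmoid(z):
    """Inverse logit: maps a log-odds value to a probability in (0, 1)."""
    return 1 / (1 + math.exp(-z))

# Hypothetical coefficients on the log-odds scale (NOT fitted to ESP data).
intercept = -0.5
coef = {"mentored": 1.2, "tertiary": 0.4}

# Exponentiating a coefficient gives the odds ratio for that predictor.
odds_ratios = {name: math.exp(b) for name, b in coef.items()}

def predict(mentored, tertiary):
    """Predicted probability of active employment for one profile."""
    z = intercept + coef["mentored"] * mentored + coef["tertiary"] * tertiary
    return sigmoid(z)

def auc(y_true, scores):
    """Share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

# Tiny made-up evaluation set: (mentored, tertiary, actually active).
rows = [(1, 1, 1), (1, 0, 1), (0, 1, 0), (0, 0, 0), (1, 0, 0), (0, 1, 1)]
scores = [predict(m, t) for m, t, _ in rows]
labels = [y for _, _, y in rows]

print({name: round(v, 2) for name, v in odds_ratios.items()})
print(f"AUC on the toy set: {auc(labels, scores):.3f}")
```

In this illustration a coefficient of 1.2 corresponds to an odds ratio of about 3.3, meaning the odds of active employment are roughly 3.3 times higher for that group with other predictors held fixed. In practice the coefficients would come from a fitted model such as scikit-learn's `LogisticRegression`.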
Business Justification: As Director of Programmes, the most consequential decision I make is which applicant profiles to prioritise during batch recruitment. A logistic regression model that identifies the statistically significant predictors of active employment provides a defensible, replicable scoring framework for future selection decisions — one that can be presented to government partners and donors as evidence of data-driven programme management.
Interpretation: The logistic regression model achieves reasonable discriminatory performance. The three most significant predictors of active employment are: (1) Mentoring — mentored participants have substantially higher odds of being actively employed, all else equal. This is the single strongest actionable predictor, since mentoring is a programme-controlled variable. (2) Tertiary education — tertiary-educated participants have higher odds of active employment compared to secondary-educated participants, consistent with broader labour market evidence. (3) Training Category — relative to Business Support (the reference category), certain tracks show significantly lower odds of active employment, identifying specific tracks for curriculum and employer partnership review. Age and training duration are not statistically significant predictors once the other variables are controlled for, meaning they should not be used as selection criteria. Business action: Universal mentoring coverage should be the first policy change implemented, as it is the only significant predictor that the programme fully controls.
10. Integrated Findings
The five analyses conducted in this report converge on a coherent and actionable narrative. The ESP programme successfully certifies the vast majority of its participants (93%) and places most of them (85%), which represents strong operational performance at the training and initial placement stages. However, the active employment rate, the true measure of programme impact, stands at approximately 56% among those with a recorded employment status. This gap between placement and sustained employment is the central challenge, and the analyses collectively identify its drivers.
EDA established the baseline: the programme serves a young, predominantly female, largely tertiary-educated participant population distributed across Lagos LGAs, with training tracks ranging from Hospitality to Information Technology. Visualisation revealed that active employment rates vary significantly by training category, with IT and Business Support leading and Construction and Beauty lagging. Hypothesis testing confirmed statistically that training category, mentoring status, and educational level all have significant and non-random associations with employment status. Correlation analysis showed that mentoring and tertiary education are the variables most positively correlated with active employment, while age and training duration have negligible relationships with the outcome. Logistic regression identified mentoring as the single strongest programme-controlled predictor, with tertiary education and training category also contributing significantly.
Integrated Recommendation: As Director of Programmes, I recommend three data-driven changes for the next batch cycle. First, universal mentoring coverage should be mandated — the statistical evidence is unambiguous that mentored participants sustain employment at higher rates, and the current 66% mentoring coverage leaves a third of participants without the most effective support available. Second, training track investment should be rebalanced toward IT and Business Support, where active employment rates are highest and employer demand is demonstrably more durable. Third, a bridging support module should be designed specifically for secondary-educated participants to close the educational level gap in employment outcomes that persists even after controlling for other factors.
11. Limitations & Further Work
Sample limitations: The dataset covers a single programme (ESP) in a single state (Lagos) over a 17-month window. Findings may not generalise to other states, different programme structures, or post-2022 labour market conditions which may have been affected by macroeconomic shifts.
Missing employment status data: 350 participants (17%) have no recorded employment status, likely due to post-placement tracking attrition. If participants with missing employment status are systematically different from those with recorded status (for example, if harder-to-reach participants are disproportionately inactive), the true active employment rate may be lower than the 56% observed in the complete cases. Future programme iterations should invest in automated follow-up tools (SMS, USSD) to improve tracking coverage.
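A simple way to size this risk is to bound the active employment rate under the two extreme assumptions about the missing records. The sketch below uses the counts from the printed frequency table; it is a sensitivity check under stated assumptions, not an estimate of the true rate.

```python
# Counts from the printed Employment_Status frequency table.
active, inactive, missing = 925, 729, 350
total = active + inactive + missing

# Complete-case estimate: ignores the missing records entirely.
observed_rate = active / (active + inactive)

# Extreme scenarios for the 350 participants with no recorded status.
lower_bound = active / total               # every missing record is inactive
upper_bound = (active + missing) / total   # every missing record is active

print(f"Complete-case active rate:         {observed_rate:.1%}")
print(f"Worst case (all missing inactive): {lower_bound:.1%}")
print(f"Best case (all missing active):    {upper_bound:.1%}")
```

The width of the interval between the two bounds shows how much of the headline rate depends on the untracked participants, which is the practical argument for improving follow-up coverage.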
Observational design: This is a cross-sectional observational dataset, not a randomised experiment. The positive association between mentoring and active employment could partly reflect selection effects — if programme staff assign mentors to participants already perceived as more motivated or capable, the mentoring coefficient in the regression overstates the causal effect of mentoring itself. A randomised mentoring allocation in a pilot batch would provide cleaner causal evidence.
Single outcome variable: Active vs. inactive employment is a binary measure that does not capture income level, job quality, career progression, or alignment between training received and job performed. Future M&E design should incorporate salary data, job-skill match ratings, and six-month and twelve-month follow-up points.
Further work: With more time and computing resources, a survival analysis (Kaplan-Meier or Cox proportional hazards) modelling time to inactivity would be more informative than a binary active/inactive snapshot. Additionally, a random forest or gradient boosting model could capture non-linear interactions between participant characteristics that logistic regression misses.
References
Adi, B. (2026). Data analytics for business decision-making. Lagos Business School Press.
Agresti, A. (2002). Categorical data analysis (2nd ed.). Wiley-Interscience.
Cleveland, W. S., & McGill, R. (1984). Graphical perception: Theory, experimentation, and application to the development of graphical methods. Journal of the American Statistical Association, 79(387), 531–554. https://doi.org/10.2307/2288400
Tukey, J. W. (1977). Exploratory data analysis. Addison-Wesley.
R packages used:
R Core Team. (2024). R: A language and environment for statistical computing (Version 4.x). R Foundation for Statistical Computing. https://www.R-project.org/
Wickham, H., Averick, M., Bryan, J., Chang, W., McGowan, L. D., François, R., Grolemund, G., Hayes, A., Henry, L., Hester, J., Kuhn, M., Pedersen, T. L., Miller, E., Bache, S. M., Müller, K., Ooms, J., Robinson, D., Seidel, D. P., Spinu, V., … Yutani, H. (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686. https://doi.org/10.21105/joss.01686
Wei, T., & Simko, V. (2021). corrplot: Visualization of a correlation matrix (R package version 0.92). https://github.com/taiyun/corrplot
Robin, X., Turck, N., Hainard, A., Tiberti, N., Lisacek, F., Sanchez, J.-C., & Müller, M. (2011). pROC: An open-source package for R and S+ to analyse and compare ROC curves. BMC Bioinformatics, 12(1), 77. https://doi.org/10.1186/1471-2105-12-77
Python packages used:
McKinney, W. (2010). Data structures for statistical computing in Python. Proceedings of the 9th Python in Science Conference, 445, 51–56.
Hunter, J. D. (2007). Matplotlib: A 2D graphics environment. Computing in Science & Engineering, 9(3), 90–95. https://doi.org/10.1109/MCSE.2007.55
Waskom, M. L. (2021). Seaborn: Statistical data visualization. Journal of Open Source Software, 6(60), 3021. https://doi.org/10.21105/joss.03021
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., & Duchesnay, E. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12, 2825–2830.
Virtanen, P., Gommers, R., Oliphant, T. E., Haberland, M., Reddy, T., Cournapeau, D., Burovski, E., Peterson, P., Weckesser, W., Bright, J., van der Walt, S., Brett, M., Wilson, J., Millman, K. J., Mayorov, N., Nelson, A. R. J., Jones, E., Kern, R., Larson, E., … van Mulbregt, P. (2020). SciPy 1.0: Fundamental algorithms for scientific computing in Python. Nature Methods, 17, 261–272. https://doi.org/10.1038/s41592-019-0686-2
Dataset:
Dere, B. (2026). State Employment Skills Programme (ESP) monitoring and evaluation database, Batches 1–4 [Primary dataset]. PaTiTi Consulting.
Appendix: AI Usage Statement
Claude (Anthropic) was used in the preparation of this document for the following purposes: generating initial Quarto document structure and YAML configuration; suggesting appropriate R and Python package combinations for each analytical technique; and debugging rendering issues in the panel-tabset layout. GitHub Copilot was used for autocomplete assistance during code writing, particularly for ggplot2 and seaborn syntax.
Independent analytical judgement was exercised in all of the following: the selection of the research question and outcome variable; the decision to use logistic regression rather than a more complex model given the 10-day timeline and interpretability requirements; the choice of Business Support as the reference category in the regression; the identification and documentation of the missing employment status data quality issue; the interpretation of all statistical outputs in the context of programme operations; and the formulation of the three integrated recommendations. All business interpretations are the author’s own and reflect direct professional knowledge of the ESP programme.