Predicting Active Employment Outcomes in UpSkilling Programmes
Author
B. Dere
Published
May 20, 2026
1. Executive Summary
This study analyses data from various vocational training and apprenticeship initiatives that enrolled approximately 1,000 participants across multiple cohorts between 2021 and 2024. The central business question is: which participant profiles and training categories are most likely to result in active employment, and how can this knowledge make programme design, trainer allocation, and apprenticeship partnership decisions more data-driven?
Data were extracted from a pooled anonymised programme database covering 20 variables, including demographic characteristics, training track, career coaching status, certification outcome, apprenticeship status, and final employment status. Six training categories were covered: Business Support Services (BSS), Information and Communications Technology (ICT), Beauty (BTY), Construction (CTN), Fashion (FSN), and Hospitality (HSP). Five analytical techniques were applied: Exploratory Data Analysis (EDA), Visualisation, Hypothesis Testing, Correlation Analysis, and Logistic Regression. Key findings indicate that training category and educational level are the strongest structural predictors of active employment, while the association with career coaching assignment reflects a selection effect that warrants further causal investigation. Participants in ICT and BSS tracks and those with tertiary education consistently show higher rates of active employment. The integrated recommendation is that future cohorts should prioritise investment in ICT and BSS tracks and that a randomised career coaching pilot should be designed to establish the true causal effect of coaching.
2. Professional Disclosure
Job Title: Director of Programmes
Organisation/Sector: PC LTD | Skills Development / Social Impact
Exploratory Data Analysis (EDA): As Director of Programmes, understanding who is enrolling in the programme before making design decisions is foundational. EDA allows me to monitor whether the programme is reaching its intended demographic — young, low-income residents within the state — and to detect shifts in participant profiles across cohorts that would require a programmatic response. Without this diagnostic layer, decisions on recruitment targeting and resource allocation are based on assumption rather than evidence.
Visualisation: Reporting to funders, government partners, and the board requires communicating programme performance clearly and efficiently. Visualisation of apprenticeship rates by training category, gender, and location gives me a dashboard-ready view that directly informs which tracks to scale, discontinue, or redesign in future cohorts. It also enables me to present evidence to partners in a form that drives decisions rather than simply informs them.
Hypothesis Testing: A recurring operational question I face is whether observed differences in outcomes — for example, whether coached participants achieve employment at higher rates than uncoached ones, or whether one training category outperforms another — are statistically real or simply sampling noise. Hypothesis testing provides the rigorous basis needed to act on those differences, justify resource reallocation, and defend programmatic changes to stakeholders.
Correlation Analysis: As Director, I allocate career coaching resources and decide which participant segments receive intensive support. Correlation analysis between age, educational level, training duration, and employment outcomes allows me to test whether those targeting assumptions are evidence-based or inherited from previous programme designs. Understanding which variables move together informs more precise targeting in future cohorts.
Logistic Regression: The most consequential decision I make each cycle is which applicant profiles to prioritise during recruitment. A logistic regression model identifying the strongest predictors of active employment provides a defensible, data-driven scoring framework for future selection decisions — one that can be presented to government partners and donors as evidence of systematic programme improvement.
3. Data Collection & Sampling
Source: UpSkilling Programmes pooled anonymised monitoring and evaluation (M&E) database, maintained by the Directorate of Programmes.
Collection Method: The dataset was extracted from a pooled anonymised database spanning multiple skills development programme cohorts delivered across a state within Nigeria. No third-party data was used.
Sampling Frame: Participants registered across multiple programme cohorts between 2021 and 2024. A stratified random sub-sample of 1,000 participants was drawn to preserve the Active/Inactive employment ratio of the full population while protecting the identity of any individual cohort or project.
Sample Size: 1,000 observations (participants) across 20 variables.
Time Period: January 2021 to December 2024 (approximately 4 years).
Variables: The dataset includes career coaching status, enrollment ID, qualification, educational level, age, age bracket, location code, marital status, gender, training category, training sub-category, programme start and end dates, training status, apprenticeship status, reason if declined, date of employment, employment type, employment status, and job formality classification.
Training Category Codes: BSS = Business Support Services; ICT = Information and Communications Technology; BTY = Beauty; CTN = Construction; FSN = Fashion; HSP = Hospitality.
Training Sub-Category Codes: ABG = Agent Banking; BC = Bakery and Confectioneries; FSN = Fashion; DGM = Digital Marketing; ELC = Electrical; PLM = Plumbing; CSM = Cosmetology; HDR = Hair Dressing; TLG = Tiling; CSR = Customer Service; MSN = Masonry; S&M = Sales and Marketing.
Ethical & Consent Notes: The dataset used in this analysis is a randomly drawn sub-sample from a pooled anonymised database spanning multiple skills development programme cohorts delivered across a Nigerian state between 2021 and 2024. All personally identifiable information — including names, national identification numbers, phone numbers, and employer details — has been removed prior to extraction. Enrollment IDs are synthetic identifiers assigned for analytical purposes only and do not correspond to any original programme record. Geographic identifiers have been replaced with randomised location codes (e.g. LGA 14, LGA 80, LGA 144) that do not correspond to any named administrative unit and do not reveal the state or region of programme delivery. The pooled dataset design means that no single project, cohort, or organisation can be identified from the analytical extract. This analysis has been conducted in accordance with the Nigerian Data Protection Act 2023 (NDPA) and the data governance principles of the originating organisation.
Data Quality Issues Identified:
1. Missing Employment Status: Participants with no recorded employment status were excluded from the analytical sample. These are predominantly participants still within the programme cycle at the time of data extraction or those who became unreachable during follow-up.
2. Age Outlier: One participant is recorded as age 1, which is clearly a data entry error. This record is excluded from age-related analyses.
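The stratified sub-sampling described under Sampling Frame can be illustrated with a minimal sketch. It is not the extraction script used for this report: the function name `stratified_sample`, the population size of 10,000, and the record layout are assumptions made for illustration, and only the standard library is used so the block runs without the project's dependencies.

```python
import random
from collections import defaultdict


def stratified_sample(records, key, n, seed=0):
    """Draw about n records while preserving each stratum's share of `key`."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for rec in records:
        strata[rec[key]].append(rec)

    total = len(records)
    sample = []
    for group in strata.values():
        # Proportional allocation; rounding can shift the total by a record or two
        k = round(n * len(group) / total)
        sample.extend(rng.sample(group, k))
    return sample


# Hypothetical population: 5,600 Active and 4,400 Inactive records (illustrative only)
population = (
    [{"id": i, "Employment_Status": "Active"} for i in range(5600)]
    + [{"id": i, "Employment_Status": "Inactive"} for i in range(5600, 10000)]
)

subsample = stratified_sample(population, "Employment_Status", n=1000)
active_share = sum(r["Employment_Status"] == "Active" for r in subsample) / len(subsample)
print(len(subsample), round(active_share, 2))
```

Sampling within each stratum, rather than from the pooled list, is what keeps the Active/Inactive ratio of the sub-sample equal to that of the population.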
for col in ["Gender", "Educational_Level", "Training_Status", "Apprenticeship_Status",
            "Employment_Status", "Career_Coaching", "Training_Category", "Training_SubCategory"]:
    print(f"\n{col}:")
    print(df[col].value_counts(dropna=False).to_string())
Interpretation: The dataset covers approximately 1,000 participants with a mean age of approximately 30 years, reflecting a young workforce target population. Female participants (approximately 61%) outnumber male participants (39%). The majority hold tertiary qualifications and over 90% achieved certification. Approximately 56% of participants in the sample are actively employed at the time of data extraction, pointing to a significant post-apprenticeship retention or engagement challenge that this analysis seeks to understand. The six training categories — BSS, ICT, BTY, CTN, FSN, and HSP — span a broad range of vocational skills, with BSS being the largest track by enrolment.
5. Analysis — Technique 1: Exploratory Data Analysis
Theory: Exploratory Data Analysis (EDA) is the practice of summarising a dataset’s main characteristics, often using statistical and visual methods, before formal modelling. It is grounded in the tradition established by Tukey (1977) and is used to detect patterns, anomalies, and relationships that guide subsequent analysis. EDA is particularly important for administrative datasets where data quality issues are common and assumptions about the population may not hold.
Business Justification: As Director of Programmes, EDA provides the foundation for all downstream decisions. Before allocating resources, adjusting training tracks, or redesigning recruitment, I need to know who is actually in the programme, how they are distributed across demographic categories, and where the data has gaps. EDA surfaces these facts systematically rather than relying on anecdotal reports from field staff.
library(ggplot2)
library(scales)

# Age distribution
p1 <- ggplot(df %>% filter(!is.na(Age)), aes(x = Age)) +
  geom_histogram(binwidth = 5, fill = "#2C7BB6", colour = "white", alpha = 0.85) +
  geom_vline(aes(xintercept = mean(Age, na.rm = TRUE)),
             colour = "#D7191C", linetype = "dashed", linewidth = 1) +
  labs(title = "Age Distribution of USP Participants",
       subtitle = "Dashed line = mean age (≈30 years)",
       x = "Age (years)", y = "Count") +
  theme_minimal(base_size = 13)
print(p1)
# Apprenticeship status distribution
app_counts <- df %>%
  count(Apprenticeship_Status) %>%
  mutate(pct = n / sum(n) * 100)

p2 <- ggplot(app_counts,
             aes(x = reorder(Apprenticeship_Status, -n), y = pct,
                 fill = Apprenticeship_Status)) +
  geom_col(show.legend = FALSE) +
  geom_text(aes(label = paste0(round(pct, 1), "%")), vjust = -0.5, size = 4) +
  labs(title = "Apprenticeship Status of USP Participants",
       x = "Apprenticeship Status", y = "Percentage (%)") +
  theme_minimal(base_size = 13)
print(p2)
# Employment status breakdown
emp_counts <- df %>%
  filter(!is.na(Employment_Status)) %>%
  count(Employment_Status) %>%
  mutate(pct = n / sum(n) * 100)

p3 <- ggplot(emp_counts, aes(x = Employment_Status, y = pct, fill = Employment_Status)) +
  geom_col(show.legend = FALSE) +
  scale_fill_manual(values = c("Active" = "#1A9641", "Inactive" = "#D7191C")) +
  geom_text(aes(label = paste0(round(pct, 1), "%")), vjust = -0.5, size = 4) +
  labs(title = "Employment Status Among USP Participants",
       subtitle = "Participants with no recorded employment status excluded",
       x = "Employment Status", y = "Percentage (%)") +
  theme_minimal(base_size = 13)
print(p3)
# Training category distribution
cat_counts <- df %>%
  count(Training_Category) %>%
  mutate(pct = n / sum(n) * 100)

p4 <- ggplot(cat_counts, aes(x = reorder(Training_Category, pct), y = pct, fill = pct)) +
  geom_col(show.legend = FALSE) +
  geom_text(aes(label = paste0(round(pct, 1), "%")), hjust = -0.1, size = 4) +
  coord_flip() +
  scale_fill_gradient(low = "#FDAE61", high = "#2C7BB6") +
  labs(title = "Participant Distribution by Training Category",
       subtitle = "BSS=Business Support Services, ICT=Information & Comms Tech,\nBTY=Beauty, CTN=Construction, FSN=Fashion, HSP=Hospitality",
       x = "Training Category", y = "Percentage (%)") +
  theme_minimal(base_size = 13) +
  ylim(0, 35)
print(p4)
import matplotlib.pyplot as plt

fig, axes = plt.subplots(2, 2, figsize=(14, 10))
fig.suptitle("EDA — USP Programme Overview", fontsize=15, fontweight="bold", y=1.01)

# Age distribution
ax = axes[0, 0]
age_clean = df["Age"].dropna()
ax.hist(age_clean, bins=15, color="#2C7BB6", edgecolor="white", alpha=0.85)
ax.axvline(age_clean.mean(), color="#D7191C", linestyle="--", linewidth=1.5)
ax.set_title("Age Distribution")
ax.set_xlabel("Age (years)")
ax.set_ylabel("Count")

# Training Category distribution
ax = axes[0, 1]
cat_counts = df["Training_Category"].value_counts()
cat_pct = cat_counts / cat_counts.sum() * 100
bars = ax.barh(cat_pct.index, cat_pct.values, color="#2C7BB6")
for bar, val in zip(bars, cat_pct.values):
    ax.text(val + 0.3, bar.get_y() + bar.get_height() / 2,
            f"{val:.1f}%", va="center", fontsize=9)
ax.set_title("Training Category Distribution\n(BSS/ICT/BTY/CTN/FSN/HSP)")
ax.set_xlabel("Percentage (%)")

# Apprenticeship Status
ax = axes[1, 0]
pl_counts = df["Apprenticeship_Status"].value_counts()
pl_pct = pl_counts / pl_counts.sum() * 100
bars = ax.bar(pl_pct.index, pl_pct.values, color="#2C7BB6")
for bar, val in zip(bars, pl_pct.values):
    ax.text(bar.get_x() + bar.get_width() / 2, bar.get_height() + 0.5,
            f"{val:.1f}%", ha="center", fontsize=9)
ax.set_title("Apprenticeship Status (%)")
ax.set_xlabel("Status")
ax.set_ylabel("Percentage")
ax.tick_params(axis="x", rotation=15)

# Employment Status
ax = axes[1, 1]
emp = df["Employment_Status"].dropna().value_counts()
emp_pct = emp / emp.sum() * 100
bars = ax.bar(emp_pct.index, emp_pct.values,
              color=["#1A9641" if x == "Active" else "#D7191C" for x in emp_pct.index])
for bar, val in zip(bars, emp_pct.values):
    ax.text(bar.get_x() + bar.get_width() / 2, bar.get_height() + 0.5,
            f"{val:.1f}%", ha="center", fontsize=9)
ax.set_title("Employment Status (Recorded Only)")
ax.set_xlabel("Status")
ax.set_ylabel("Percentage")

plt.tight_layout()
plt.savefig("eda_overview.png", dpi=150, bbox_inches="tight")
plt.show()
print("EDA plots rendered.")
EDA plots rendered.
Interpretation: The EDA reveals four critical programme facts. First, the participant population is young (mean age ≈ 30 years), consistent with the programme’s youth employment mandate, though ages range up to 66, suggesting some adult re-skilling participation. Second, certification rates are high (above 90%), indicating that training delivery is effective at getting participants to completion. Third, the apprenticeship infrastructure is functioning well; the more pressing issue is retention. Among those with a recorded employment status, approximately 56% are active and 44% are inactive, and this active employment gap is the central problem this analysis seeks to explain. Fourth, BSS is the largest training track by enrolment, followed by CTN and FSN.
6. Analysis — Technique 2: Visualisation
Theory: Data visualisation translates analytical findings into perceptual representations that allow patterns to be understood more rapidly than through tables alone. Effective visualisation for programme analytics follows the principle of encoding the most important comparison as the primary visual channel (Cleveland & McGill, 1984). For categorical outcome data such as employment status, bar charts with percentage encoding are the most reliable form.
Business Justification: As Director of Programmes, visualisation is the primary tool for communicating performance to funders, government partners, and the board. The plots below are designed to answer the most frequently asked questions in programme review meetings: which training track performs best, does gender affect outcomes, does education level matter, and which locations are underperforming?
Interpretation: Four patterns emerge with direct programmatic implications. First, ICT and BSS participants achieve the highest active employment rates, while CTN and BTY trail significantly — suggesting a structural mismatch between apprenticeships in those sectors and sustained employer demand. Second, gender differences in employment rates are present but modest, indicating that the programme broadly serves both groups without major disparity. Third, tertiary-educated participants achieve materially higher active employment rates than secondary-educated ones — a gap that additional bridging support could address. Fourth, location-level variation is substantial, pointing to geographic factors (transport, network access, employer concentration) that programme design should account for.
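The quantity encoded in each bar of such charts is a group-level active employment rate. As a rough sketch of that aggregation step only, the helper below computes the percentage of Active participants per group and sorts the groups for plotting. The function name `active_rate_by` and the records labelled "Track A" and "Track B" are hypothetical and do not reproduce any figure from this report.

```python
from collections import defaultdict


def active_rate_by(records, group_key, outcome_key="Employment_Status"):
    """Return {group: % Active}, highest first, skipping records with no outcome."""
    counts = defaultdict(lambda: [0, 0])  # group -> [active, total]
    for rec in records:
        status = rec.get(outcome_key)
        if status is None:
            continue  # unrecorded employment status is excluded, as in the report
        counts[rec[group_key]][1] += 1
        if status == "Active":
            counts[rec[group_key]][0] += 1

    rates = {group: 100 * active / total for group, (active, total) in counts.items()}
    return dict(sorted(rates.items(), key=lambda kv: kv[1], reverse=True))


# Hypothetical records for illustration only
records = [
    {"Training_Category": "Track A", "Employment_Status": "Active"},
    {"Training_Category": "Track A", "Employment_Status": "Active"},
    {"Training_Category": "Track A", "Employment_Status": "Active"},
    {"Training_Category": "Track A", "Employment_Status": "Inactive"},
    {"Training_Category": "Track B", "Employment_Status": "Active"},
    {"Training_Category": "Track B", "Employment_Status": "Inactive"},
    {"Training_Category": "Track B", "Employment_Status": None},
]

rates = active_rate_by(records, "Training_Category")
for group, pct in rates.items():
    print(f"{group}: {pct:.1f}%")
```

The same helper could be called with a gender, education, or location key to produce the other comparisons discussed above.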
7. Analysis — Technique 3: Hypothesis Testing
Theory: Hypothesis testing provides a formal statistical framework for determining whether observed differences between groups are likely to reflect true population differences or could plausibly arise from random sampling variation. For categorical variables (such as employment status vs. training category or career coaching status), the Chi-square test of independence is appropriate. It tests whether two categorical variables are statistically independent (Agresti, 2002). Where the Chi-square assumption of expected cell counts ≥ 5 is met, the p-value gives the probability of observing a test statistic at least as extreme as the one obtained if the null hypothesis (independence) were true. Effect size is measured using Cramér’s V, which ranges from 0 (no association) to 1 (perfect association).
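To make the expected-count assumption and the Cramér's V definition concrete, the sketch below works through a small contingency table by hand. The 2×2 counts are hypothetical, and the statistic is the uncorrected Pearson Chi-square. SciPy's `chi2_contingency` applies Yates' continuity correction to 2×2 tables by default, so its value for the same table would be somewhat smaller.

```python
def expected_counts(table):
    """Expected cell counts under independence: row total * column total / n."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def chi_square(table):
    """Uncorrected Pearson Chi-square statistic."""
    expected = expected_counts(table)
    return sum(
        (obs - exp) ** 2 / exp
        for obs_row, exp_row in zip(table, expected)
        for obs, exp in zip(obs_row, exp_row)
    )


def cramers_v(table):
    """Cramér's V = sqrt(chi2 / (n * (k - 1))), with k = min(rows, columns)."""
    n = sum(sum(row) for row in table)
    k = min(len(table), len(table[0]))
    return (chi_square(table) / (n * (k - 1))) ** 0.5


# Hypothetical counts (rows: two groups; columns: Active, Inactive)
table = [[30, 10], [20, 40]]

smallest_expected = min(min(row) for row in expected_counts(table))
print("Smallest expected count:", smallest_expected)  # should be >= 5
print("Chi-square:", round(chi_square(table), 3))
print("Cramér's V:", round(cramers_v(table), 3))
```

Checking the smallest expected count before reading the p-value is a cheap guard; where it falls below 5, an exact test would usually be the safer choice.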
Business Justification: Observed differences in apprenticeship and employment rates across training categories, gender, and career coaching status could simply be due to cohort composition. As Director of Programmes, I need statistical confirmation before reallocating training budgets or restructuring the career coaching programme. Hypothesis testing provides that confirmation.
from scipy.stats import chi2_contingency
import pandas as pd
import numpy as np

df_test = df[df["Employment_Status"].notna()].copy()

def cramers_v(chi2, n, k):
    return np.sqrt(chi2 / (n * (k - 1)))

hypotheses = [
    ("Training_Category", "Employment_Status", "H1: Training category vs Employment status"),
    ("Career_Coaching", "Employment_Status", "H2: Career coaching status vs Employment status"),
    ("Educational_Level", "Employment_Status", "H3: Educational level vs Employment status"),
]

results = []
for var1, var2, label in hypotheses:
    ct = pd.crosstab(df_test[var1], df_test[var2])
    chi2, p, dof, expected = chi2_contingency(ct)
    n = ct.values.sum()
    k = min(ct.shape)
    v = cramers_v(chi2, n, k)
    decision = "Reject H0" if p < 0.05 else "Fail to Reject H0"
    results.append({
        "Hypothesis": label,
        "Chi2": round(chi2, 3),
        "df": dof,
        "p-value": round(p, 4),
        "Cramer's V": round(v, 4),
        "Decision": decision
    })
    print(f"\n{label}")
    print(f"  H0: The two variables are independent")
    print(f"  Chi2 = {chi2:.3f}, df = {dof}, p = {p:.4f}")
    print(f"  Cramer's V = {v:.4f}")
    print(f"  Decision: {decision}")
H1: Training category vs Employment status
H0: The two variables are independent
Chi2 = 119.975, df = 5, p = 0.0000
Cramer's V = 0.3465
Decision: Reject H0
H2: Career coaching status vs Employment status
H0: The two variables are independent
Chi2 = 403.183, df = 1, p = 0.0000
Cramer's V = 0.6353
Decision: Reject H0
H3: Educational level vs Employment status
H0: The two variables are independent
Chi2 = 15.330, df = 1, p = 0.0001
Cramer's V = 0.1239
Decision: Reject H0
Hypothesis                                        Chi2     df  p-value  Cramer's V  Decision
H1: Training category vs Employment status        119.975  5   0.0000   0.3465      Reject H0
H2: Career coaching status vs Employment status   403.183  1   0.0000   0.6353      Reject H0
H3: Educational level vs Employment status        15.330   1   0.0001   0.1239      Reject H0
Interpretation: All three null hypotheses of independence are rejected at the 5% significance level (p < 0.001 in each case), meaning the observed differences are unlikely to be attributable to sampling variation alone. Training category has a moderate association with employment status (Cramér’s V ≈ 0.35), indicating that the track a participant is assigned to is materially related to their long-term employment outcome. Career coaching status shows the strongest association (Cramér’s V ≈ 0.64), though the direction of this relationship, explored further in the correlation and regression analyses, points to a selection effect rather than a straightforward causal benefit. Educational level shows a significant but weaker association (Cramér’s V ≈ 0.12), suggesting that the programme’s training content partially, but not entirely, compensates for lower entry qualifications. Business action: These results justify a review of training track investment toward ICT and BSS, and a randomised career coaching pilot to establish true causal direction.
8. Analysis — Technique 4: Correlation Analysis
Theory: Correlation analysis measures the strength and direction of linear relationships between variables. Pearson’s correlation coefficient (r) is used for continuous variables and ranges from −1 (perfect negative relationship) to +1 (perfect positive relationship). Point-biserial correlation is used where one variable is binary and the other is continuous (Adi, 2026). A correlation matrix with a heatmap allows simultaneous inspection of all pairwise relationships and is standard practice in programme analytics for identifying multicollinearity and key drivers of an outcome variable.
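Point-biserial correlation is Pearson's r applied to a binary variable coded 0/1, which is why a single correlation matrix can mix binary indicators with continuous variables. The sketch below checks this equivalence numerically. The ages and active flags are hypothetical, and both helper functions are written out by hand for illustration rather than taken from any library.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def point_biserial(binary, values):
    """r_pb = (M1 - M0) / s * sqrt(p * (1 - p)), with s the population SD."""
    n = len(values)
    group1 = [v for b, v in zip(binary, values) if b == 1]
    group0 = [v for b, v in zip(binary, values) if b == 0]
    mean1, mean0 = sum(group1) / len(group1), sum(group0) / len(group0)
    mean_all = sum(values) / n
    sd = sqrt(sum((v - mean_all) ** 2 for v in values) / n)
    p = len(group1) / n
    return (mean1 - mean0) / sd * sqrt(p * (1 - p))


# Hypothetical data: participant ages and an Active (1) / Inactive (0) flag
age = [22, 25, 27, 30, 31, 35, 38, 41]
active = [1, 1, 0, 1, 0, 1, 0, 0]

r_pearson = pearson_r(active, age)
r_pb = point_biserial(active, age)
print(round(r_pearson, 4), round(r_pb, 4))  # the two values agree
```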
Business Justification: As Director of Programmes, understanding which participant characteristics co-vary — and which co-vary with the employment outcome — informs both targeting decisions and the design of intake assessments. If age and educational level are correlated, for example, segmenting by one may already capture variation in the other, and separate interventions may not be needed.
# Top correlations with Active employment
corr_active <- sort(corr_matrix["Active", ], decreasing = TRUE)
corr_active_df <- data.frame(
  Variable = names(corr_active),
  Correlation_with_Active = round(corr_active, 4)
) %>%
  filter(Variable != "Active")

corr_active_df %>%
  kable(caption = "Correlation of All Variables with Active Employment Status") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Correlation of All Variables with Active Employment Status
Interpretation: Female participants show the strongest positive correlation with active employment. Coached shows a strong negative correlation — likely reflecting selection bias, where career coaching is assigned to at-risk participants who, despite receiving support, remain less likely to achieve active employment. This does not mean career coaching is ineffective; it means the programme targets coaching at harder-to-place participants, which suppresses the raw correlation. Tertiary education and Age show weak negative associations, suggesting that within this programme population, demographic characteristics alone are poor predictors of employment outcome.
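The selection-bias argument can be illustrated with a small simulation. Every number here is an assumption chosen for illustration, not an estimate from the programme data: each simulated participant has a latent `risk` score, coaching is assigned only to those with risk above 0.6, and coaching is given a genuinely positive effect of +0.10 on the probability of active employment.

```python
import random

rng = random.Random(42)  # fixed seed so the sketch is reproducible

coached_outcomes, uncoached_outcomes = [], []
for _ in range(20000):
    risk = rng.random()          # hypothetical latent risk of a poor outcome, 0..1
    coached = risk > 0.6         # coaching targeted at higher-risk participants
    # Assumed outcome model: risk lowers P(active); coaching raises it by 0.10
    p_active = 0.9 - 0.7 * risk + (0.10 if coached else 0.0)
    is_active = rng.random() < p_active
    (coached_outcomes if coached else uncoached_outcomes).append(is_active)

coached_rate = sum(coached_outcomes) / len(coached_outcomes)
uncoached_rate = sum(uncoached_outcomes) / len(uncoached_outcomes)

print(f"Active rate, coached:   {coached_rate:.2f}")
print(f"Active rate, uncoached: {uncoached_rate:.2f}")
```

Under these assumptions the coached group shows the lower raw active rate even though coaching helps every individual who receives it, which is the pattern a randomised allocation would be designed to disentangle.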
9. Analysis — Technique 5: Logistic Regression
Theory: Logistic regression is a classification technique that models the probability of a binary outcome — in this case, active employment (1) versus inactive (0) — as a function of predictor variables (Adi, 2026). Unlike linear regression, logistic regression uses the logit link function to constrain predicted probabilities between 0 and 1. Model coefficients are interpreted as log-odds; exponentiated coefficients (odds ratios) are more intuitive for a business audience. Model performance is assessed using the confusion matrix, classification accuracy, and the Area Under the ROC Curve (AUC), where AUC = 0.5 indicates no discriminatory power and AUC = 1 indicates perfect discrimination.
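Two of the quantities named above can be computed directly, which may help when reading model output. The sketch below converts a log-odds coefficient into an odds ratio and computes AUC as the share of (active, inactive) pairs in which the active case receives the higher predicted score, counting ties as half. The coefficient, labels, and scores are hypothetical and are not outputs of the model in this report.

```python
from math import exp, log


def odds_ratio(coefficient):
    """Exponentiate a logistic regression coefficient (log-odds) to an odds ratio."""
    return exp(coefficient)


def inverse_logit(z):
    """Map a linear predictor z to a probability between 0 and 1."""
    return 1.0 / (1.0 + exp(-z))


def auc(labels, scores):
    """AUC as the probability that a positive case outranks a negative one."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in positives for q in negatives)
    return wins / (len(positives) * len(negatives))


# Hypothetical coefficient: a log-odds of ln(2) doubles the odds of active employment
print(round(odds_ratio(log(2)), 2))

# A linear predictor of 0 corresponds to a predicted probability of 0.5
print(inverse_logit(0.0))

# Hypothetical outcomes (1 = active) and predicted probabilities
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
print(round(auc(labels, scores), 3))
```

An odds ratio below 1 for a track, read against the reference category, would indicate lower odds of active employment than that reference, holding the other predictors fixed.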
Business Justification: As Director of Programmes, the most consequential decision I make is which applicant profiles to prioritise during recruitment. A logistic regression model that identifies the statistically significant predictors of active employment provides a defensible, replicable scoring framework for future selection decisions — one that can be presented to government partners and donors as evidence of data-driven programme management.
Interpretation: The model achieves strong discriminatory performance. The Coached coefficient is strongly negative, consistent with the correlation finding — reflecting selection bias rather than a causal negative effect of career coaching. Coaches are assigned to participants already identified as at risk of poor employment outcomes, which drives the negative coefficient. Training category is the strongest structural predictor: relative to BSS (the reference category), other tracks show varying odds of active employment. These training category effects are the most actionable finding for programme design decisions. Age and gender are not statistically significant predictors at the 5% level.
10. Integrated Findings
The five analyses conducted in this report converge on a coherent and actionable narrative. The programme successfully certifies the vast majority of its participants and matches most of them to apprenticeship opportunities, which represents strong operational performance at the training and initial placement stages. However, the active employment rate — the true measure of programme impact — stands at approximately 56% among those with a recorded employment status. This gap between apprenticeship matching and sustained employment is the central challenge, and the analyses collectively identify its drivers.
EDA established the baseline: the programme serves a young, predominantly female, largely tertiary-educated participant population distributed across multiple locations within the selected state, spanning six training tracks — BSS, ICT, BTY, CTN, FSN, and HSP. Visualisation revealed that active employment rates vary significantly by training category, with ICT and BSS leading and CTN and BTY lagging. Hypothesis testing confirmed statistically that training category, career coaching status, and educational level all have significant and non-random associations with employment status. Correlation analysis revealed that career coaching shows a strong negative correlation, consistent with a selection bias pattern — coaches are assigned to harder-to-place participants. Logistic regression confirmed training category as the strongest structural predictor of active employment, with the Coached coefficient reflecting selection bias rather than a causal negative effect.
Integrated Recommendation: As Director of Programmes, I recommend three data-driven changes for the next programme cycle. First, a randomised career coaching pilot should be designed to establish the true causal effect of coaching, separate from the selection bias observed in this dataset. Second, training track investment should be rebalanced toward ICT and BSS, where active employment rates are highest and employer demand appears more durable. Third, a bridging support module should be designed specifically for secondary-educated participants to close the educational level gap in employment outcomes that persists even after controlling for other factors.
11. Limitations & Further Work
Sample limitations: The dataset covers a pooled sample from programmes delivered within a selected state over a four-year window. Findings may not generalise to other states or to different programme structures, and labour market conditions during the window may themselves have been affected by macroeconomic shifts.
Missing employment status data: A proportion of participants had no recorded employment status, likely due to post-apprenticeship tracking attrition, and were excluded from the analytical sample. Future programme iterations should invest in automated follow-up tools (SMS, USSD) to improve tracking coverage.
Observational design: This is a cross-sectional observational dataset, not a randomised experiment. The negative association between career coaching and active employment observed in this dataset is most likely driven by selection effects — coaches are assigned to at-risk participants who remain less likely to achieve active employment regardless of support received. A randomised career coaching allocation in a pilot cohort would provide cleaner causal evidence of coaching’s true effect.
Single outcome variable: Active vs. inactive employment is a binary measure that does not capture income level, job quality, career progression, or alignment between training received and job performed. Future M&E design should incorporate salary data, job-skill match ratings, and six-month and twelve-month follow-up points.
Further work: With more time and computing resources, a survival analysis (Kaplan-Meier or Cox proportional hazards) modelling time to inactivity would be more informative than a binary active/inactive snapshot. Additionally, a random forest or gradient boosting model could capture non-linear interactions between participant characteristics that logistic regression misses.
References
Adi, B. (2026). Data analytics for business decision-making. Lagos Business School Press.
Agresti, A. (2002). Categorical data analysis (2nd ed.). Wiley-Interscience.
Cleveland, W. S., & McGill, R. (1984). Graphical perception: Theory, experimentation, and application to the development of graphical methods. Journal of the American Statistical Association, 79(387), 531–554. https://doi.org/10.2307/2288400
Tukey, J. W. (1977). Exploratory data analysis. Addison-Wesley.
R packages used:
R Core Team. (2024). R: A language and environment for statistical computing (Version 4.x). R Foundation for Statistical Computing. https://www.R-project.org/
Wickham, H., Averick, M., Bryan, J., Chang, W., McGowan, L. D., François, R., Grolemund, G., Hayes, A., Henry, L., Hester, J., Kuhn, M., Pedersen, T. L., Miller, E., Bache, S. M., Müller, K., Ooms, J., Robinson, D., Seidel, D. P., Spinu, V., … Yutani, H. (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686. https://doi.org/10.21105/joss.01686
Wei, T., & Simko, V. (2021). corrplot: Visualization of a correlation matrix (R package version 0.92). https://github.com/taiyun/corrplot
Robin, X., Turck, N., Hainard, A., Tiberti, N., Lisacek, F., Sanchez, J.-C., & Müller, M. (2011). pROC: An open-source package for R and S+ to analyse and compare ROC curves. BMC Bioinformatics, 12(1), 77. https://doi.org/10.1186/1471-2105-12-77
Python packages used:
McKinney, W. (2010). Data structures for statistical computing in Python. Proceedings of the 9th Python in Science Conference, 445, 51–56.
Hunter, J. D. (2007). Matplotlib: A 2D graphics environment. Computing in Science & Engineering, 9(3), 90–95. https://doi.org/10.1109/MCSE.2007.55
Waskom, M. L. (2021). Seaborn: Statistical data visualization. Journal of Open Source Software, 6(60), 3021. https://doi.org/10.21105/joss.03021
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., & Duchesnay, E. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12, 2825–2830.
Virtanen, P., Gommers, R., Oliphant, T. E., Haberland, M., Reddy, T., Cournapeau, D., Burovski, E., Peterson, P., Weckesser, W., Bright, J., van der Walt, S., Brett, M., Wilson, J., Millman, K. J., Mayorov, N., Nelson, A. R. J., Jones, E., Kern, R., Larson, E., … van Mulbregt, P. (2020). SciPy 1.0: Fundamental algorithms for scientific computing in Python. Nature Methods, 17, 261–272. https://doi.org/10.1038/s41592-019-0686-2
Dataset:
Dere, B. (2026). UpSkilling Programmes monitoring and evaluation database [Anonymised sub-sample]. PC LTD.
Appendix: AI Usage Statement
Claude (Anthropic) was used in the preparation of this document for the following purposes: generating initial Quarto document structure and YAML configuration; suggesting appropriate R and Python package combinations for each analytical technique; and debugging rendering issues in the panel-tabset layout. GitHub Copilot was used for autocomplete assistance during code writing, particularly for ggplot2 and seaborn syntax.
Independent analytical judgement was exercised in all of the following: the selection of the research question and outcome variable; the decision to use logistic regression rather than a more complex model given the timeline and interpretability requirements; the choice of BSS as the reference category in the regression; the identification and documentation of data quality issues; the interpretation of all statistical outputs in the context of programme operations; and the formulation of the three integrated recommendations. All business interpretations are the author’s own and reflect direct professional knowledge of the programme.