Interpretable Predictive Segmentation for Doctoral LIS Workforce Planning

A theory-informed, reproducible analysis of expressed doctoral study interest in the Philippine Librarians Census

Author

Dan Anthony Dorado

Affiliation

School of Library and Information Studies

Published

May 21, 2026

Abstract

Doctoral education in library and information science (LIS) is a workforce-development mechanism through which the profession builds research capacity, academic leadership, and evidence-informed institutional practice. Yet little is known about how structural opportunity, professional capital, and institutional access conditions shape expressed interest in doctoral LIS study within national librarian populations. This article develops an interpretable predictive segmentation framework using the Philippine Librarians Census to classify expressed doctoral study interest and translate empirically derived workforce segments into ethically bounded recruitment personas. The study uses a cross-sectional, sequential predictive segmentation design combining descriptive workforce profiling, interpretable machine-learning classification, calibration and subgroup diagnostics, segmentation analysis, and persona translation safeguards. The findings show that expressed doctoral study interest is patterned across educational, professional, institutional, and geographic contexts, supporting the use of predictive analytics as a planning tool rather than as an admissions, ranking, or individual forecasting system. The article contributes to LIS workforce research by demonstrating how national professional census data can support doctoral pipeline planning while preserving inferential restraint, interpretability, fairness awareness, and non-exclusionary governance.

Keywords

library and information science education, doctoral education, predictive analytics, machine learning, workforce planning, recruitment personas, Philippine librarians

1 Introduction

1.1 Doctoral Workforce Development in LIS

Doctoral education in library and information science (LIS) is not merely an advanced credentialing activity. It is part of the workforce infrastructure through which the profession develops future researchers, educators, policy actors, institutional leaders, and evidence-oriented practitioners. In national LIS systems where professional work is distributed across institutional sectors and geographic regions, doctoral program planning is therefore both an educational-design problem and a problem of structural access to professional mobility.

This article develops an interpretable predictive segmentation framework for doctoral LIS workforce planning. It does not attempt to predict actual enrollment, admissions suitability, individual capability, or professional worth. Its narrower target is expressed doctoral study interest: a self-reported signal that can support institutional planning but cannot substitute for longitudinal enrollment evidence. This construct separation is central. Expressed interest, professional preparedness, institutional reachability, recruitment feasibility, and realized enrollment are related but analytically distinct phenomena.

1.2 Structural Inequality in Graduate Education Access

The study assumes that doctoral interest is socially situated. Expressed interest may be amplified by prior access to graduate education, research mentoring, professional networks, institutional encouragement, and geographic proximity to advanced study. It may also be suppressed by cost, workload, distance, limited information, or low perceived feasibility. Thus, a weak or absent expressed-interest signal should not be read as lack of potential. It may reflect unequal opportunity to imagine doctoral study as a realistic pathway.

1.3 Philippine LIS Workforce Context

The study is grounded in the Philippine Librarians Census, a landmark national account of the librarian workforce (Obille & Dorado, 2022). Subsequent scholarship based on the same professional landscape shows the value of national workforce evidence for understanding the structure, distribution, and policy relevance of librarianship in the Philippines (Dorado, 2024). The Philippine context makes doctoral planning especially important because professional opportunities, graduate education access, institutional support, and research exposure are likely to vary across regions and employment settings.

The problem addressed here is not simply whether there is “demand” for a PhD in LIS. Demand is too broad a construct for the available cross-sectional evidence. The more defensible problem is how national workforce evidence can identify professional profiles associated with expressed doctoral study interest and how those profiles can be translated into ethically bounded recruitment personas without exceeding the limits of observational data.

1.4 Problem Statement

Despite the strategic importance of doctoral education to LIS workforce sustainability, little evidence exists regarding how structural inequalities, professional capital, and institutional access conditions shape expressed doctoral study interest within national librarian populations. Existing LIS workforce studies remain predominantly descriptive and provide limited insight into how interpretable predictive analytics may support ethically bounded doctoral pipeline planning without reproducing exclusionary educational logics.

This study develops and evaluates an interpretable predictive segmentation framework using national Philippine librarian census data to identify statistically distinguishable professional profiles associated with expressed doctoral study interest. Rather than treating machine learning as an admissions, ranking, or selection mechanism, the study positions predictive segmentation as a bounded institutional planning tool for understanding heterogeneous patterns of doctoral aspiration across professional contexts.

The study further examines whether empirically derived segments can be translated into recruitment-relevant personas while preserving the inferential limits of cross-sectional workforce data. In this framing, personas are not latent psychological types, real individuals, or deterministic categories. They are interpretive communication abstractions derived from empirical segments and intended for non-exclusionary program planning.

1.5 Research Questions and Objectives

The study is guided by four research questions, each aligned with a different inferential level:

RQ1. What demographic, educational, professional, and institutional characteristics are associated with expressed interest in doctoral LIS study among Filipino librarians?
RQ2. How accurately can interpretable machine-learning models classify expressed doctoral study interest using workforce and professional profile variables?
RQ3. What stable workforce profile segments emerge among librarians with elevated predicted probabilities of expressed doctoral study interest?
RQ4. How can empirically derived workforce segments be translated into ethically bounded recruitment personas for institutional doctoral pipeline planning?

The corresponding objectives are to construct a reproducible R workflow, evaluate interpretable prediction models, identify segmentation patterns associated with expressed doctoral study interest, and translate stable segment patterns into planning-oriented personas with explicit safeguards against overinterpretation.

1.6 Study Scope and Boundaries

This is a cross-sectional predictive segmentation study. It supports classification of expressed interest patterns and strategic interpretation of professional heterogeneity. It does not support causal claims about why librarians pursue doctoral study, individual forecasting of future enrollment, admissions decisions, merit ranking, or exclusionary recruitment.

The study identifies statistically observable patterns associated with expressed doctoral interest but does not directly model the psychological or sociological mechanisms underlying aspiration formation.

The article makes three contributions. First, it demonstrates an interpretable workforce analytics approach for doctoral planning. Second, it develops an ethically bounded approach to educational segmentation that rejects admissions, ranking, and exclusionary uses. Third, it provides evidence consistent with the view that structural inequality and professional opportunity conditions shape the visibility of doctoral study interest in a national LIS workforce.

2 Review of Related Literature

2.1 Professionalization and Doctoral Workforce Development

LIS education occupies a distinctive institutional position because it links universities, professional regulation, public knowledge institutions, and labor-market transformation. It is not simply a sequence of credentials. It is part of the infrastructure through which a profession renews its expertise, develops future educators and researchers, and responds to changes in information work. Birdi describes LIS education as an academic field that bridges higher education and LIS professional practice, a framing that is especially important when doctoral education is treated as both scholarly preparation and professional capacity-building (Birdi, 2022).

Professionalization theory strengthens this framing. Abbott’s system of professions locates professional development within struggles over expertise, jurisdiction, and the organization of expert labor (Abbott, 1988). Freidson’s account of professionalism similarly treats professional knowledge and judgment as a distinctive organizing logic (Freidson, 2001). From this perspective, doctoral LIS education is not only an individual advancement pathway; it is also a mechanism through which the profession develops research legitimacy, leadership capacity, and claims to specialized knowledge.

This dual role makes doctoral program planning inseparable from workforce planning. A PhD in LIS is not only an advanced academic offering; it is also an intervention in the future supply of researchers, faculty members, policy leaders, system designers, and evidence-oriented practitioners. International LIS literature has repeatedly connected curriculum design and professional education to perceived workforce needs. Katuli-Munyoro and Mutula, for example, frame LIS education reform around the problem of closing workforce skills gaps, emphasizing that educational programs must remain responsive to employer and professional expectations (Katuli-Munyoro & Mutula, 2017). Regional accounts of LIS education similarly show that doctoral provision is shaped by questions of access, quality, professional readiness, and the capacity of working professionals to pursue advanced study (Panigrahi, 2010).

For the Philippine context, this literature supports a strategic premise: planning a PhD in LIS requires more than documenting general interest in graduate study. It requires a structured understanding of which professional audiences are most likely to recognize doctoral education as feasible, valuable, and aligned with their career trajectories. The relevant planning question is therefore not merely whether a doctoral program is desirable, but how expressed interest is distributed across the profession and how institutional design can respond to that distribution.

2.2 Educational Aspiration and Access Theory

The construct of expressed doctoral study interest requires an access-oriented theory of aspiration. Bourdieu’s forms of capital explain why educational ambitions are shaped by accumulated cultural, social, economic, and symbolic resources (Bourdieu, 1986). Perna’s model of college access and choice further emphasizes that educational decisions are embedded in layered contexts rather than reducible to individual preference (Perna, 2006). Eccles and Wigfield’s work on motivational beliefs, values, and goals clarifies that aspiration depends on both perceived value and perceived feasibility (Eccles & Wigfield, 2002).

These perspectives suggest that doctoral interest is neither a pure psychological preference nor a direct indicator of future behavior. It is an attitudinal signal shaped by professional capital, institutional encouragement, geographic access, and perceived feasibility. This matters analytically because low expressed interest may reflect aspiration suppression under structural constraint, not lack of ability or long-term potential.

This study therefore adopts a structural opportunity theory of doctoral interest visibility. The phrase means that expressed doctoral interest becomes visible when librarians have sufficient exposure to doctoral pathways, institutional encouragement, research capital, geographic or digital access, and perceived feasibility to articulate doctoral study as a plausible future. The framework does not claim that these conditions cause aspiration in a counterfactual sense. It claims that these conditions shape whether aspiration is likely to be observable in cross-sectional workforce data.

2.3 Interpretable Educational Analytics

Predictive analytics has become a prominent approach in higher education because institutions increasingly need to transform complex administrative, learning, and career evidence into decisions about student support, program design, retention, and resource allocation. Systematic reviews of predictive learning analytics show that machine-learning models have been used across higher education to anticipate academic outcomes, dropout, satisfaction, and enrollment-related trajectories (Sghir et al., 2022). Educational data mining provides the broader methodological foundation for this work by applying statistical and computational methods to educational evidence in order to detect patterns that are difficult to identify manually (Hernandez-Blanco et al., 2019).

The present article adapts this tradition from enrolled-student analytics to expressed-interest analytics among professionals who are not yet applicants. This is a strategic shift. Conventional predictive learning analytics often centers on students already inside an institution. Doctoral pipeline planning begins earlier, with the professional workforce from which future students may emerge. In this setting, the value of prediction lies less in operational triage and more in institutional foresight: identifying where expressed interest is concentrated and where support pathways may be needed.

Comparable logic appears in other professional education domains. In medical workforce planning, for instance, researchers have proposed using machine learning to connect career intentions with individual and contextual evidence so that educational institutions and national stakeholders can better anticipate future workforce distributions (Abbiati et al., 2024). The analogy is not that LIS doctoral interest is identical to medical career choice, but that both problems require planning under uncertainty. In both cases, educational institutions must interpret stated aspirations before they become confirmed enrollments or completed career pathways.

2.4 Segmentation and Persona Research

Prediction alone is rarely sufficient for program strategy. A model may estimate likelihood, rank predictors, or distinguish broad patterns, but institutional leaders still need a language for action. Segmentation provides that language by grouping members of a target population into strategically meaningful profiles. In enrollment planning, segmentation can help differentiate immediate high-readiness audiences from developmental audiences who may require bridge pathways, advising, financial support, or flexible delivery models.

This distinction is important for doctoral LIS education because the relevant audience is heterogeneous. Some professionals may already possess forms of educational and professional capital that make doctoral study appear feasible, while others may express interest but face stronger access barriers. A segmentation framework allows a proposed program to avoid treating the profession as a single recruitment market. Instead, it can design differentiated outreach and support strategies around empirically derived audience profiles.

In this sense, the purpose of machine learning is not simply classification. It is strategic compression: reducing a complex professional landscape into interpretable patterns that are sufficiently robust to guide planning, but sufficiently cautious to avoid overclaiming. This approach is especially appropriate for exploratory interest research, where the available evidence can identify plausible patterns of aspiration but cannot by itself prove eventual application, admission, or enrollment behavior.

Personas offer a bridge between quantitative segmentation and human-centered program design. In design and education research, personas are used to represent groups of users or learners in a form that helps decision-makers reason about needs, motivations, constraints, and interventions. Recent work on quantitative or data-driven personas argues that personas can be constructed from empirical evidence when the evidence is selected for a clear design purpose and translated into interpretable profiles (Almahri et al., 2019; Park & Kang, 2022).

For doctoral program planning, personas are useful because they connect statistical patterns to institutional decisions. A persona can suggest what kind of advising message might resonate, what barriers a prospective enrollee might face, what delivery arrangements may matter, and what kind of scholarly identity the program should cultivate. Personas therefore help move from “who is more likely” to “what should the program do differently for this audience?”

At the same time, data-driven personas carry risks. Salminen, Jung, and Jansen warn that personas can become harmful when they are based on superficial data, weak communication about methods, or overconfident interpretations of messy evidence (Salminen et al., 2021). For this article, that critique is not a reason to avoid personas; it is a reason to build them transparently. Persona labels must be treated as analytic summaries, not identities. They should remain traceable to the modeling workflow, open to revision, and bounded by the limitations of the evidence from which they are derived.

Latent segmentation literature also matters because personas are only as credible as the segments from which they are translated. Latent class analysis treats unobserved subgroups probabilistically and provides tools for evaluating model fit, class separation, and classification uncertainty (L. M. Collins & Lanza, 2009). This logic informs the present study’s caution: segment labels are heuristic summaries of statistical patterns, not stable sociological types.

2.5 Fairness and Ethics in Predictive Educational Systems

Predictive educational systems can easily shift from support to governance if their outputs are treated as objective judgments. Slade and Prinsloo argue that learning analytics raises ethical issues around responsibility, agency, and institutional power (Slade & Prinsloo, 2013). Rudin’s argument for interpretable models in high-stakes settings further supports the use of transparent and inspectable analytics rather than opaque decision systems (Rudin, 2019). Fairness research also cautions that group-level performance, error disparities, and calibration differences matter when predictive systems are used in socially consequential domains (Barocas et al., 2023).

For this study, ethical analytics means that prediction must remain non-exclusionary. Model outputs may guide where institutions invest additional advising, mentoring, and bridge support, but they must not be used to rank individuals, deny opportunity, or define admissions worthiness. Critical educational analytics research further cautions that data systems can naturalize categories, obscure institutional power, and reproduce inequality when outputs are treated as neutral descriptions of learners or professionals (Knox, 2017; Macgilchrist, 2021).

2.6 Conceptual Framework

The study is anchored in an educational and professional mobility framework. Bourdieu’s account of capital is useful because doctoral aspiration is shaped not only by individual preference, but also by accumulated educational, social, professional, and symbolic resources (Bourdieu, 1986). Perna’s college access and choice model similarly emphasizes that educational choice is embedded in layered social, institutional, and policy contexts rather than reducible to individual motivation (Perna, 2006). Abbott’s sociology of professions further positions advanced expertise and jurisdictional claims as central to professional development, making doctoral LIS education relevant to the future organization of information work (Abbott, 1988). Finally, expectancy-value theory clarifies why expressed interest must be distinguished from actual enrollment: individuals may value doctoral study while doubting feasibility, cost, access, or institutional fit (Eccles & Wigfield, 2002).

The conceptual logic is as follows. Structural conditions, such as geographic location, institutional setting, employment sector, and access to graduate education, shape opportunities for professional capital accumulation. Professional capital includes research exposure, continuing education, leadership experience, academic engagement, and professional networks. These resources influence expressed doctoral study interest, which becomes observable in survey data. Predictive modeling then classifies patterns associated with that expressed interest. Segmentation groups statistically similar profiles, and persona translation converts those profiles into bounded planning abstractions.

Figure 1 provides the visual framework guiding the study. The figure was originally developed to describe AI/ML-generated student personas and their perceived usefulness in LIS education; here it serves as a conceptual scaffold for the analytic sequence that runs from multi-source evidence, through persona generation, to planning usefulness and outcomes, together with the boundary conditions, assumptions, and feedback loops that constrain that sequence.

Figure 1: Interpretable and ethical analytics framework for AI/ML-generated personas in LIS education.

As shown in Figure 1, persona generation is not the endpoint of the analysis. The framework requires explicit attention to mediating mechanisms, boundary conditions, assumptions, and feedback. For the present study, those boundary conditions include the cross-sectional nature of the data, the distinction between expressed interest and enrollment, the non-causal design, and the prohibition against using persona outputs for individual ranking or admissions decisions.

2.7 Gap and Positioning of the Present Study

The literature establishes five foundations for the present study. First, LIS education is a professional and workforce institution, not only an academic credentialing pathway (Birdi, 2022; Katuli-Munyoro & Mutula, 2017). Second, educational aspiration is structured by capital, access, professional identity, and perceived feasibility rather than by interest alone (Bourdieu, 1986; Eccles & Wigfield, 2002; Perna, 2006). Third, predictive analytics can support educational planning when used to identify patterns rather than to make deterministic claims about individuals (Hernandez-Blanco et al., 2019; Sghir et al., 2022). Fourth, workforce-oriented prediction is especially valuable when institutions must plan for future choices before those choices become observable outcomes (Abbiati et al., 2024). Fifth, data-driven personas can translate quantitative evidence into strategic design language, provided they are constructed with transparency and caution (Almahri et al., 2019; Park & Kang, 2022; Salminen et al., 2021).

What remains underdeveloped is a reproducible, LIS-specific framework that connects national workforce evidence, machine-learning classification, professional segmentation, and ethically bounded persona translation. Existing work provides the conceptual pieces, but few studies integrate them into a single article-length analytic workflow for PhD program planning in LIS. This study addresses that gap by treating machine learning as a strategic instrument for classifying expressed doctoral study interest while preserving the distinction between aspiration, recruitment potential, and realized enrollment.

3 Methodology

3.1 Sequential Predictive Segmentation Design

This study uses a sequential predictive segmentation design: a cross-sectional interpretable predictive modeling stage followed by post-hoc segmentation and bounded persona translation. The analysis is exploratory and strategic rather than causal or confirmatory. Its purpose is to classify patterns of expressed doctoral study interest, identify interpretable professional segments, and translate empirically stable segment patterns into recruitment personas. Consistent with transparent prediction-model reporting principles, this section documents data provenance, outcome construction, eligibility rules, preprocessing, model development, validation, and interpretive use of model outputs (G. S. Collins et al., 2015).

The study follows a prediction-and-interpretation workflow. The major stages are summarized in Table 1, which functions as the reproducibility map for the rest of the article.

1. Define the analytic outcome as expressed doctoral study interest.
2. Construct an eligible analytic sample from the census.
3. Prepare predictors representing demographic, educational, employment, geographic, professional-development, and professional-affiliation domains.
4. Train interpretable benchmark and non-linear machine-learning models.
5. Evaluate discrimination, calibration-relevant summaries, and classification performance.
6. Use model scores and segmentation to construct bounded strategic personas.
7. Interpret the personas as planning profiles rather than deterministic labels.

Table 1: Overview of the reproducible methodology.

Stage	Purpose
Data source	Use the openly available Philippine Librarians Census as national workforce evidence.
Outcome construction	Identify respondents reporting doctoral study as a future educational plan, interpreted as expressed interest rather than enrollment behavior.
Analytic sample	Restrict analysis to respondents with observable outcome status and without current PhD attainment.
Preprocessing	Clean missing values, recode outcome, retain substantively relevant planning predictors, and avoid leakage.
Prediction	Classify expressed doctoral study interest using interpretable and non-linear classifiers.
Segmentation	Group respondents into strategically meaningful workforce profiles.
Persona synthesis	Translate statistical patterns into recruitment and program-design personas with explicit anti-stereotyping safeguards.

As Table 1 shows, the workflow is organized from provenance to interpretation. This order is deliberate: the personas are presented only after the outcome, sample, predictors, preprocessing, prediction, and segmentation logic have been made explicit.

3.2 Data Source and Census Structure

The empirical source is the Philippine Librarians Census (Obille & Dorado, 2022), a national survey conducted by the University of the Philippines School of Library and Information Studies from November 2018 to October 2019 and released openly through Zenodo. The census was designed to establish baseline evidence for LIS workforce planning, professional development, education, and policy. It remains especially valuable for this study because it links professional circumstances with educational aspirations in a single national workforce instrument.

The analysis uses the local RDS file included with this article’s reproducible materials. The file is read directly in R to avoid transcription or spreadsheet-conversion errors. Table 2 records the source file name and confirms the dimensions of the imported object.

Table 2: Imported source data audit.

Object	Source_File	Rows	Columns
raw_census	LibrarianCensus.rds	684	107

The audit in Table 2 establishes that the analysis begins from the expected source object before any exclusions or transformations are applied. This table is included to make the starting point of the workflow inspectable.

3.3 Reproducibility Repository and Public Rendering

The article is accompanied by a public reproducibility repository at https://github.com/dddorado/lis-phd-workforce-analytics. The repository contains the Quarto manuscript, local RDS data source, bibliography, citation style, generated tables, generated figures, model diagnostic output, conceptual framework image, and rendered HTML and PDF outputs. It is intended to make the full evidentiary chain inspectable: readers can trace the analysis from the imported census object, through the executable R workflow, to the tables and figures discussed in the results.

A public rendered instance of the article is available on RPubs at https://rpubs.com/danddorado/1434018. The RPubs version serves as a stable reading copy, while the GitHub repository functions as the computational archive. The two outputs should be read together: RPubs supports access and review, and GitHub supports rerendering, file inspection, version control, and verification of the analysis artifacts. This reproducibility arrangement does not change the inferential limits of the study; it makes those limits more transparent by exposing the source file, operationalization decisions, generated outputs, and reporting workflow used to produce the manuscript.

3.4 Outcome Operationalization and Construct Validity

The dependent variable is expressed doctoral study interest, operationalized from the survey’s forward-looking education-plan item. Respondents who selected PhD are coded as the positive class, while respondents who selected another observed plan are coded as the comparison class. Respondents without an observed future-education plan are excluded from supervised modeling because their outcome status cannot be determined. Respondents already reporting PhD attainment are also excluded so that the model classifies possible future-oriented interest rather than describing current PhD holders.

This outcome is interpreted as expressed interest, not as application, admission, enrollment, persistence, capability, suitability, or demand for a specific program. The distinction is central to construct validity. A stated plan can inform recruitment strategy, but it cannot by itself establish realized demand. The binary coding is therefore a pragmatic classification device rather than a claim that doctoral interest is naturally dichotomous. Table 3 documents the resulting analytic sample and the observed positive-class rate after applying these rules.

Table 3: Analytic outcome and sample audit.

Measure	Value
Original census records	684
Records with observed modeled outcome and no current PhD	420
Positive outcome records	135
Observed positive outcome rate	32.1%

As shown in Table 3, the supervised analysis is based on the subset of records for which expressed doctoral study interest can be observed. The positive-class rate provides the baseline against which later model performance and segmentation results should be read.
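The coding rules and the baseline implied by Table 3 can be illustrated with a minimal sketch. The article's own workflow is written in R; the Python below is only an illustration, and the field names `future_plan` and `has_phd`, the response label `"PhD"`, and the toy records are assumptions rather than census fields or census data. The only figures taken from the article are the 420 analytic records and 135 positive records.

```python
from typing import Optional


def code_outcome(future_plan: Optional[str], has_phd: bool) -> Optional[int]:
    """Return 1 (PhD plan), 0 (other observed plan), or None (excluded)."""
    if has_phd:
        return None  # current PhD holders are excluded from modeling
    if future_plan is None or not future_plan.strip():
        return None  # no observed plan: outcome status cannot be determined
    return 1 if future_plan.strip().lower() == "phd" else 0


# Toy records invented for illustration (hypothetical field names and values).
records = [
    {"future_plan": "PhD", "has_phd": False},
    {"future_plan": "Master's", "has_phd": False},
    {"future_plan": None, "has_phd": False},
    {"future_plan": "PhD", "has_phd": True},
]
coded = [code_outcome(r["future_plan"], r["has_phd"]) for r in records]

# Baselines implied by the counts reported in Table 3.
n_analytic, n_positive = 420, 135
prevalence = n_positive / n_analytic
# Accuracy of always predicting the more common class.
majority_accuracy = max(prevalence, 1 - prevalence)
# Brier score of predicting the prevalence for every respondent.
baseline_brier = prevalence * (1 - prevalence)

print(f"prevalence={prevalence:.3f}")
print(f"majority-class accuracy={majority_accuracy:.3f}")
print(f"prevalence-only Brier score={baseline_brier:.3f}")
```

Under these counts, a classifier adds planning value only to the extent that it improves on the prevalence-only reference values printed above.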

3.5 Predictor Domains

For RQ1, predictors are organized into four conceptual domains: demographic location in the workforce, educational capital, professional capital, and institutional access conditions. Demographic variables describe social and geographic position. Educational variables represent accumulated formal preparation. Professional variables represent career stage, role, sector, and workplace experience. Institutional access variables represent conditions that may make doctoral study more or less feasible. This domain structure guards against a purely data-driven predictor list and keeps interpretation tied to the theoretical framework.

3.6 Leakage Control

Predictors are selected to represent workforce characteristics that could plausibly inform doctoral recruitment strategy before a PhD program has direct applicant data. The candidate predictor domains include demographic background, professional experience, educational preparation, current study status, employment setting, institutional context, compensation, region, mobility, continuing professional development, collection or service indicators, and professional affiliation.

Variables that directly encode the outcome, duplicate the outcome, or would be unavailable in a real recruitment-planning scenario are excluded from model training. This leakage-control step is important because the goal is not merely to maximize predictive performance. The goal is to estimate a credible planning model whose predictors could be interpreted ethically and operationally. Table 4 summarizes how many protocol-defined predictors are available in the imported source.

Table 4: Predictor availability audit.

Measure	Value
Candidate predictors named in protocol	46
Available predictors	46
Unavailable predictors	0

The availability audit in Table 4 is a reproducibility checkpoint. It separates the conceptual modeling protocol from the fields actually present in the local source file, making later changes in source structure easier to diagnose.
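An audit of this kind can be sketched in a few lines. The sketch below is illustrative rather than the study's own code: the function name `audit_predictors` and the short example lists are hypothetical, although the field names are borrowed from Table 5.

```python
def audit_predictors(protocol_fields, source_columns):
    """Compare predictors named in the protocol with columns present in the source."""
    source = set(source_columns)
    available = [f for f in protocol_fields if f in source]
    unavailable = [f for f in protocol_fields if f not in source]
    return {
        "named_in_protocol": len(protocol_fields),
        "available": len(available),
        "unavailable": len(unavailable),
        "unavailable_fields": unavailable,
    }


# Hypothetical example: three protocol fields, one absent from the source file.
protocol = ["tenure", "position", "industry"]
columns = ["tenure", "position", "age_group"]
report = audit_predictors(protocol, columns)
print(report)
```

Reporting the names of unavailable fields, and not only their count, is what makes a later change in source structure straightforward to diagnose.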

3.7 Missing Data and Preprocessing

The preprocessing protocol is designed to preserve substantive meaning while making the data suitable for supervised learning and segmentation. Text fields are trimmed, common non-informative responses are converted to missing values, categorical predictors are treated as nominal unless an ordered interpretation is explicit, and numeric fields are retained as numeric when their meaning is stable. Apparent extreme values in age and years of service are treated conservatively as missing for modeling because they are more likely to reflect data-entry noise than plausible professional histories.

Missingness is handled inside the modeling workflow rather than by deleting all incomplete records. The working assumption is that values are unlikely to be missing completely at random, because survey nonresponse may reflect professional context, question relevance, or respondent burden. Numeric predictors are imputed using training-set medians. Categorical predictors are imputed using an explicit missing category. Low-frequency categorical levels may be pooled during modeling to reduce instability. The imputation and encoding operations are estimated within resampling folds, so that no information from assessment data leaks into training data. Table 5 identifies the candidate predictors with the highest missingness after sample construction, which is useful for interpreting model stability and the limits of persona detail.

Table 5: Highest-missingness candidate predictors after analytic-sample construction.

Field	Missing N	Missing %
completing	199	47.4
net_salary_group	126	30.0
gross_salary_group	85	20.2
type	77	18.3
net_salary	54	12.9
gross_salary	50	11.9
location	22	5.2
lvm	22	5.2
worktravel	20	4.8
position	6	1.4
tenure	4	1.0
cpdsatisfaction	4	1.0
industry	3	0.7
age_group	2	0.5
benefits6	2	0.5

The missingness pattern in Table 5 is not treated as a reason to discard the analytic sample. Instead, it informs cautious interpretation: highly incomplete fields may still contribute planning signal, but they should not carry excessive weight in persona narratives.
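The fold-internal imputation described in this section can be illustrated with a minimal sketch. It assumes records are held as dictionaries with `None` for missing values; the helper names, the `"Missing"` label, and the salary figures are hypothetical and are not taken from the census.

```python
from statistics import median

MISSING_LABEL = "Missing"  # assumed name for the explicit missing category


def fit_imputer(train_rows, numeric_fields, categorical_fields):
    """Learn imputation values from training rows only."""
    medians = {}
    for field in numeric_fields:
        observed = [r[field] for r in train_rows if r[field] is not None]
        medians[field] = median(observed)  # raises if a field is entirely missing
    return {"medians": medians, "categorical": list(categorical_fields)}


def apply_imputer(rows, imputer):
    """Fill missing values using parameters learned on the training rows."""
    filled_rows = []
    for r in rows:
        filled = dict(r)
        for field, value in imputer["medians"].items():
            if filled[field] is None:
                filled[field] = value
        for field in imputer["categorical"]:
            if filled[field] is None:
                filled[field] = MISSING_LABEL
        filled_rows.append(filled)
    return filled_rows


# Hypothetical training and assessment rows for one resampling fold.
train = [
    {"gross_salary": 20000, "type": "Academic"},
    {"gross_salary": None, "type": None},
    {"gross_salary": 30000, "type": "School"},
    {"gross_salary": 40000, "type": "Academic"},
]
assess = [{"gross_salary": None, "type": None}]

imputer = fit_imputer(train, ["gross_salary"], ["type"])
assess_filled = apply_imputer(assess, imputer)
print(assess_filled)
```

Separating the fit step from the apply step is the design choice that matters here: the assessment rows never contribute to the median, which is the leakage safeguard the protocol calls for.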

3.8 Modeling and Validation

Two complementary supervised-learning approaches are specified. First, a regularized logistic regression model serves as an interpretable benchmark because it estimates a linear decision surface and can identify stable directional associations after encoding. Second, a random forest model serves as a flexible non-linear classifier because it can capture interactions and threshold effects common in workforce data. Where software availability permits, explainable boosting or SHAP-supported gradient boosting may be added as a sensitivity model, but black-box deep learning is not appropriate for the size, structure, or institutional stakes of this dataset. The benchmark model emphasizes transparency; the non-linear model emphasizes predictive flexibility.

Model development is organized around a stratified 75/25 train-test split, with repeated five-fold cross-validation within the training data. Hyperparameters are selected within the resampling procedure using ROC AUC as the primary tuning criterion and average precision as a secondary check under class imbalance. The holdout test set is reserved for final performance estimation. The primary discrimination metric is ROC AUC. Secondary metrics include average precision, balanced accuracy, sensitivity, specificity, precision, recall, F1, Brier score, and calibration diagnostics. Because this is a planning model, no single threshold is treated as universally correct. Threshold-dependent metrics are reported at a transparent reference threshold and may be varied in sensitivity analysis rather than interpreted as final decision rules.
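A stratified split of the kind described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the study's implementation: the function name, the seed, and the rounding rule are hypothetical, and only the class counts (135 positive, 285 negative) come from Table 3.

```python
import random


def stratified_split(labels, train_fraction=0.75, seed=0):
    """Return (train_idx, test_idx) with the class ratio preserved in both parts."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        cut = round(len(idx) * train_fraction)  # per-class cut point
        train_idx.extend(idx[:cut])
        test_idx.extend(idx[cut:])
    return sorted(train_idx), sorted(test_idx)


# Class counts from Table 3: 135 positive and 285 negative records.
labels = [1] * 135 + [0] * 285
train_idx, test_idx = stratified_split(labels)

train_rate = sum(labels[i] for i in train_idx) / len(train_idx)
test_rate = sum(labels[i] for i in test_idx) / len(test_idx)
print(len(train_idx), len(test_idx), round(train_rate, 3), round(test_rate, 3))
```

Under this rounding rule both partitions keep a positive-class rate close to the overall 32.1%. The same per-class logic applies when the training partition is divided into cross-validation folds.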

The preferred validation architecture includes calibration assessment and subgroup performance checks. Calibration matters because institutional planning depends on whether estimated probabilities are numerically meaningful, not only whether cases are ranked correctly. Subgroup performance checks are required because educational access and professional opportunity can be stratified by gender, region, sector, and institutional context. These checks are interpreted as bias diagnostics rather than as claims of model fairness.

3.9 Calibration and Fairness Audit

Because doctoral access is socially and institutionally stratified, predictive segmentation can reproduce existing inequalities if used carelessly. The study therefore treats bias assessment as part of interpretation rather than as an optional technical appendix. At minimum, subgroup diagnostics should examine model performance across gender, region, institutional sector, and employment setting. Relevant summaries include subgroup recall, false-positive rate, false-negative rate, subgroup calibration, demographic parity of high-score classification, and the distribution of predicted probabilities across groups.

These diagnostics do not make the model fair by themselves. They identify where model outputs may be less reliable, where outreach could become exclusionary, and where program planners should avoid overinterpreting modeled likelihood. The ethical standard is non-exclusionary deployment: model outputs may inform where additional advising, bridge support, or recruitment attention is needed, but they must not be used to restrict opportunity.
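The subgroup summaries listed above can be computed with a short routine such as the one below. It is a sketch under assumptions: the function name, the 0.5 reference threshold, and the toy data are hypothetical, and the calibration gap is defined as mean score minus observed rate.

```python
def subgroup_diagnostics(y_true, scores, groups, threshold=0.5):
    """Per-group recall, error rates, high-score rate, and calibration gap."""
    results = {}
    for g in sorted(set(groups)):
        idx = [i for i, grp in enumerate(groups) if grp == g]
        tp = sum(1 for i in idx if y_true[i] == 1 and scores[i] >= threshold)
        fn = sum(1 for i in idx if y_true[i] == 1 and scores[i] < threshold)
        fp = sum(1 for i in idx if y_true[i] == 0 and scores[i] >= threshold)
        tn = sum(1 for i in idx if y_true[i] == 0 and scores[i] < threshold)
        n = len(idx)
        observed_rate = (tp + fn) / n
        mean_score = sum(scores[i] for i in idx) / n
        results[g] = {
            "n": n,
            # Rates are None when the group has no cases of the relevant class.
            "recall": tp / (tp + fn) if tp + fn else None,
            "false_positive_rate": fp / (fp + tn) if fp + tn else None,
            "false_negative_rate": fn / (tp + fn) if tp + fn else None,
            "high_score_rate": (tp + fp) / n,
            "calibration_gap": mean_score - observed_rate,
        }
    return results


# Hypothetical toy data: two groups, six records.
y_true = [1, 1, 0, 0, 1, 0]
scores = [0.8, 0.4, 0.6, 0.2, 0.9, 0.1]
groups = ["A", "A", "A", "A", "B", "B"]
diagnostics = subgroup_diagnostics(y_true, scores, groups)
print(diagnostics)
```

Because every rate depends on the chosen threshold, the diagnostics are best rerun at several thresholds before any subgroup difference is interpreted.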

3.10 Segmentation Stability Assessment

The persona component requires segmentation in addition to prediction. Segmentation is conducted on planning-relevant workforce features rather than on the outcome alone. This ensures that personas represent recognizable professional profiles, not merely high- and low-probability bins. The segmentation procedure standardizes numeric features, encodes categorical features, and compares candidate solutions for interpretability, separation, and strategic usefulness.

The preferred solution is not selected mechanically by a single index. It must satisfy four criteria: statistical adequacy, stability, interpretability to LIS decision-makers, and usefulness for differentiated recruitment or program design. Latent class analysis or Gaussian mixture modeling would be preferred when assumptions and software support permit probabilistic class membership. K-means can be used only as a pragmatic exploratory method when variables are carefully encoded and standardized, and when the solution is interpreted cautiously.

Cluster validation should include silhouette or comparable separation diagnostics, sensitivity to the number of segments, sensitivity to initialization, entropy or posterior classification uncertainty when probabilistic methods are used, and bootstrap or resampling-based stability checks. Each resulting segment is profiled by its size, observed expressed-interest rate, central tendencies, dominant professional characteristics, uncertainty indicators, and modeled likelihood distribution. Segments are retained only if they are reproducible enough to support interpretation and distinct enough to justify differentiated planning.
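To make the separation diagnostic concrete, the following sketch computes a mean silhouette coefficient with Euclidean distance. It assumes features are already encoded and standardized and that at least two segments exist; the function name and the four toy points are hypothetical.

```python
from math import dist


def mean_silhouette(points, labels):
    """Mean silhouette coefficient using Euclidean distance."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    total = 0.0
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # a singleton cluster contributes a silhouette of 0
        # a: mean distance to the other members of the same cluster.
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(dist(p, q) for q in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        total += (b - a) / denom if denom else 0.0
    return total / len(points)


# Hypothetical toy data: two tight pairs that are far apart.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
well_separated = mean_silhouette(points, [0, 0, 1, 1])
poorly_separated = mean_silhouette(points, [0, 1, 0, 1])
print(round(well_separated, 3), round(poorly_separated, 3))
```

Values near 1 indicate compact, well-separated segments, values near 0 indicate overlap, and negative values indicate that cases sit closer to another segment than to their own.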

3.11 Persona Translation Protocol

Personas are constructed after prediction and segmentation, using four evidence streams: observed expressed interest by segment, modeled likelihood by segment, dominant professional characteristics, and strategic recruitment implications. Each persona includes a descriptive label, profile markers, likely program-design needs, recruitment implications, uncertainty qualifiers, and cautions. Labels are intentionally interpretive and should be read as planning shorthand rather than as essential characteristics of individuals.

Persona construction follows explicit translation rules. First, every persona must be traceable to a segment or empirically observed profile. Second, persona narratives must distinguish expressed interest from actual enrollment behavior. Third, no persona should be used to exclude individuals from outreach, advising, or opportunity. Fourth, every persona must include the statement: “This profile represents a statistical tendency rather than an individual prediction.” Fifth, persona labels must avoid claims about ability, merit, motivation, or psychological type unless directly supported by evidence. Sixth, personas should be reviewed by LIS educators or stakeholders before operational use to assess resonance, stereotype risk, and practical relevance. The proper use of personas is to broaden and sharpen recruitment design, not to narrow access. Table 6 specifies the required components of each persona so that the narrative profiles remain auditable.

Table 6: Persona construction template.

Element	Description
Persona label	Short strategic name for the segment.
Empirical basis	Segment membership, observed outcome rate, and modeled likelihood distribution.
Profile markers	Dominant professional and educational characteristics.
Recruitment implication	Message, channel, or advising strategy suggested by the profile.
Program-design implication	Scheduling, bridge, supervision, funding, or curriculum implication.
Uncertainty qualifier	Explicit reminder that the persona is a statistical tendency rather than an individual prediction.
Caution	Boundary condition preventing overinterpretation or exclusionary use.

As Table 6 indicates, each persona must include both strategic implications and a caution statement. This keeps the personas useful for recruitment design while reducing the risk that they will be mistaken for fixed identities or eligibility categories.
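One way to keep the template auditable is to encode it as a record type that rejects incomplete personas. The sketch below is an assumption about how that could be done, not part of the study's workflow: the class, field names, and caution wording are hypothetical, while the example values are taken from Table 15 and the required statement is the one given in the translation rules.

```python
from dataclasses import dataclass

# Statement required by the fourth translation rule.
REQUIRED_QUALIFIER = (
    "This profile represents a statistical tendency "
    "rather than an individual prediction."
)


@dataclass(frozen=True)
class Persona:
    label: str
    segment: str
    segment_n: int
    observed_rate: float
    profile_markers: tuple
    recruitment_implication: str
    program_design_implication: str
    uncertainty_qualifier: str
    caution: str

    def __post_init__(self):
        # Enforce traceability, the uncertainty statement, and the caution.
        if not self.segment.strip():
            raise ValueError("persona must be traceable to a segment")
        if REQUIRED_QUALIFIER not in self.uncertainty_qualifier:
            raise ValueError("persona is missing the required uncertainty statement")
        if not self.caution.strip():
            raise ValueError("persona needs a caution statement")


example = Persona(
    label="The advanced academic professional",
    segment="Advanced-degree growth seekers",
    segment_n=19,
    observed_rate=0.526,
    profile_markers=("MLIS", "Private", "Academic", "Supervisory"),
    recruitment_implication="Emphasize research mentoring and publication pathways.",
    program_design_implication="Flexible dissertation supervision.",
    uncertainty_qualifier=REQUIRED_QUALIFIER,
    # Illustrative wording only; the article does not prescribe caution text.
    caution="Not an eligibility category; never use to exclude anyone from outreach.",
)
print(example.label, example.observed_rate)
```

Validating at construction time means a persona without its qualifier or caution cannot enter a planning document by accident.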

3.12 Ethical Governance

The analysis uses openly available secondary data and reports only aggregate patterns. Its purpose is strategic program planning, not individual-level selection, eligibility screening, or automated decision-making. Even when a model assigns high or low likelihood to a profile, that estimate should not be interpreted as an individual’s capacity for doctoral study. Doctoral participation is shaped by institutional support, financing, mentoring, scheduling, research opportunity, and personal circumstances that may not be fully captured in workforce data.

The ethical stance of the study is therefore opportunity-oriented. Machine learning is used to identify where program outreach, bridge advising, and support structures may be most needed. Personas are treated as tools for institutional empathy and planning discipline, not as fixed categories of professional worth.

4 Results

4.1 Analytic Sample and Outcome Distribution

The first result is the construction of the analytic sample used for classification and persona development. Table 7 summarizes the source file, the retained modeling sample, and the observed expressed-interest rate. The table establishes the empirical baseline for the remaining results: the analysis is not estimating universal demand, but modeling the subset of respondents for whom future educational intention can be observed and interpreted.

Table 7: Analytic sample and outcome summary.

Measure	Value
Rows in original RDS	684
Variables in original RDS	107
Modeling sample after target observed and current PhD excluded	420
PhD-intending respondents in modeling sample	135
Observed PhD-interest rate	32.1%
Outcome definition	after5years == "Ph.D."

As shown in Table 7, roughly one-third of the analytic sample (135 of 420 respondents) expressed doctoral study interest. This rate is large enough to justify classification as an exploratory planning exercise, but it also means that about two-thirds of respondents did not state a doctoral plan, so a single generic message would be aimed mostly at librarians who have not expressed interest. Recruitment strategy should therefore be segmented rather than generic.
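The outcome coding summarized in Table 7 can be sketched as follows. This is a minimal illustration under assumptions: `after5years` and the value "Ph.D." come from the outcome definition, whereas the `has_phd` flag, the function name, and the toy records are hypothetical stand-ins for however the source file marks current doctorate holders and other plans.

```python
def code_outcome(record):
    """Return 1 or 0 for expressed doctoral interest, or None if out of scope."""
    if record.get("has_phd"):  # hypothetical flag for current PhD holders
        return None
    plan = record.get("after5years")
    if plan is None:  # outcome not observed
        return None
    return 1 if plan == "Ph.D." else 0


# Hypothetical toy records.
records = [
    {"after5years": "Ph.D.", "has_phd": False},
    {"after5years": "MLIS", "has_phd": False},
    {"after5years": None, "has_phd": False},
    {"after5years": "Ph.D.", "has_phd": True},
]
coded = [code_outcome(r) for r in records]
print(coded)

# Arithmetic check against Table 7: 135 positive records out of 420.
observed_rate_pct = round(100 * 135 / 420, 1)
print(observed_rate_pct)
```

Returning `None` for out-of-scope records, instead of coding them as 0, keeps unobserved intentions from being counted as a lack of interest.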

Figure 2 visualizes the same outcome distribution. The figure is useful because it makes the class balance visible before model performance is interpreted. A moderately imbalanced outcome requires attention to metrics beyond accuracy, especially recall, precision, balanced accuracy, calibration, and area-under-curve measures.

Figure 2: Distribution of the modeled doctoral-aspiration outcome.

The distribution in Figure 2 supports the use of both threshold-independent and threshold-dependent metrics. Because the positive class represents a strategic recruitment audience rather than a rare adverse event, the analysis prioritizes ranking, segmentation, and interpretation over a single fixed classification threshold.

4.2 Expressed Doctoral Study Interest Across Professional Groups

Before interpreting model output, it is important to inspect observed expressed-interest rates across major professional groups. Table 8 reports the highest observed group rates among categories with sufficient representation. These descriptive patterns are not causal estimates, but they identify where expressed doctoral study interest appears most concentrated.

Table 8: Highest observed doctoral-aspiration rates by professional group.

Variable	Category	N	PhD-intending N	PhD-interest rate
Age group	51-60	14	9	0.643
Position level	Management	11	7	0.636
Highest/current education	MLIS	140	85	0.607
Highest/current education	Master degree	23	12	0.522
Currently enrolled	Yes	166	77	0.464
Island group	Mindanao	54	23	0.426
Age group	41-50	70	29	0.414
Library type	Academic	197	80	0.406
Position level	Supervisory	127	50	0.394
Institution sector	Government	196	76	0.388
Island group	Luzon	138	51	0.370
Age group	31-40	126	43	0.341

The pattern in Table 8 suggests that expressed interest is associated with professional capital and institutional location rather than simple headcount alone. The table also shows why persona construction is preferable to broad recruitment messaging: high-interest groups can be small, while larger groups may require different forms of preparation and support.

The group-level plots in Figure 3, Figure 4, Figure 5, and Figure 6 provide complementary visual checks. Together, they show that expressed interest varies across educational preparation, institutional sector, library type, and position level, reinforcing the need for differentiated recruitment and advising.

Figure 3: Observed doctoral-aspiration rate by educational background.

Figure 3 indicates that educational background is one of the clearest descriptive separators of expressed doctoral study interest. This does not imply that less advanced groups should be excluded from outreach; rather, it suggests that bridge pathways and advising may be especially important for audiences earlier in the graduate-study pipeline.

Figure 4: Observed doctoral-aspiration rate by institutional sector.

Figure 4 shows that institutional context matters for interpreting demand. Sectoral differences may reflect career incentives, promotion structures, research expectations, and access to graduate-study support.

Figure 5: Observed doctoral-aspiration rate by library type.

Figure 5 adds a service-context dimension to the results. Library type helps distinguish audiences whose doctoral motivations may differ, such as research leadership, academic service development, professional advancement, or specialized institutional capacity-building.

Figure 6: Observed doctoral-aspiration rate by position level.

Figure 6 suggests that career stage is also relevant. Doctoral recruitment cannot be reduced to early-career interest alone; it must also account for mid-career and senior professionals who may view doctoral study as a route to leadership, research authority, or academic mobility.

4.3 Predictive Model Performance

The supervised models were evaluated through repeated cross-validation, with regularized logistic regression serving as an interpretable benchmark and random forest serving as a flexible non-linear classifier. Table 9 reports the cross-validated performance metrics for both models.

Table 9: Cross-validated predictive performance by model.

Model	ROC AUC	Average precision	Balanced accuracy	F1	Recall	Precision
Regularized logistic regression	0.793 +/- 0.026	0.601 +/- 0.029	0.711 +/- 0.039	0.608 +/- 0.051	0.674 +/- 0.100	0.558 +/- 0.029
Random forest	0.816 +/- 0.016	0.622 +/- 0.066	0.700 +/- 0.028	0.594 +/- 0.040	0.659 +/- 0.103	0.548 +/- 0.022

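As Table 9 shows, both models perform meaningfully above chance (a ROC AUC of 0.5), indicating that expressed doctoral study interest is patterned rather than random with respect to the available workforce evidence. The random forest has the higher mean ROC AUC (0.816 versus 0.793) and average precision (0.622 versus 0.601), whereas the logistic model is marginally higher on balanced accuracy, F1, recall, and precision. These differences are small relative to the reported +/- ranges, so neither model clearly dominates, and the logistic model remains useful as a transparent benchmark. The appropriate interpretation is therefore not that the model can determine who will enroll, but that the data contain enough signal to support strategic segmentation.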
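What "above chance" means for ROC AUC can be shown with a direct pairwise computation. The sketch below uses the standard rank interpretation of the metric; the function name and the toy scores are hypothetical and unrelated to the census data.

```python
def roc_auc(y_true, scores):
    """Share of positive-negative pairs in which the positive case scores higher."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))


# Hypothetical toy example: one positive case is ranked below one negative case.
auc_partial = roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
# A model that assigns every case the same score cannot rank at all.
auc_chance = roc_auc([1, 1, 0, 0], [0.3, 0.3, 0.3, 0.3])
print(auc_partial, auc_chance)
```

Read this way, a ROC AUC near 0.8 says that a randomly chosen interested respondent receives a higher score than a randomly chosen non-interested respondent in roughly four out of five comparisons. It is a statement about ranking, not about the accuracy of any individual prediction.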

4.4 Calibration and Bias Diagnostics

Discrimination metrics show whether the model ranks cases effectively, but planning also requires attention to calibration and subgroup reliability. Table 10 reports decile-level calibration of the modeled interest scores. The table compares the mean predicted score with the observed expressed-interest rate within score bands.

Table 10: Calibration of modeled doctoral-interest scores by decile.

Score band	N	Mean predicted	Observed rate	Calibration gap
[0.078,0.152]	42	0.125	0.000	0.125
(0.152,0.2]	42	0.174	0.024	0.150
(0.2,0.251]	42	0.220	0.000	0.220
(0.251,0.375]	42	0.320	0.000	0.320
(0.375,0.452]	42	0.408	0.048	0.360
(0.452,0.524]	42	0.485	0.333	0.152
(0.524,0.584]	42	0.554	0.548	0.007
(0.584,0.642]	42	0.613	0.452	0.161
(0.642,0.705]	42	0.669	0.810	-0.140
(0.705,0.785]	42	0.737	1.000	-0.263

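Table 10 should be interpreted as a planning diagnostic rather than as proof of deployable probability accuracy. The gaps are substantial: the model overpredicts in the lower and middle score bands (for example, a mean predicted score of 0.408 against an observed rate of 0.048) and underpredicts in the two highest bands (0.737 against 1.000 in the top band). Modeled scores should therefore be used mainly for ranking and segmentation, not as literal probabilities or for precise enrollment forecasting.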
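A banded calibration table of this form can be produced with a routine like the one below. It is a sketch that assumes equal-count bands ordered by score and at least one record per band; the function name and the four toy records are hypothetical, and the gap is defined as mean predicted score minus observed rate, matching the sign convention in Table 10.

```python
def calibration_by_band(y_true, scores, n_bands=10):
    """Compare mean predicted score with observed rate in equal-count score bands."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    rows = []
    for b in range(n_bands):
        # Integer arithmetic spreads any remainder across the bands.
        start = b * len(order) // n_bands
        stop = (b + 1) * len(order) // n_bands
        idx = order[start:stop]
        mean_predicted = sum(scores[i] for i in idx) / len(idx)
        observed_rate = sum(y_true[i] for i in idx) / len(idx)
        rows.append(
            {
                "n": len(idx),
                "mean_predicted": mean_predicted,
                "observed_rate": observed_rate,
                "gap": mean_predicted - observed_rate,
            }
        )
    return rows


# Hypothetical toy data: four records split into two bands.
bands = calibration_by_band([0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9], n_bands=2)
for row in bands:
    print(row)
```

A positive gap means the scores overstate interest in that band, and a negative gap means they understate it.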

Bias diagnostics examine whether model performance varies across major subgroups. Table 11 summarizes, for selected grouping variables, subgroup sample size, observed expressed-interest rate, mean modeled score, calibration gap, high-score rate, precision, recall, false-positive rate, and false-negative rate. These metrics are not a complete fairness audit, but they identify where model interpretation may require additional caution.

Table 11: Subgroup performance diagnostics for modeled doctoral-interest scores.

Variable	Group	N	Observed rate	Mean score	Calibration gap	High-score rate	Precision	Recall	False-positive rate	False-negative rate
Gender	Female	317	0.312	0.428	0.115	0.432	0.672	0.929	0.206	0.071
Gender	Male	78	0.385	0.453	0.068	0.462	0.722	0.867	0.208	0.133
Gender	With diverse SOGIE	18	0.278	0.411	0.134	0.389	0.714	1.000	0.154	0.000
Institution sector	Government	196	0.388	0.484	0.096	0.500	0.745	0.961	0.208	0.039
Institution sector	Private	192	0.266	0.384	0.118	0.370	0.620	0.863	0.191	0.137
Institution sector	NGO	31	0.258	0.394	0.136	0.419	0.538	0.875	0.261	0.125
Island group	NCR	165	0.279	0.398	0.120	0.370	0.672	0.891	0.168	0.109
Island group	Luzon	138	0.370	0.455	0.086	0.493	0.706	0.941	0.230	0.059
Island group	Mindanao	54	0.426	0.498	0.072	0.574	0.677	0.913	0.323	0.087
Island group	Visayas	41	0.293	0.429	0.136	0.366	0.733	0.917	0.138	0.083
Island group	Missing	22	0.136	0.355	0.219	0.318	0.429	1.000	0.211	0.000
Library type	Academic	197	0.406	0.487	0.081	0.563	0.676	0.938	0.308	0.062
Library type	School	89	0.213	0.339	0.125	0.258	0.696	0.842	0.100	0.158
Library type	Missing	77	0.325	0.430	0.106	0.416	0.719	0.920	0.173	0.080
Library type	Special	41	0.171	0.345	0.174	0.195	0.750	0.857	0.059	0.143
Library type	Public	16	0.250	0.473	0.223	0.500	0.500	1.000	0.333	0.000

Table 11 is included to prevent a common overclaim: a model with acceptable overall performance may still be less reliable for specific groups. The table reports subgroup precision, recall, false-positive rate, false-negative rate, high-score classification rate, and calibration gap as descriptive fairness diagnostics. It therefore supports the article’s non-deployment boundary. The model can inform institutional planning, but any future operational use would require more formal fairness evaluation, stakeholder review, and external validation.

4.5 Model Interpretation and Feature Importance

Model interpretation focuses on global feature importance rather than individual-level prediction. Table 12 lists the top-ranked predictors from the random forest model, while Figure 7 visualizes the same ranking. These outputs are used to identify the broad dimensions most relevant to planning, not to assign deterministic importance to any one person’s profile.

Table 12: Top model feature-importance rankings.

Variable	Importance	Relative importance
Highest/current education	0.274	0.274
Currently enrolled	0.119	0.119
Age	0.066	0.066
Years in service	0.056	0.056
Gross salary bracket/value	0.051	0.051
Library type	0.041	0.041
Net salary bracket/value	0.033	0.033
theses	0.030	0.030
Island group	0.027	0.027
CPD sufficiency	0.024	0.024
Willing to travel for work/study	0.024	0.024
Position level	0.023	0.023

Table 12 shows that the strongest signals are not isolated demographic traits but a cluster of educational, career-stage, institutional, and professional-development indicators. This pattern supports the article’s persona strategy: expressed-interest profiles are best understood as configurations of professional capital and context rather than as single-variable categories.

Figure 7: Random forest feature-importance ranking.

The visual ranking in Figure 7 confirms the same interpretation in a more scannable form. The steepness of the ranking also suggests that a small number of planning dimensions carry much of the model’s global signal, while lower-ranked features should be interpreted more cautiously.

4.6 Market Segments

Segmentation translates model-relevant patterns into strategic audience groups. Table 13 reports the four market segments, their sizes, observed expressed-interest counts, rates, and dominant profile markers. Figure 8 provides a visual summary of these segment differences.

Table 13: Market segments for PhD in LIS recruitment planning.

Segment	N	PhD-intending N	PhD-interest rate	Median age	Median years service	Top education	Top sector	Top library type	Top position	Top island group	Currently enrolled share	Travel willing share
Advanced-degree growth seekers	19	10	0.526	37	14.0	MLIS	Private	Academic	Supervisory	NCR	0.474	0.526
Mid-career government career consolidators	103	46	0.447	43	17.0	MLIS	Government	Academic	Supervisory	Luzon	0.214	0.323
Younger government academic aspirants	140	59	0.421	30	6.5	BLIS	Government	Academic	Senior level	NCR	0.679	0.415
Early-career credential builders	158	20	0.127	24	2.0	BLIS	Private	Academic	Junior level	NCR	0.255	0.327

The segments in Table 13 show a strategic tension common in doctoral pipeline planning: the smallest segment has the highest observed expressed-interest rate, while larger segments may represent broader but more developmentally varied recruitment audiences. This finding argues for a portfolio strategy rather than a single recruitment campaign.
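The tension between segment rate and segment size can be made explicit with simple arithmetic on the counts in Table 13. The sketch below recomputes each segment's share of the sample, its expressed-interest rate, and its share of all interested respondents; the variable names are illustrative.

```python
# (N, PhD-intending N) per segment, copied from Table 13.
segments = {
    "Advanced-degree growth seekers": (19, 10),
    "Mid-career government career consolidators": (103, 46),
    "Younger government academic aspirants": (140, 59),
    "Early-career credential builders": (158, 20),
}

total_n = sum(n for n, _ in segments.values())
total_interested = sum(k for _, k in segments.values())

summary = {}
for name, (n, k) in segments.items():
    summary[name] = {
        "share_of_sample": round(n / total_n, 3),
        "interest_rate": round(k / n, 3),
        "share_of_interested": round(k / total_interested, 3),
    }
    print(name, summary[name])
```

The smallest segment has the highest rate (0.526) but accounts for only about 7% of all respondents who expressed interest, whereas the two government-sector segments together account for more than three-quarters of them. That asymmetry is the quantitative basis for a portfolio strategy.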

Figure 8: Observed doctoral aspiration and profile differences by market segment.

Figure 8 reinforces the segment-level interpretation. The visual separation among segments suggests that recruitment audiences differ not only in estimated interest but also in professional profile, implying different messages, supports, and program pathways.

Because segment interpretation is vulnerable to overstatement, Table 14 reports descriptive separation and stability diagnostics using the scored sample: segment size, share of the analytic sample, mean modeled score, within-segment score dispersion, mean silhouette, and bootstrap agreement. These values do not replace formal entropy or latent-class diagnostics, but they make visible whether segments are sharply or weakly separated.

Table 14: Descriptive segment separation and uncertainty diagnostics.

Segment	N	Share	Mean score	Score SD	Mean silhouette	Bootstrap agreement
Mid-career government career consolidators	103	0.245	0.568	0.164	0.032	0.859
Advanced-degree growth seekers	19	0.045	0.561	0.131	0.064	0.923
Younger government academic aspirants	140	0.333	0.506	0.162	0.031	0.788
Early-career credential builders	158	0.376	0.259	0.146	0.083	0.885

Table 14 reinforces the need for cautious persona translation. The mean silhouette and bootstrap agreement columns are descriptive stability indicators, not definitive validation. Mean silhouette values are close to zero for every segment (0.031 to 0.083), which indicates substantial overlap between segments, and bootstrap agreement ranges from 0.788 to 0.923. The three higher-interest segments also have similar mean modeled scores (0.506 to 0.568). The segments should therefore be treated as planning strata rather than as sharply bounded populations. Formal latent profile or latent class validation remains a priority for future research.

4.7 Recruitment Personas

The final result is the translation of segments into personas. Table 15 presents the persona labels, associated segments, approximate sizes, observed expressed-interest rates, profile markers, and recruitment implications. These personas are communication abstractions for institutional planning, not psychological profiles or stable professional identities.

Table 15: Recruitment personas derived from prediction and segmentation results.

Persona	Associated segment	Approximate size	Observed PhD-interest rate	Profile markers	Recruitment implication
The advanced academic professional	Advanced-degree growth seekers	19	0.526	MLIS; Private; Academic; Supervisory; median age 37	Emphasize research mentoring, publication pathways, flexible dissertation supervision, and recognition of prior graduate work.
The public-sector advancement candidate	Mid-career government career consolidators	103	0.447	MLIS; Government; Academic; Supervisory; median age 43	Frame the PhD around leadership, policy, evidence-based service improvement, and government-compatible scheduling.
The emerging academic-service leader	Younger government academic aspirants	140	0.421	BLIS; Government; Academic; Senior level; median age 30	Position the PhD as a pathway from professional service to research leadership, with clear MLIS-to-PhD advising and scholarship guidance.
The early-career credential builder	Early-career credential builders	158	0.127	BLIS; Private; Academic; Junior level; median age 24	Offer bridge advising from BLIS/MLIS, staged milestones, peer cohorts, and funding information that clarifies the path to doctoral readiness.

Table 15 identifies four bounded recruitment abstractions. The advanced academic professional profile appears closest to immediate doctoral recruitment because of advanced preparation and high observed interest. The public-sector advancement candidate and emerging academic-service leader profiles represent larger strategic audiences whose expressed interest may be tied to leadership, institutional contribution, and career mobility. The early-career credential builder profile has lower immediate expressed interest but remains important for pipeline development, bridge advising, and long-term program sustainability. Each persona represents a statistical tendency rather than an individual prediction.

Personas developed in this study are probabilistic communication abstractions derived from aggregated workforce patterns and should not be interpreted as deterministic representations of individual librarians. Because segment boundaries overlap and the source evidence is cross-sectional, persona labels should remain provisional, revisable, and subject to stakeholder validation before any operational use.

Taken together, the results support the use of machine learning as a planning instrument for PhD in LIS development. The models identify meaningful signal, the feature rankings clarify the broad dimensions of that signal, the segments organize the professional landscape, and the personas translate the analysis into differentiated recruitment and program-design strategies without claiming to predict actual enrollment.

5 Discussion

5.1 Principal Findings

The findings show that expressed doctoral study interest among Philippine librarians is patterned enough to support strategic classification, segmentation, and persona development. The models do not merely reproduce a random distribution of interest. They identify a coherent structure in which educational preparation, career stage, institutional context, professional development, and mobility-related conditions combine to distinguish different professional profiles. This is the central methodological contribution of the study: machine learning can help convert national workforce evidence into a practical planning language for doctoral LIS education without claiming to forecast actual enrollment.

The results also show that the most strategically important audiences are not identical in size, expressed-interest rate, or recruitment logic. The smallest segment has the highest observed expressed-interest rate, while larger segments contain more varied profiles. This distinction matters for program planning. A proposed PhD in LIS should not rely only on the most immediately reachable candidates, nor should it assume that all interested professionals need the same recruitment message. The evidence instead points toward a portfolio strategy that includes immediate doctoral recruitment, mid-career advancement pathways, emerging-leader cultivation, and longer-term pipeline development.

5.2 Structural Interpretation of Findings

The results are best interpreted structurally rather than psychologically. The model does not reveal why any individual librarian wants or does not want doctoral education. It reveals how expressed interest is distributed across a professional field. This field is shaped by accumulated capital, access to educational opportunity, institutional expectations, and professional mobility pathways (Abbott, 1988; Bourdieu, 1986; Perna, 2006).

From this perspective, higher modeled interest should not be read as intrinsic motivation alone. It may reflect greater access to graduate education, stronger professional networks, clearer research identity, institutional incentives, or more visible returns to doctoral study. Lower modeled interest may reflect lower feasibility, weaker access, uncertain costs, limited mentoring, or weaker exposure to research careers. The implication is that doctoral pipeline planning should address structural conditions, not merely market to individuals.

5.3 Workforce Inequality and Doctoral Pathways

The findings suggest that professional opportunity structures shape the visibility of doctoral interest. Geographic location, institutional sector, and workplace context may influence whether doctoral education appears useful, accessible, and institutionally supported. These are not merely background variables. They are conditions through which professional capital becomes available or constrained.

This interpretation shifts the institutional question from “Who is most interested?” to “Where are doctoral pathways already visible, and where must they be made more feasible?” A doctoral program that ignores geographic and institutional asymmetry may recruit efficiently in the short term while reproducing long-standing inequalities in graduate access. A program that treats segmentation as a diagnostic of unequal opportunity can instead design distributed advising, scholarship pathways, remote participation options, and research mentorship pipelines.

5.4 Implications for PhD in LIS Program Design

The persona structure suggests that doctoral program design should be differentiated from the beginning. For advanced academic professionals, the program should foreground research supervision, publication pathways, methodological training, and opportunities to convert professional expertise into scholarly contribution. This audience may ask whether the program has sufficient academic depth, supervisory capacity, research culture, and institutional prestige to justify the opportunity cost of doctoral study.

For public-sector advancement candidates, the program should connect doctoral study to leadership, policy, institutional improvement, and evidence-based service development. The recruitment message should not present the PhD as an abstract academic credential alone. It should clarify how doctoral training can support public knowledge institutions, government service, organizational decision-making, and national LIS capacity.

For emerging academic-service leaders, the key design challenge is pathway clarity. This group may be professionally motivated but may need stronger advising on the transition from professional practice to research formation. A PhD program can serve this audience by offering structured research-preparation modules, mentoring, writing support, scholarship information, and clear expectations about the relationship between prior graduate study and doctoral preparation.

For early-career credential builders, the immediate strategic task is not aggressive PhD recruitment. It is pipeline development. This group may benefit more from bridge advising, MLIS-to-PhD roadmaps, research exposure, peer cohorts, and staged milestones. Treating this audience as a developmental pipeline rather than an immediate enrollment market can help the institution build long-term capacity without overstating current expressed interest.

5.5 Implications for Recruitment and Pipeline Strategy

The results argue against a single broad recruitment campaign. A generic message about the availability of a PhD in LIS would likely underperform because it would ignore the different motivations, constraints, and readiness profiles identified in the analysis. Instead, recruitment should be segmented by persona.

Immediate recruitment can prioritize high-readiness professionals with advanced preparation and strong alignment with academic or leadership trajectories. Mid-career recruitment can emphasize flexible scheduling, workplace relevance, policy contribution, and institutional leadership. Emerging-leader recruitment can stress mentoring, research identity formation, and scholarship pathways. Pipeline recruitment can focus on long-term advising and preparation rather than immediate application.

This approach also changes how recruitment success should be measured. A strategic PhD recruitment plan should track not only applications and enrollments, but also advising contacts, bridge-program participation, research-preparation engagement, scholarship inquiries, and movement from early interest to application preparation. In other words, the personas suggest a doctoral pipeline with multiple developmental stages rather than a simple conversion model.
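As a minimal sketch of the multi-stage tracking described above, the following Python fragment computes, for each persona, the share of tracked individuals recorded at each touchpoint. The stage labels, persona labels, and record layout are hypothetical illustrations; they are not drawn from the census or from any existing recruitment system.

```python
from collections import defaultdict

# Hypothetical touchpoints, loosely following the measures named in the text.
STAGES = (
    "advising_contact",
    "bridge_program",
    "research_preparation",
    "scholarship_inquiry",
    "application",
    "enrollment",
)


def stage_shares(records):
    """Return, per persona, the share of tracked people recorded at each stage.

    `records` is an iterable of (persona, stages) pairs, where `stages` lists
    the touchpoints recorded for one person. No linear ordering is assumed:
    a person may apply without ever joining a bridge program.
    """
    totals = defaultdict(int)
    reached = defaultdict(lambda: defaultdict(int))
    for persona, stages in records:
        totals[persona] += 1
        for stage in set(stages):  # count each stage once per person
            if stage in STAGES:
                reached[persona][stage] += 1
    return {
        persona: {stage: reached[persona][stage] / totals[persona] for stage in STAGES}
        for persona in totals
    }


# Illustrative records only; these are not study data.
example_records = [
    ("early_career", ["advising_contact", "bridge_program"]),
    ("early_career", ["advising_contact"]),
    ("advanced_academic", ["advising_contact", "application"]),
]
example_shares = stage_shares(example_records)
```

Reporting shares by persona rather than by person keeps the output at the level of planning segments, which is consistent with the developmental reading of the pipeline proposed here.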

5.6 Interpretable ML for Institutional Planning

A major interpretive boundary of this study is that the model should not be used to rank individuals for opportunity. Its proper role is decision support for institutional planning. This distinction is consistent with the broader caution in predictive analytics: models can support planning when they are transparent about their target, validation, and limitations, but they become problematic when treated as deterministic judgments about individual futures (G. S. Collins et al., 2015; Sghir et al., 2022). It is also consistent with arguments for interpretable models in high-stakes decision contexts and with ethical warnings that learning analytics can reshape institutional responsibility if students or professionals become objects of surveillance rather than agents (Rudin, 2019; Slade & Prinsloo, 2013).

In this study, machine learning is valuable because it reveals structure in the professional landscape. It helps answer questions such as: where is expressed doctoral study interest concentrated, what kinds of professional profiles are associated with that interest, and what support strategies might fit different audiences? These are planning questions, not eligibility questions. The ethical use of the model is therefore expansive: it should help the institution design more inclusive and responsive pathways, not restrict outreach to those with the highest modeled likelihood.

This distinction is particularly important for LIS education because doctoral participation is not a fixed individual trait. It is shaped by institutional support, mentoring, funding, workload, family responsibilities, geographic access, professional recognition, and prior exposure to research. A person with lower modeled likelihood may become an excellent doctoral student if provided with the right pathway. Conversely, a person with high modeled likelihood may face barriers not visible in the data. The model should be read as a map of strategic conditions, not as a verdict on individual capacity.

5.7 Equity and Access Considerations

The persona findings have an equity dimension. If recruitment focuses only on the most immediately reachable candidates, the program may reproduce existing inequalities in access to graduate preparation, research mentoring, and professional advancement. A doctoral program that aims to strengthen the national LIS workforce must therefore distinguish between observed expressed interest and potential for future participation. Observed interest identifies where current signals are strongest; potential identifies where institutional support could widen access.

This is where the early-career and emerging-leader personas become strategically important. Their lower or more varied modeled interest should not be interpreted as low value. Rather, these groups point to the need for bridge structures: research boot camps, writing clinics, faculty mentorship, scholarship advising, cohort-based preparation, and flexible study planning. Such supports can make doctoral education more accessible while also expanding the future applicant pool.

The caution from data-driven persona research is relevant here. Personas can help institutions empathize with audience segments, but they can also flatten complex lives into overly neat profiles if used carelessly (Salminen et al., 2021). The ethical response is to keep persona labels provisional, evidence-linked, and revisable. Personas should guide program design conversations, not replace direct engagement with prospective students.

5.8 Ethical Limits of Predictive Segmentation

Predictive segmentation should remain a planning instrument with explicit governance boundaries. It should not become a hidden sorting mechanism that allocates attention only to those already advantaged by prior education, institutional support, or geographic proximity. Fairness diagnostics, calibration checks, and stakeholder review are therefore not technical embellishments. They are safeguards against turning structural inequality into apparently neutral model output.
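One simple form that such a safeguard could take is a subgroup check of calibration-in-the-large, which compares the mean predicted probability with the observed rate of expressed interest within each subgroup. The sketch below is illustrative only: the function name, the subgroup labels, and the toy values are assumptions, and it does not reproduce the diagnostics actually run in this study.

```python
from collections import defaultdict


def subgroup_calibration(groups, predicted, observed):
    """Per subgroup: n, mean predicted probability, observed rate, and their gap.

    `groups` holds a subgroup label per respondent (for example region or
    sector), `predicted` the modeled probabilities, and `observed` the 0/1
    expressed-interest outcomes. A positive gap means the model over-predicts
    interest for that subgroup; a negative gap means it under-predicts.
    """
    if not (len(groups) == len(predicted) == len(observed)):
        raise ValueError("groups, predicted, and observed must have equal length")

    sums = defaultdict(lambda: [0, 0.0, 0])  # n, sum of predictions, sum of outcomes
    for group, p, y in zip(groups, predicted, observed):
        acc = sums[group]
        acc[0] += 1
        acc[1] += p
        acc[2] += y

    report = {}
    for group, (n, p_sum, y_sum) in sums.items():
        mean_predicted = p_sum / n
        observed_rate = y_sum / n
        report[group] = {
            "n": n,
            "mean_predicted": mean_predicted,
            "observed_rate": observed_rate,
            "gap": mean_predicted - observed_rate,
        }
    return report


# Toy values for illustration; these are not census results.
example_report = subgroup_calibration(
    groups=["A", "A", "B", "B"],
    predicted=[0.5, 0.5, 0.75, 0.25],
    observed=[1, 0, 1, 1],
)
```

In the toy data, subgroup A is well calibrated on average, while subgroup B is under-predicted by 0.5. A gap of that kind would indicate that modeled interest understates expressed interest for the subgroup, which is exactly the situation in which outreach guided by model output alone could overlook a group.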

The persona layer has the same ethical limit. Personas are useful only when treated as bounded planning abstractions. They should support questions such as which advising pathways, scholarship messages, or research-preparation opportunities might be needed. They should not be used to infer individual motivation, merit, or likelihood of success.

5.9 Practical Governance Implications

If institutions use predictive segmentation for doctoral pipeline planning, governance should be explicit before deployment. A responsible governance process should document the model purpose, prohibit individual ranking, disclose the limitations of expressed-interest data, review subgroup performance, and specify who may access outputs. It should also require periodic recalibration, stakeholder review of persona language, and a process for retiring or revising personas that become misleading.
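The governance conditions listed above could be recorded in a simple structured checklist that is reviewed before any output is released. The following sketch is a hypothetical illustration: the class name, field names, and the rule that every item must be satisfied are assumptions made for the example, not a prescribed institutional policy.

```python
from dataclasses import dataclass, field


@dataclass
class GovernanceRecord:
    """Hypothetical pre-deployment checklist mirroring the conditions in the text."""

    model_purpose: str = ""
    individual_ranking_prohibited: bool = False
    limitations_disclosed: bool = False
    subgroup_performance_reviewed: bool = False
    authorized_roles: list = field(default_factory=list)
    recalibration_scheduled: bool = False
    persona_language_reviewed: bool = False
    persona_retirement_process: bool = False


def unmet_conditions(record):
    """Return the names of checklist items that are empty or False."""
    return [name for name, value in vars(record).items() if not value]


# An empty record fails every item; a completed record returns an empty list.
empty_record = GovernanceRecord()
completed_record = GovernanceRecord(
    model_purpose="Segment-level doctoral pipeline planning",
    individual_ranking_prohibited=True,
    limitations_disclosed=True,
    subgroup_performance_reviewed=True,
    authorized_roles=["program planning committee"],
    recalibration_scheduled=True,
    persona_language_reviewed=True,
    persona_retirement_process=True,
)
```

A check of this kind does not make a deployment responsible by itself; it only makes visible which of the stated conditions have not yet been documented.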

The practical implication is that analytics should expand institutional responsibility rather than automate it. A school using this framework should ask where additional advising, funding, remote access, or research preparation is needed, not which individuals deserve attention. Transparency, fairness monitoring, and non-exclusionary use are therefore core conditions for responsible adoption.

5.10 Contributions to LIS Workforce Research

This study contributes to LIS education research by showing how workforce evidence can be translated into a reproducible doctoral-planning framework. The literature has long positioned LIS education as a bridge between universities and the profession (Birdi, 2022), and has emphasized the need for educational programs to respond to workforce needs and skills gaps (Katuli-Munyoro & Mutula, 2017). This article extends that discussion by demonstrating how machine learning can operationalize workforce responsiveness without reducing education planning to crude demand estimation or individualized targeting.

The article also contributes methodologically. It integrates prediction, segmentation, and persona synthesis in a single workflow. Predictive performance establishes whether the data contain usable signal. Feature importance clarifies which broad dimensions structure the signal. Segmentation organizes the professional population into strategic groups. Personas translate those groups into program-design and recruitment implications. This sequence provides a replicable template for other LIS schools considering new graduate programs or evaluating advanced-degree interest.

The theoretical contribution is to connect doctoral pipeline planning to professional capital and structural access. The study demonstrates that expressed interest in doctoral LIS education can be analyzed as a patterned workforce phenomenon rather than as a simple individual preference or marketing target. This reframing moves the contribution beyond institutional recruitment analytics and toward a theory-informed account of how professional opportunity structures become visible through predictive segmentation.

Finally, the study contributes to the responsible use of analytics in LIS. Rather than presenting machine learning as a neutral oracle, the article frames it as an interpretive tool embedded in institutional decision-making. The value of the model depends not only on performance metrics, but also on whether its outputs are understandable, calibrated, ethically bounded, and useful for expanding educational opportunity.

5.11 Limitations

Several limitations should guide interpretation. First, the outcome represents expressed doctoral study interest, not application, admission, enrollment, persistence, or completion. Expressed interest is temporally unstable and may shift with tuition, delivery mode, scholarship availability, family responsibilities, workload, or institutional encouragement. The findings therefore support interest exploration and recruitment planning, but they do not forecast actual cohort size. Second, the analysis relies on secondary survey data collected before the proposed program existed as a concrete offering, so respondents could not evaluate specific tuition levels, delivery modes, faculty expertise, scholarship packages, or admission requirements.

Third, the outcome is self-reported and may be affected by social desirability, institutional signaling, misunderstanding of doctoral requirements, or aspiration suppression under structural constraint. Fourth, the modeling workflow depends on the quality and completeness of the available survey evidence. Missingness, noisy responses, unobserved socioeconomic constraints, and uneven representation across regions or institutional groups can affect both model performance and persona detail. Fifth, feature importance should be interpreted globally and cautiously. It indicates which predictors are useful to the fitted model, not which factors causally produce doctoral interest.
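The global reading of feature importance can be illustrated with permutation importance, one common model-agnostic measure: a predictor's values are shuffled across respondents and the resulting drop in predictive accuracy is recorded. The sketch below is offered under stated assumptions. The article does not specify that this particular measure was used, and the toy rule, the data, and the function names are hypothetical.

```python
import random


def accuracy(model, rows, labels):
    """Share of rows for which the model's prediction equals the label."""
    return sum(model(row) == label for row, label in zip(rows, labels)) / len(rows)


def permutation_importance(model, rows, labels, n_repeats=20, seed=0):
    """Mean drop in accuracy when each feature column is shuffled in turn.

    `rows` is a list of equal-length tuples. The result describes how much the
    fitted model relies on each column; it says nothing about causation.
    """
    rng = random.Random(seed)
    baseline = accuracy(model, rows, labels)
    importances = []
    for j in range(len(rows[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in rows]
            rng.shuffle(column)  # break the link between column j and the labels
            permuted = [row[:j] + (value,) + row[j + 1:] for row, value in zip(rows, column)]
            drops.append(baseline - accuracy(model, permuted, labels))
        importances.append(sum(drops) / n_repeats)
    return importances


def toy_model(row):
    """Toy rule that uses only the first feature and ignores the second."""
    return int(row[0] >= 1)


# Toy rows: (feature used by the rule, feature ignored by the rule).
toy_rows = [(0, 3), (1, 1), (0, 2), (1, 0), (0, 1), (1, 3), (0, 0), (1, 2)]
toy_labels = [toy_model(row) for row in toy_rows]  # labels follow the rule exactly
toy_importances = permutation_importance(toy_model, toy_rows, toy_labels)
```

Because the toy rule ignores the second feature, shuffling it never changes a prediction and its importance is exactly zero, whereas shuffling the first feature reduces accuracy. The example shows why importance describes what the fitted model uses: a predictor the model ignores could still matter substantively for doctoral interest, and a predictor the model relies on may simply stand in for an unobserved condition.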

Sixth, the segmentation and persona labels are interpretive. They are intended to support strategy, not to describe fixed identities. Seventh, the present results are context-bound to the Philippine LIS workforce and should not be generalized to other national systems without replication. Future work should validate the personas through interviews, focus groups, prospective-student consultations, or pilot recruitment activities. Such validation would help determine whether the model-derived profiles resonate with librarians’ own accounts of their motivations, constraints, and doctoral ambitions.

5.12 Future Research and Validation

The next research step is longitudinal validation. Once a PhD in LIS program is offered, future studies should track whether expressed interest, modeled probability, segment membership, advising participation, and bridge-program engagement predict application, admission, enrollment, persistence, and completion. This would allow the framework to move from cross-sectional interest classification toward validated doctoral pipeline analysis.

A second priority is qualitative triangulation. Interviews, focus groups, and participatory persona review with librarians from different regions and institutional contexts would test whether the personas are recognizable, useful, and non-stereotyping. A third priority is methodological replication using probabilistic segmentation, calibration refinement, and more formal fairness assessment. Such work would strengthen both the scholarly contribution and the ethical governance of predictive segmentation in LIS education.

5.13 Strategic Takeaway

The central takeaway is that a recruitment strategy for a PhD in LIS should be evidence-informed, segmented, and developmental. The strongest current expressions of interest may be concentrated among advanced and institutionally positioned professionals, but the long-term success of the program depends on a broader ecosystem of advising, bridge preparation, financial support, and research identity formation. Machine learning helps reveal where those strategies might be directed; it does not replace academic judgment, ethical recruitment, or direct engagement with the profession.

References

Abbiati, M., Nendaz, M., & Cerutti, B. (2024). Exploring medical career choice to better inform Swiss physician workforce planning: Protocol for a national cohort study. JMIR Research Protocols. https://doi.org/10.2196/53138

Abbott, A. (1988). The system of professions: An essay on the division of expert labor. University of Chicago Press. https://doi.org/10.7208/chicago/9780226189666.001.0001

Almahri, F. A. A. J., Bell, D., & Arzoky, M. (2019). Personas design for conversational systems in education. Informatics, 6(4), 46. https://doi.org/10.3390/informatics6040046

Barocas, S., Hardt, M., & Narayanan, A. (2023). Fairness and machine learning: Limitations and opportunities. MIT Press.

Birdi, B. (2022). The contribution of library and information science education to decolonising. In Narrative expansions: Interpreting decolonisation in academic libraries (pp. 91–104). Facet. https://doi.org/10.29085/9781783304998.009

Bourdieu, P. (1986). The forms of capital. In J. G. Richardson (Ed.), Handbook of theory and research for the sociology of education (pp. 241–258). Greenwood.

Collins, G. S., Reitsma, J. B., & Altman, D. G. (2015). Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD): The TRIPOD statement. Journal of Clinical Epidemiology, 68(2), 112–121. https://doi.org/10.1016/j.jclinepi.2014.11.010

Collins, L. M., & Lanza, S. T. (2009). Latent class and latent transition analysis. Wiley. https://doi.org/10.1002/9780470567333

Dorado, D. A. D. (2024). Exploring the landscape of librarianship in the Philippines: Establishing the profession’s population parameter estimates. Journal of Librarianship and Information Science, 57(3), 733–745. https://doi.org/10.1177/09610006241240485

Eccles, J. S., & Wigfield, A. (2002). Motivational beliefs, values, and goals. Annual Review of Psychology, 53(1), 109–132. https://doi.org/10.1146/annurev.psych.53.100901.135153

Freidson, E. (2001). Professionalism: The third logic. University of Chicago Press.

Hernandez-Blanco, A., Herrera-Flores, B., & Tomas, D. (2019). A systematic review of deep learning approaches to educational data mining. Complexity, 2019(1). https://doi.org/10.1155/2019/1306039

Katuli-Munyoro, P., & Mutula, S. M. (2017). Redefining library and information science education and training in Zimbabwe to close the workforce skills gaps. Journal of Librarianship and Information Science, 51(4), 915–926. https://doi.org/10.1177/0961000617748472

Knox, J. (2017). Data power in education: Exploring critical awareness with the learning analytics report card. Television & New Media, 18(8), 734–752. https://doi.org/10.1177/1527476417690029

Macgilchrist, F. (2021). What is ‘critical’ in critical studies of edtech? Three responses. Learning, Media and Technology, 46(3), 243–249. https://doi.org/10.1080/17439884.2021.1958843

Obille, K. L. B., & Dorado, D. A. D. (2022). Philippine librarians census [dataset]. Zenodo. https://doi.org/10.5281/zenodo.6864788

Panigrahi, P. (2010). Library and information science education in East and North-East India: Retrospect and prospects. DESIDOC Journal of Library & Information Technology, 30(5), 32–47. https://doi.org/10.14429/djlit.30.613

Park, D.-H., & Kang, J. (2022). Constructing data-driven personas through an analysis of mobile application store data. Applied Sciences, 12(6), 2869. https://doi.org/10.3390/app12062869

Perna, L. W. (2006). Studying college access and choice: A proposed conceptual model. In Higher education: Handbook of theory and research (pp. 99–157). Kluwer Academic Publishers. https://doi.org/10.1007/1-4020-4512-3_3

Rudin, C. (2019). Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nature Machine Intelligence, 1(5), 206–215. https://doi.org/10.1038/s42256-019-0048-x

Salminen, J., Jung, S., & Jansen, B. J. (2021). Are data-driven personas considered harmful? Persona Studies, 7(1), 48–63. https://doi.org/10.21153/psj2021vol7no1art1236

Sghir, N., Adadi, A., & Lahmer, M. (2022). Recent advances in predictive learning analytics: A decade systematic review (2012–2022). Education and Information Technologies, 28(7), 8299–8333. https://doi.org/10.1007/s10639-022-11536-0

Slade, S., & Prinsloo, P. (2013). Learning analytics: Ethical issues and dilemmas. American Behavioral Scientist, 57(10), 1510–1529. https://doi.org/10.1177/0002764213479366