Predicting Doctoral Study Aspirations Among Librarians

A Machine Learning Analysis

Author: Dan Anthony Dorado

Affiliation: School of Library and Information Studies, University of the Philippines

Published: July 23, 2025

1 Introduction

1.1 Doctoral Aspirations in LIS

Understanding what motivates librarians to pursue doctoral study in Library and Information Science (LIS) sheds light on broader trends in education, research, and professional development. Exploring these aspirations matters for several reasons, chiefly their implications for the future of the profession, the creation of knowledge, and diversity within academic programs.

First, LIS doctoral programs contribute significantly to the production of researchers and educators, thereby impacting the next generation of master’s students in the field. As highlighted by Sugimoto et al. [1], the focus on doctoral education encompasses not only the development of scholars but also the formation of faculty members who will guide future practitioners in LIS. The topics explored in doctoral dissertations can influence research priorities and educational methodologies, fostering a dynamic educational environment. Additionally, studies have identified key challenges in doctoral education, such as lack of financial support, limited exposure to research opportunities, and inadequate mentoring, all of which must be addressed to nurture aspiring LIS scholars [2].

Further extending this discussion, Wang et al. [3] highlight trends in LIS doctoral dissertations in China, emphasizing the importance of understanding how research topics in this field intersect with broader academic domains. Such insights are essential as they provide a clearer picture of the current landscape and future direction of LIS education and research. Similar bibliometric studies indicate evolving trends in dissertation topics and the increasing interdisciplinary nature of LIS research [4].

Moreover, the motivations of doctoral students themselves greatly influence their academic pathways. Research by Hands [5] indicates that intrinsic motivations, along with identified regulation (a form of extrinsic yet autonomous motivation), largely drive doctoral students in LIS. This intrinsic motivation is vital for fostering resilience and persistence in doctoral study, which are particularly necessary given the demanding context of academic research.

In light of the growing need for diversity within LIS programs, research has also pointed to the experiences of underrepresented groups, particularly Black doctoral students, emphasizing the necessity for inclusive practices and targeted mentorship to enhance their academic journey and retention [6]. Studies focused on peer relationships and mentorship in doctoral programs reinforce the importance of social support networks, which can profoundly impact students’ experiences and professional aspirations [7].

The trends outlined above reflect a critical understanding of LIS doctoral aspirations, emphasizing the intertwined nature of education, research, and professional expectations. By examining these factors, stakeholders can better align doctoral programs with the needs of the profession and ensure that they actively contribute to a rich, inclusive academic community.

1.2 Research Questions

Despite the growing need for advanced research skills and scholarly leadership in the library and information science (LIS) profession, little is known about what drives librarians to pursue doctoral studies or hinders them from doing so, particularly in contexts outside North America and Europe. In the Philippines, where librarianship is rapidly evolving amid new demands for digital transformation, understanding the pathways to doctoral education is essential for promoting innovation, capacity-building, and equitable access to professional advancement.

This study is guided by the research question:

Which factors best predict the intent of librarians in the Philippines to pursue doctoral (Ph.D.) studies within the next five years?

Building on this, we further explore related sub-questions:

  • What demographic, professional, and institutional characteristics are most strongly associated with an expressed intention to undertake doctoral education?

  • How do these predictors interact, and are there patterns that suggest systemic inequities or barriers in access to advanced LIS education?

  • Can machine learning models effectively identify high-potential candidates for doctoral study, and what are the implications for policy and institutional support?

By systematically analyzing survey data from a nationwide census of librarians, this study aims to provide actionable insights for LIS schools, policymakers, and professional organizations seeking to nurture talent, address access gaps, and support the future development of the field.

3 Methodology

3.1 Data Source

This study uses data from the Philippine Librarians Census [19], a landmark survey conducted by the University of the Philippines School of Library and Information Studies from November 2018 to October 2019. The dataset, made openly available via Zenodo, represents the most comprehensive occupational census of professional librarians in the country to date. It captures a rich array of demographic, educational, employment, and career aspiration variables from over 600 respondents across the Philippine archipelago.

The census was designed to establish baseline data for LIS workforce planning and to inform educational, professional, and policy decisions. Variables include age, years of professional service, current and previous education, gross and net salary, institutional type (e.g., academic, government, special, private), job position, region, and current enrollment in further studies, as well as self-reported plans regarding advanced academic degrees.

3.2 Data Preparation

To prepare the dataset for predictive modeling, a series of cleaning and preprocessing steps were undertaken:

  • Removal of Non-predictive and Target-Leaking Variables: Columns not relevant to the prediction task, such as unique respondent identifiers and variables that could leak future information about the outcome (e.g., “currently pursuing PhD”), were excluded.

  • Filtering Ambiguous Responses: All rows containing “Unknown” responses in any categorical field were removed to maximize data integrity and interpretability.

  • Imputation of Missing Numeric Data: For continuous variables (such as age, years of service, salaries), missing values were imputed with the median of each respective feature, a robust choice given the non-normal distribution often seen in such workforce data.

  • Factor Conversion: All categorical variables were converted to factors in R to enable compatibility with both logistic regression and random forest models.

  • Definition of Outcome Variable: The primary outcome, “intent to pursue Ph.D. in the next five years,” was operationalized as a binary variable: 1 for those who selected “Ph.D.” as a planned credential, and 0 for all others. Cases with unclear or missing intentions were omitted from analysis.

  • Train/Test Split: To evaluate predictive performance, the dataset was randomly split into training (80%) and testing (20%) sets, with stratification by the outcome variable so that both sets preserve the overall class proportions.
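The preparation steps above can be sketched in a few lines. The study itself was carried out in R; the following is only an illustrative Python sketch using the standard library, and the field names (`age`, `planned_credential`, `phd_intent`), the toy records, and the helper functions are all hypothetical rather than taken from the census data or the project repository.

```python
import random
import statistics

def drop_unknown(rows):
    """Remove any row that contains an "Unknown" response."""
    return [r for r in rows if "Unknown" not in r.values()]

def impute_median(rows, field):
    """Replace missing (None) values of a numeric field with its median."""
    observed = [r[field] for r in rows if r[field] is not None]
    median = statistics.median(observed)
    return [{**r, field: median if r[field] is None else r[field]} for r in rows]

def add_outcome(rows, plan_field="planned_credential"):
    """Binary outcome: 1 if the planned credential is "Ph.D.", else 0."""
    return [{**r, "phd_intent": 1 if r[plan_field] == "Ph.D." else 0} for r in rows]

def stratified_split(rows, label, train_frac=0.8, seed=42):
    """Split each class separately so both sets keep the class proportions."""
    rng = random.Random(seed)
    by_class = {}
    for r in rows:
        by_class.setdefault(r[label], []).append(r)
    train, test = [], []
    for members in by_class.values():
        members = members[:]          # copy before shuffling
        rng.shuffle(members)
        cut = round(len(members) * train_frac)
        train.extend(members[:cut])
        test.extend(members[cut:])
    return train, test

# Hypothetical toy records (not census data).
ages = [29, 35, None, 41, 44, 38, 52, 47, 33, 40, 50]
plans = ["Ph.D.", "Master's", "Ph.D.", "None", "Ph.D.", "None",
         "Master's", "Ph.D.", "None", "Ph.D.", "Unknown"]
records = [{"age": a, "planned_credential": p} for a, p in zip(ages, plans)]

clean = add_outcome(impute_median(drop_unknown(records), "age"))
train, test = stratified_split(clean, "phd_intent")
```

Because the split is done within each class, the share of positive cases is the same in `train` and `test` up to rounding, which is the property the stratification step is meant to guarantee.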

3.3 Modeling Approach

The study employed two supervised machine learning algorithms to predict doctoral study intent:

  • Logistic Regression: Used as a baseline, interpretable model to estimate the direction and statistical significance of each predictor’s association with the outcome.

  • Random Forest: A robust ensemble approach, capable of capturing non-linear relationships and feature interactions, and providing measures of variable importance.

Feature Selection

Feature inclusion was guided by literature, practical relevance, and data quality. Predictors included demographic factors (age, region), professional characteristics (years of service, position, institution type), and educational status (current enrollment). Features with excessive missingness or direct overlap with the target were excluded.

Hyperparameter Tuning and Sensitivity Analysis

Random forest hyperparameters were tuned using 5-fold cross-validation in the caret R package [20], focusing on the mtry parameter (the number of variables sampled at each split), with the best value selected by cross-validated AUC (area under the ROC curve). Additional sensitivity analyses assessed the stability of test-set accuracy across different numbers of trees (ntree).
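The logic of tuning one hyperparameter by k-fold cross-validation can be illustrated with a short sketch. This is not the caret workflow used in the study; it is a stdlib-only Python outline in which `k_fold_indices`, `tune`, and the placeholder scorer `toy_score` are hypothetical names, and the scorer returns an arbitrary value rather than a real model's AUC.

```python
import random
from statistics import mean

def k_fold_indices(n, k=5, seed=1):
    """Shuffle 0..n-1 and deal the indices into k nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def tune(grid, n, fit_and_score, k=5, seed=1):
    """Return (best_value, mean_scores) for a one-parameter grid.

    fit_and_score(value, train_idx, valid_idx) is expected to fit a model
    on train_idx and return a validation score (e.g. AUC) on valid_idx.
    """
    folds = k_fold_indices(n, k, seed)
    mean_scores = {}
    for value in grid:
        scores = []
        for i, valid_idx in enumerate(folds):
            # Training indices are all folds except the held-out one.
            train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
            scores.append(fit_and_score(value, train_idx, valid_idx))
        mean_scores[value] = mean(scores)
    best = max(mean_scores, key=mean_scores.get)
    return best, mean_scores

def toy_score(mtry, train_idx, valid_idx):
    """Placeholder scorer with an arbitrary peak; NOT a real model or result."""
    return 1 / (1 + abs(mtry - 3))

# Grid of candidate mtry values 2..6; n=276 is used only as an example size.
folds = k_fold_indices(276)
best_mtry, cv_scores = tune(range(2, 7), n=276, fit_and_score=toy_score)
```

In practice `fit_and_score` would train a random forest with the given mtry on the training folds and return the AUC on the held-out fold, so that the chosen value never depends on the final test set.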

Model Evaluation

Model performance was evaluated on the held-out test set using:

  • Accuracy

  • AUC (Area Under the ROC Curve)

  • Confusion Matrix

  • Sensitivity/Specificity
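As a reminder of what the AUC measures, it can be computed directly as the probability that a randomly chosen positive case receives a higher predicted score than a randomly chosen negative case, with ties counted as one half. The study used the pROC package in R; the sketch below is an independent stdlib-only Python illustration, and the labels and scores are hypothetical toy values.

```python
def auc(labels, scores):
    """AUC via pairwise comparison of positive and negative scores."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    # Each positive/negative pair contributes 1 for a win and 0.5 for a tie.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical toy predictions (not study output).
labels = [1, 0, 1, 0, 0, 1]
scores = [0.9, 0.4, 0.6, 0.7, 0.2, 0.6]
toy_auc = auc(labels, scores)  # 7 of 9 pairs are ordered correctly
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, which gives a scale for reading the values reported in the results.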

All analyses were conducted using open-source R packages (tidyverse [21], randomForest [22], caret [20], broom [23], pROC [24], and ggplot2 [25]).

Reproducibility and Open Science

To support transparency, peer review, and future research, all code, data processing scripts, and key analysis workflows for this study are openly available in a dedicated GitHub repository: https://github.com/dddorado/PhDAspirationLIS. This repository contains R scripts for data cleaning, modeling, hyperparameter tuning, and visualization, as well as all the files needed to reproduce the results and figures presented in this paper. By providing complete analytical reproducibility, we encourage other researchers to validate, adapt, or extend our work for additional contexts or comparative studies. This commitment aligns with open science best practices and the goals of the ICADL community.

4 Results

4.1 Class Balance and Descriptive Statistics

After data cleaning and preprocessing, the training data comprised 276 librarians with complete information (see Table 1). Among these, approximately 32% (88 of 276) expressed an intent to pursue a Ph.D. within the next five years, indicating that doctoral aspirations are held by a minority of the Philippine LIS workforce. The remaining respondents indicated other career plans or were not considering further study. The class distribution was moderately imbalanced, which was addressed through careful model evaluation.

Table 1. Class balance in training data
Ph.D. Intent    Count
No              188
Yes              88

Key demographic and employment characteristics of the sample included a wide range of ages, years of professional service, institutional types (e.g., academic, government, special), and geographic locations. Median age was in the early 40s, and most respondents were employed in academic or government libraries.

4.2 Model Performance and Tuning

Both random forest and logistic regression models were trained to predict Ph.D. study intent using the selected predictors. Hyperparameter tuning for the random forest (varying mtry between 2 and 6) was performed using 5-fold cross-validation, optimizing for area under the ROC curve (AUC). The optimal value was mtry = 3, which yielded a cross-validated AUC of 0.72 (see Figure 1).

Figure 1. Cross-validated ROC (AUC) vs. mtry in Random Forest tuning.

Sensitivity analysis on the number of trees (ntree) demonstrated stable model performance (see Table 2), with test set accuracy ranging from 0.65 to 0.69 across 100 to 1000 trees. Model selection thus prioritized parsimony without sacrificing predictive power.

Table 2. Random Forest Test Set Accuracy Across ntree
ntree    accuracy
100      0.6911765
250      0.6617647
500      0.6764706
1000     0.6470588

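On the held-out test set (see Table 3), both models showed high sensitivity (≥0.86) but low specificity (0.23). These values are consistent with “No” being treated as the positive class: in Table 3, 42 of the 46 librarians not intending to pursue a Ph.D. were correctly identified, but only 5 of the 22 aspiring librarians were flagged. The models are therefore better at identifying librarians who do not intend to pursue a Ph.D. than at detecting those who aspire to advanced study. Confusion matrices for both models are shown below.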

Random Forest Test Set Accuracy: 0.69
Random Forest Test Set AUC: 0.686

Table 3. Random Forest Test Set Confusion Matrix
Prediction    Reference    Freq
No            No           42
Yes           No            4
No            Yes          17
Yes           Yes           5
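The summary metrics can be recomputed from the counts in Table 3. The sketch below is an illustrative Python calculation, not the authors' R code; the helper `rates` is a hypothetical name. It evaluates the matrix with each class in turn treated as positive, since the source does not state which convention was used; the reported specificity of 0.23 matches the case where “No” is the positive class, which is an inference from the numbers.

```python
# Counts from Table 3, keyed as (prediction, reference).
cm = {("No", "No"): 42, ("Yes", "No"): 4, ("No", "Yes"): 17, ("Yes", "Yes"): 5}

def rates(cm, positive):
    """Accuracy, sensitivity, and specificity for a chosen positive class."""
    negative = "Yes" if positive == "No" else "No"
    tp = cm[(positive, positive)]   # predicted positive, truly positive
    fn = cm[(negative, positive)]   # predicted negative, truly positive
    tn = cm[(negative, negative)]   # predicted negative, truly negative
    fp = cm[(positive, negative)]   # predicted positive, truly negative
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }

no_positive = rates(cm, positive="No")    # sensitivity 42/46, specificity 5/22
yes_positive = rates(cm, positive="Yes")  # sensitivity 5/22, specificity 42/46
```

Accuracy is 47/68 under either convention, while sensitivity and specificity simply swap, so the choice of positive class determines how the two reported rates should be read.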

4.3 Feature Importance and Predictors

Random Forest

Figure 2 shows the top 10 most important features according to the random forest model (by Mean Decrease in Gini). These findings suggest that more experienced and higher-earning librarians, as well as those already engaged in postgraduate study, are more likely to aspire to a Ph.D.

Figure 2. Barplot of top 10 random forest feature importances.
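For readers unfamiliar with the importance measure, Mean Decrease in Gini sums the reduction in node impurity achieved by splits on a variable and averages it over the trees. The following Python sketch computes the impurity decrease for a single split; it is a simplified illustration rather than the randomForest implementation, and the node contents are hypothetical.

```python
def gini(labels):
    """Gini impurity of a node: 1 minus the sum of squared class shares."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def gini_decrease(parent, left, right):
    """Impurity of the parent minus the size-weighted impurity of its children."""
    n = len(parent)
    return gini(parent) - (len(left) / n) * gini(left) - (len(right) / n) * gini(right)

# Hypothetical node with 4 "Yes" and 6 "No" cases, split into two children.
parent = ["Yes"] * 4 + ["No"] * 6
left = ["Yes"] * 3 + ["No"] * 1
right = ["Yes"] * 1 + ["No"] * 5
decrease = gini_decrease(parent, left, right)
```

Note that this quantity measures how much a variable helps separate the classes but carries no sign, so it does not by itself indicate whether higher values of a feature raise or lower the likelihood of doctoral intent.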

Logistic Regression

Figure 3. Barplot of top 10 logistic regression coefficients (direction and magnitude).

Figure 3 displays the top 10 logistic regression coefficients, with color coding to indicate positive or negative association. Notably:

  • Positive predictors: Current enrollment in further study (enrolledYes), management positions (positionManagement), and working in ICT or academic institutions.

  • Negative predictors: Affiliation with heritage work or special libraries, certain regions (e.g., Visayas), and employment in public/school library sectors.

Effect sizes for several predictors were large, though some variables did not reach statistical significance after correcting for multiple comparisons. The direction of effects provides actionable insight for identifying groups with higher or lower likelihood of doctoral aspirations.
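Logistic regression coefficients are on the log-odds scale, so exponentiating them gives odds ratios that are often easier to interpret. The Python sketch below shows the conversion; the coefficient values and predictor names are hypothetical placeholders and are not estimates from this study.

```python
import math

# Hypothetical log-odds coefficients (NOT values from the study).
coefs = {"x_positive": 1.0, "x_negative": -0.5, "x_null": 0.0}

# exp(beta) is the multiplicative change in the odds of the outcome
# when the predictor increases by one unit (or the dummy switches on).
odds_ratios = {name: math.exp(beta) for name, beta in coefs.items()}

def probability(intercept, beta):
    """Predicted probability when a single dummy predictor equals 1."""
    return 1 / (1 + math.exp(-(intercept + beta)))

# With a hypothetical intercept of -1.0, compare the dummy off versus on.
p_off = probability(-1.0, 0.0)
p_on = probability(-1.0, coefs["x_positive"])
```

An odds ratio above 1 corresponds to a positive coefficient and one below 1 to a negative coefficient, which is the direction information conveyed by the color coding in Figure 3.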

5 Discussion

5.1 Interpretation of Top Predictors and Equity Considerations

The analysis identified several key predictors of librarians’ intent to pursue Ph.D. studies, including age, years of service, salary, current enrollment in postgraduate study, job position, institutional type, and region. Among these, being currently enrolled in further studies and holding a management or supervisory role stood out as strong positive predictors of doctoral aspirations. Conversely, librarians working in certain sectors, such as heritage work or special libraries, or those employed in public or school libraries, were less likely to express intent to pursue a doctorate.

From an equity perspective, these results suggest persistent disparities in access to advanced LIS education. Librarians in higher-paying, urban, or academic roles appear to have greater opportunity and motivation to consider doctoral study, while those in lower-resourced institutions or outside major centers face more significant barriers. Such barriers may include limited institutional support, fewer professional development opportunities, or a lack of mentorship and role models. The influence of ongoing academic engagement also underscores the importance of pathways that encourage continued education and professional growth, potentially widening the gap between those with and without institutional resources.

5.2 Comparison with Prior Work

These findings align with prior research showing that institutional environment, professional status, and access to academic networks significantly shape career trajectories in LIS [1, 8]. The positive association between management roles and doctoral intent mirrors trends in other countries, where leadership aspirations and advanced study are closely linked. Similarly, the observed disparities by region and sector reflect global challenges of equity in doctoral education, as discussed in the literature on underrepresented groups and institutional support [10].

The utility of machine learning models for identifying predictors is consistent with previous studies in educational data mining and workforce analytics [12, 14], further validating the methodological approach. However, the moderate model performance (AUC ≈ 0.69) also suggests that non-quantitative factors, such as personal motivation, family responsibilities, or organizational culture, may play a substantial role that is difficult to capture with structured survey data alone.

5.3 Implications for Policy, Institutions, and the LIS Field

The findings highlight actionable opportunities for policymakers and educational leaders to promote greater equity in access to doctoral study. Targeted interventions—such as scholarships, flexible study arrangements, and mentorship programs—could help bridge gaps for librarians in underrepresented regions, sectors, and roles. Institutions might also consider developing “bridge” or preparatory programs to support the transition from professional to academic pathways, particularly for those outside the academic mainstream. At a broader level, professional associations could use predictive analytics to proactively identify and support potential doctoral candidates.

For the LIS field, investing in the development of a diverse doctoral pipeline is crucial for sustaining research capacity, innovation, and the ability to address emerging challenges in digital librarianship and knowledge equity.

5.4 Limitations

Several limitations should be acknowledged. First, the study relies on self-reported data, which may be subject to social desirability bias or inaccurate recall. The cross-sectional design limits inferences about causality and may not fully capture dynamic career decision processes. The data, while comprehensive for the Philippines, may not generalize to other national or cultural contexts. Furthermore, the “intent” to pursue doctoral study does not guarantee eventual enrollment or completion. Finally, the moderate predictive accuracy of the models points to the need for additional variables—potentially from qualitative or longitudinal sources—to fully explain doctoral aspirations.

5.5 Recommendations for Future Research

Future studies should explore integrating richer contextual and qualitative data to uncover motivations, barriers, and enablers of doctoral study that go beyond demographic and employment variables. Longitudinal research could clarify how aspirations evolve over time and what factors predict actual enrollment and completion. Expanding similar analyses to other countries or professional groups would allow for valuable cross-contextual comparisons. Additionally, exploring the impact of targeted interventions using quasi-experimental or mixed methods could provide robust evidence for effective policy and program design.

6 References

1.
Sugimoto, C.R., Li, D., Russell, T.G., Finlay, S.C., Ding, Y.: The shifting sands of disciplinary development: Analyzing North American library and information science dissertations using latent Dirichlet allocation. Journal of the American Society for Information Science and Technology. 62, (2011). https://doi.org/10.1002/asi.21435.
2.
Rehman, S., Chaudhry, A.S., Alasousi, H.O.: Assessing the need for PhD in information studies: Stakeholder insights. OALib. 06, (2019). https://doi.org/10.4236/oalib.1105322.
3.
Wang, T., Lund, B., Dow, M.: A bibliometrics study of library and information science doctoral dissertations in China from 2011 to 2020. Education for Information. 38, (2022). https://doi.org/10.3233/EFI-211545.
4.
Zareef, M., Arif, M., Jabeen, M.: Research trends in LIS: The case of doctoral research in Pakistan, 1981–2021. Journal of Librarianship and Information Science. 56, (2024). https://doi.org/10.1177/09610006231161331.
5.
Hands, A.S.: What’s your type? An examination of first-year doctoral student motivation. Education for Information. 36, (2020). https://doi.org/10.3233/EFI-200373.
6.
Franklin, K.Y.: Re-examining the socialization of Black doctoral students through the lens of information theory. Journal of Critical Library and Information Studies. 3, (2020). https://doi.org/10.24242/jclis.v3i3.146.
7.
Lee, J., Anderson, A., Burnett, G.: Peer relationships and mentoring between LIS doctoral students: A qualitative approach. Journal of Librarianship and Information Science. 49, (2017). https://doi.org/10.1177/0961000615592024.
8.
McNelis, A.M., Dreifuerst, K.T., Schwindt, R.: Doctoral education and preparation for nursing faculty roles. Nurse Educator. 44, (2019). https://doi.org/10.1097/NNE.0000000000000597.
9.
Chang, Y.W., Huang, M.H.: A study of the evolution of interdisciplinarity in library and information science: Using three bibliometric methods. Journal of the American Society for Information Science and Technology. 63, (2012). https://doi.org/10.1002/asi.21649.
10.
Burford, J., Mitchell, C.: Varied starting points and pathways. Reconceptualizing Educational Research Methodology. 10, (2019). https://doi.org/10.7577/rerm.3242.
11.
Jabeen, M., Yun, L., Rafiq, M., Jabeen, M.: Research productivity of library scholars. New Library World. 116, (2015). https://doi.org/10.1108/nlw-11-2014-0132.
12.
Tjahyaningtijas, H.P.A., Husin, N., Habib, H.A., Asmunin, A., Wibawa, R.P., Sumarno, A., Paragas, J.R., Susantini, E.: Machine learning on academic education: Bibliometric studies. In: E3S Web of Conferences (2023). https://doi.org/10.1051/e3sconf/202345002010.
13.
Samandarov, E., Abduraimov, D., Xudayberdiyev, A., Normatova, M., Butaboyev, A., To‘ychiyeva, Z., Taniberdiyev, A.: Comprehensive review of educational platform for assessing and classifying students’ knowledge levels utilizing machine learning. 13662, 240–250 (2025). https://doi.org/10.1117/12.3073010.
14.
Ahrens, A., Zascerinska, J., Melnikova, J., Andreeva, N.: An innovative method for data mining in higher education. In: Rural environment. Education. Personality. (REEP): Proceedings of the 11th international scientific conference (2018). https://doi.org/10.22616/reep.2018.001.
15.
You, S., Joo, S., Katsurai, M.: Data mining topics in the discipline of library and information science: Analysis of influential terms and Dirichlet multinomial regression topic model. Aslib Journal of Information Management. 76, (2024). https://doi.org/10.1108/AJIM-05-2022-0260.
16.
Seo, K., Tang, J., Roll, I., Fels, S., Yoon, D.: The impact of artificial intelligence on learner–instructor interaction in online learning. International Journal of Educational Technology in Higher Education. 18, (2021). https://doi.org/10.1186/s41239-021-00292-9.
17.
Patiño, G.A., Roberts, L.W.: The need for greater transparency in journal submissions that report novel machine learning models in health professions education. Academic Medicine. 99, 935–937 (2024). https://doi.org/10.1097/ACM.0000000000005793.
18.
Babu, S., Gibreel, M.O.M.: Utilization of educational data mining for classifying and predicting students’ performance, dropouts as well as teachers’ performance. Technoarete Transactions on Application of Information and Communication Technology (ICT) in Education. 1, (2022). https://doi.org/10.36647/ttaicte/01.01.a001.
19.
Obille, K.L.B., Dorado, D.A.D.: Philippine librarians census. https://doi.org/10.5281/ZENODO.6864788.
20.
Kuhn, M.: Building predictive models in R using the caret package. Journal of Statistical Software. 28, 1–26 (2008). https://doi.org/10.18637/jss.v028.i05.
21.
Wickham, H., Averick, M., Bryan, J., Chang, W., McGowan, L.D., François, R., Grolemund, G., Hayes, A., Henry, L., Hester, J., Kuhn, M., Pedersen, T.L., Miller, E., Bache, S.M., Müller, K., Ooms, J., Robinson, D., Seidel, D.P., Spinu, V., Takahashi, K., Vaughan, D., Wilke, C., Woo, K., Yutani, H.: Welcome to the tidyverse. Journal of Open Source Software. 4, 1686 (2019). https://doi.org/10.21105/joss.01686.
22.
Liaw, A., Wiener, M.: Classification and regression by randomForest. R News. 2, 18–22 (2002).
23.
Robinson, D., Hayes, A., Couch, S.: Broom: Convert statistical objects into tidy tibbles. (2024).
24.
Robin, X., Turck, N., Hainard, A., Tiberti, N., Lisacek, F., Sanchez, J.-C., Müller, M.: pROC: An open-source package for R and S+ to analyze and compare ROC curves. BMC Bioinformatics. 12, 77 (2011).
25.
Wickham, H.: ggplot2: Elegant graphics for data analysis. Springer-Verlag New York (2016).