Applying topic modeling in HRD Research: Revealing constructs and conceptual relationships from 3,986 peer-reviewed journal article abstracts

Ying Feng
12/10/2020

The Context and the purpose

Quantitative text analysis has a long history: e.g. content analysis (from the 1930s), information retrieval, and scientometric analysis (from the 1950s). The first citation analysis of HRD publications was by Sleezer and Sleezer (1998).

Definition of Terms

token & tokenization: a token is the unit of text analysis; tokenization splits the text into tokens
corpus (plural: corpora): a collection of texts
(latent) topics: only vaguely defined
topic modeling: identifying the meaning of words from their context and assigning topics to documents
- advantages

Theory Building

Publishing in professional journals is central to research productivity, and the set of journals has long been considered a central institution of science (e.g. Doreian, 1988). An abstract, in turn, is written as a summary of a journal article.
Thus, analyzing journal article abstracts can give a picture of the combined explicit knowledge of the HRD field.

Data collection & Wrangling

5 leading HRD journals, 3,986 abstracts, 1977-2020, downloaded from Scopus & combined using the Excel Query Editor

Perform tokenization & data wrangling

Iterative process of removing stop words, punctuation, proper nouns, etc. using R (quanteda & tidytext packages)
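
As a rough illustration of what one cleaning pass does, here is a minimal sketch in plain Python. It is not the quanteda/tidytext pipeline used in this project: the stop-word list, the `tokenize` helper, and the sample sentence are all made up for illustration, and proper-noun removal is not attempted.

```python
import re

# Tiny illustrative stop-word list (an assumption for this sketch);
# the lists shipped with text-mining packages are much longer.
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "on", "to", "is", "for", "this"}

def tokenize(text, stop_words=STOP_WORDS):
    """Lowercase the text, keep alphabetic runs as tokens, drop stop words."""
    # The pattern keeps letters only, so punctuation and digits are discarded.
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in stop_words]

# Hypothetical abstract sentence, not taken from the corpus.
abstract = "This study examines the role of training in employee engagement."
print(tokenize(abstract))
# ['study', 'examines', 'role', 'training', 'employee', 'engagement']
```

In practice the step is iterative: inspect the most frequent remaining tokens, extend the stop-word list, and run the pass again.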

Document feature matrix (DFM)

Create the bag-of-words model: the document-feature matrix (DFM).
The DFM is large, with 3,064 documents (texts) and 11,828 features. However, the value is 0 for most of the cells.
There are two big issues: the sparsity problem (this DFM is 99.6% sparse) and the dimensionality problem.
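
To make "sparsity" concrete, the sketch below builds a toy document-feature matrix in plain Python and computes the share of zero cells. It is an illustration only, not the quanteda code behind the figures above; the helper names `build_dfm` and `sparsity` and the three toy documents are assumptions.

```python
from collections import Counter

def build_dfm(token_lists):
    """Return (vocabulary, matrix); matrix[d][f] counts feature f in document d."""
    vocab = sorted({tok for doc in token_lists for tok in doc})
    index = {tok: i for i, tok in enumerate(vocab)}
    matrix = []
    for doc in token_lists:
        row = [0] * len(vocab)
        for tok, n in Counter(doc).items():
            row[index[tok]] = n
        matrix.append(row)
    return vocab, matrix

def sparsity(matrix):
    """Share of cells in the matrix that are zero."""
    cells = sum(len(row) for row in matrix)
    zeros = sum(1 for row in matrix for value in row if value == 0)
    return zeros / cells

# Three hypothetical, already tokenized documents.
docs = [
    ["training", "transfer", "training"],
    ["leadership", "development"],
    ["career", "development", "mentoring"],
]
vocab, dfm = build_dfm(docs)
print(vocab)  # ['career', 'development', 'leadership', 'mentoring', 'training', 'transfer']
print(dfm)    # [[0, 0, 0, 0, 2, 1], [0, 1, 1, 0, 0, 0], [1, 1, 0, 1, 0, 0]]
print(round(sparsity(dfm), 3))  # 0.611
```

Even this toy matrix is 61% zeros. At 3,064 × 11,828 (about 36.2 million cells), 99.6% sparsity leaves only on the order of 145,000 nonzero cells, which is why such matrices are normally stored in a sparse format.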

Topic modeling

Use structural topic modeling (the stm package in R) to extract the latent topics of the corpus.

(Intrinsic) model evaluation

In topic modeling, we don't know the number of topics (k) ahead of time; indeed, the literature holds that there is no “right” number of topics for any given corpus.
Meanwhile, the number of topics plays a critical role in how the corpus is represented. Therefore, we need to fit topic models with different values of k and evaluate them (e.g. see the four diagnostic plots below) to find a “good” number.

[Figure: four diagnostic plots for topic models fitted with different k]
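
One intrinsic diagnostic commonly used when comparing values of k is semantic coherence: a topic scores higher when its top words tend to occur in the same documents. The source does not say which diagnostics the four plots show, so treating coherence as one of them is an assumption. The sketch below is a plain-Python, UMass-style version for illustration, not the stm implementation; the function names and the toy corpus are made up.

```python
import math
from itertools import combinations

def doc_freq(doc_sets, *words):
    """Number of documents that contain all of the given words."""
    return sum(1 for d in doc_sets if all(w in d for w in words))

def semantic_coherence(top_words, docs):
    """Sum over word pairs of log((D(wi, wj) + 1) / D(wj)),
    where wj is ranked above wi in the topic's top-word list.
    Assumes every top word occurs in at least one document."""
    doc_sets = [set(d) for d in docs]
    score = 0.0
    for j, i in combinations(range(len(top_words)), 2):  # j < i
        wi, wj = top_words[i], top_words[j]
        score += math.log((doc_freq(doc_sets, wi, wj) + 1) / doc_freq(doc_sets, wj))
    return score

# Hypothetical tokenized corpus with two obvious themes.
docs = [
    ["training", "transfer", "learning"],
    ["training", "learning", "evaluation"],
    ["leadership", "development", "career"],
    ["leadership", "career", "mentoring"],
]
coherent = semantic_coherence(["training", "learning"], docs)
mixed = semantic_coherence(["training", "career"], docs)
print(round(coherent, 3), round(mixed, 3))  # 0.405 -0.693
```

When comparing models, such scores would typically be averaged over all topics for each candidate k and read alongside the other diagnostics rather than on their own, since coherence alone tends to favor a small number of broad topics.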

Limitations

collect & prepare data (API, crawl…)
how much supervision for document classification (unsupervised, semi-supervised, & supervised)
sparsity & dimensionality problems (the curse of text analysis)
“reading tea leaves” (over-interpret the story told by the results)

Discussion & Implications

Topic modeling neither gives an objective representation of the “true meaning” of a text nor replaces human judgment, but it can assist in reading and interpreting a large corpus, especially long texts with a consistent structure such as journal article abstracts.
The findings of this project do not present a “true representation of explicit knowledge or classification”; rather, they probe the latent constructs and conceptual relationships in HRD research.

References

Doreian, P. (1988). Testing structural-equivalence hypotheses in a network of geographical journals. Journal of the American Society for Information Science, 39(2), 79-85.

Silge, J., & Robinson, D. (2017). Text mining with R: A tidy approach. O'Reilly Media, Inc.

Sleezer, C. M., & Sleezer, J. H. (1998). The status of HRD research in the United States from 1980 to 1994. Human Resource Development International, 1(4), 451-460. DOI: 10.1080/13678869800000056

Nonaka, I., & Takeuchi, H. (1995). The knowledge-creating company: How Japanese companies create the dynamics of innovation. New York: Oxford University Press.