INTRODUCTION
This independent analysis examines the effectiveness of topic modeling approaches for automatically classifying topics in collections of multi-disciplinary research articles. Empirical studies suggest that topic modeling can support effective synthesis and recommendation of research articles by matching on semantic content rather than keywords (Chaitanya & Singh, 2017). For instance, Muchene and Safari (2021) present a two-stage topic modelling analysis of scientific publications at the University of Nairobi in Kenya. Such works, together with the case studies presented in this course, piqued my interest in pursuing this topic.
Research Question
The main goal of this analysis is to examine and validate the effectiveness of topic modeling in classifying multi-disciplinary research topics based on the provided corpus of research articles.
RQ1: How effective is topic modeling in classifying multi-disciplinary journal articles?
Dataset
The dataset used in this independent analysis is from Kaggle and contains a collection of 8,990 titles and abstracts of multi-disciplinary journal articles. The articles are drawn from the following topics:
* Computer Science
* Physics
* Mathematics
* Statistics
* Quantitative Biology
* Quantitative Finance
For the purpose of this analysis, I worked with 1,148 observations randomly selected from the source corpus.
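The sampling step itself is not shown in this document; below is a minimal sketch of how such a subsample could be drawn, assuming the full Kaggle file had been loaded as a hypothetical full_kaggle_data tibble (the object name and seed are my assumptions, not the original code).
# hypothetical sketch: draw the 1,148-row subsample from the full corpus
set.seed(2022)
journalarticles_sample <- full_kaggle_data %>%
  slice_sample(n = 1148)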
Methodology
As stated, the main goal of this study is to examine how effective topic modeling is in classifying topics according to their original intended themes. This means I will validate and compare the resulting topics against the six key topics pre-provided by the source authors on Kaggle. The study follows the data-intensive research workflow proposed by Krumm et al. (2018).
Loading Libraries
library(tidyverse)   # data wrangling and ggplot2
library(tidytext)    # tidy tokenization and tidiers for topic models
library(SnowballC)   # Porter stemming via wordStem()
library(topicmodels) # LDA
library(stm)         # structural topic models
library(ldatuning)   # FindTopicsNumber() for choosing K
library(knitr)       # kable() tables
library(LDAvis)      # interactive topic browser
To start the wrangling process, I imported the data from Kaggle into R and performed tidying and tokenization of the text.
#reading csv file into r
journalarticles_data <- read_csv("data/journalarticles.csv",
                                 col_types = cols(id = col_character(),
                                                  title = col_character(),
                                                  abstract = col_character()))
For the purpose of this analysis, and taking into consideration the limited computational power of my machine, I used only the titles of the research articles for topic modeling.
#tokenizing the titles of journal articles
articles_tidy <- journalarticles_data %>%
  unnest_tokens(output = word, input = title) %>%
  anti_join(stop_words, by = "word")
articles_tidy
## # A tibble: 8,291 x 3
## id abstract word
## <chr> <chr> <chr>
## 1 21449 "A filter for universal real-time prediction of band-limited ~ univers~
## 2 21449 "A filter for universal real-time prediction of band-limited ~ negative
## 3 21449 "A filter for universal real-time prediction of band-limited ~ delay
## 4 21449 "A filter for universal real-time prediction of band-limited ~ filter
## 5 21449 "A filter for universal real-time prediction of band-limited ~ predict~
## 6 21449 "A filter for universal real-time prediction of band-limited ~ band
## 7 21449 "A filter for universal real-time prediction of band-limited ~ limited
## 8 21449 "A filter for universal real-time prediction of band-limited ~ signals
## 9 21555 "In this article we describe vector bundles over projectivoid~ vector
## 10 21555 "In this article we describe vector bundles over projectivoid~ bundles
## # ... with 8,281 more rows
#word count
articles_tidy %>%
count(word, sort = TRUE)
## # A tibble: 3,566 x 2
## word n
## <chr> <int>
## 1 learning 86
## 2 based 63
## 3 networks 61
## 4 neural 50
## 5 data 49
## 6 model 48
## 7 analysis 43
## 8 network 43
## 9 deep 41
## 10 time 39
## # ... with 3,556 more rows
From the word count, it can be observed that the most common words are "learning", "networks", "neural" and "data". I decided to further investigate whether the titles containing the word "neural" have any themes in common.
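The chunk that produced the sample below is not echoed in this document; here is a minimal sketch of one way to pull such a sample (the regex and the use of slice_sample() are my assumptions, and some displayed titles are truncated, so the exact condition is unclear).
# sketch: sample 10 titles that mention "neural"
journalarticles_data %>%
  filter(str_detect(title, regex("neural", ignore_case = TRUE))) %>%
  slice_sample(n = 10) %>%
  select(title)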
## # A tibble: 10 x 1
## title
## <chr>
## 1 ZICS: an application for calculating the stationary probability distribution~
## 2 MirBot: A collaborative object recognition system for smartphones using conv~
## 3 Co-design of jump estimators and transmission policies for wireless multi-ho~
## 4 Reconstruction of three-dimensional porous media using generative adversaria~
## 5 Solving ill-posed inverse problems using iterative deep neural networks
## 6 Navigation of brain networks
## 7 Generalisation in humans and deep neural networks
## 8 Periodic Steiner networks minimizing length
## 9 Photometric redshifts for the Kilo-Degree Survey. Machine-learning analysis ~
## 10 Multiple domination models for placement of electric vehicle charging statio~
The next step involved creating a document-term matrix (DTM) that can be used with Latent Dirichlet Allocation (LDA) to model the potential topics. In this case, I treated each title as a document, since the titles are independent and carry unique IDs in the article collection.
#cast Document Term Matrix
articles_dtm <- articles_tidy %>%
  count(id, word) %>%
  cast_dtm(id, word, n)
articles_dtm
## <<DocumentTermMatrix (documents: 1148, terms: 3566)>>
## Non-/sparse entries: 8162/4085606
## Sparsity : 100%
## Maximal term length: 23
## Weighting : term frequency (tf)
From the articles_dtm object, it can be observed that a total of 1,148 documents and 3,566 terms are included in the matrix. Sparsity is the proportion of zero entries in the document-term matrix; here only 8,162 of roughly 4.09 million cells are non-zero, which rounds to 100%.
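As a quick check, the reported sparsity can be recomputed directly from the printout above:
# sparsity = share of empty (zero) cells in the DTM
n_nonzero <- 8162
n_cells   <- 1148 * 3566
1 - n_nonzero / n_cells  # ~0.998, which the print method rounds to 100%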
## [1] "DocumentTermMatrix" "simple_triplet_matrix"
## <<DocumentTermMatrix (documents: 1148, terms: 3566)>>
## Non-/sparse entries: 8162/4085606
## Sparsity : 100%
## Maximal term length: 23
## Weighting : term frequency (tf)
For this independent analysis, I opted to perform stemming in order to conflate words with related meanings and reduce the vocabulary overall. I used the textProcessor() function to preprocess the titles for later use with the structural topic modeling (STM) algorithm.
#textProcessor
temp <- textProcessor(journalarticles_data$title,
                      metadata = journalarticles_data,
                      lowercase = TRUE,
                      removestopwords = TRUE,
                      removenumbers = TRUE,
                      removepunctuation = TRUE,
                      wordLengths = c(3, Inf),
                      stem = TRUE,
                      onlycharacter = FALSE,
                      striphtml = TRUE,
                      customstopwords = NULL)
## Building corpus...
## Converting to Lower Case...
## Removing punctuation...
## Removing stopwords...
## Removing numbers...
## Stemming...
## Creating Output...
#stm inputs
meta <- temp$meta
vocab <- temp$vocab
docs <- temp$documents
I then used the wordStem() function to create a new column of stemmed words, and afterwards performed a stem count.
#stemming
stemmed_articles <- journalarticles_data %>%
  unnest_tokens(output = word, input = title) %>%
  anti_join(stop_words, by = "word") %>%
  mutate(stem = wordStem(word))
stemmed_articles
## # A tibble: 8,291 x 4
## id abstract word stem
## <chr> <chr> <chr> <chr>
## 1 21449 "A filter for universal real-time prediction of band-l~ univers~ unive~
## 2 21449 "A filter for universal real-time prediction of band-l~ negative neg
## 3 21449 "A filter for universal real-time prediction of band-l~ delay delai
## 4 21449 "A filter for universal real-time prediction of band-l~ filter filter
## 5 21449 "A filter for universal real-time prediction of band-l~ predict~ predi~
## 6 21449 "A filter for universal real-time prediction of band-l~ band band
## 7 21449 "A filter for universal real-time prediction of band-l~ limited limit
## 8 21449 "A filter for universal real-time prediction of band-l~ signals signal
## 9 21555 "In this article we describe vector bundles over proje~ vector vector
## 10 21555 "In this article we describe vector bundles over proje~ bundles bundl
## # ... with 8,281 more rows
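The stemmed tokens were then cast into a new document-term matrix. Since that chunk is not echoed in this document, here is a minimal sketch of what likely produced the printout below (the object name stemmed_dtm is my assumption):
# sketch: cast a DTM from the stemmed tokens, for comparison with articles_dtm
stemmed_dtm <- stemmed_articles %>%
  count(id, stem) %>%
  cast_dtm(id, stem, n)
stemmed_dtm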
## <<DocumentTermMatrix (documents: 1148, terms: 2805)>>
## Non-/sparse entries: 8121/3212019
## Sparsity : 100%
## Maximal term length: 23
## Weighting : term frequency (tf)
## <<DocumentTermMatrix (documents: 1148, terms: 3566)>>
## Non-/sparse entries: 8162/4085606
## Sparsity : 100%
## Maximal term length: 23
## Weighting : term frequency (tf)
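The stem frequencies below likewise come from a chunk that is not echoed; a sketch of the likely call:
# sketch: count stem frequencies across all titles
stemmed_articles %>%
  count(stem, sort = TRUE)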
## # A tibble: 2,805 x 2
## stem n
## <chr> <int>
## 1 model 106
## 2 network 104
## 3 learn 94
## 4 base 65
## 5 neural 50
## 6 data 49
## 7 optim 47
## 8 estim 45
## 9 analysi 43
## 10 time 42
## # ... with 2,795 more rows
After performing the count, it can be observed that the most prevalent stems are "model", "network", "learn" and "neural". The results are not very different from the initial word count.
In the modeling phase of this study, I deploy both the LDA and STM algorithms to investigate patterns in the provided research titles.
In selecting the value for K, I went with 6, the number of distinct themes assigned to the research articles by the source. Recall that the key intention of the study is to examine how efficient topic modeling is in classifying topics and how well the identified topics align with the source themes: Computer Science, Physics, Mathematics, Statistics, Quantitative Biology and Quantitative Finance.
articles_lda <- LDA(articles_dtm,
                    k = 6,
                    control = list(seed = 200))
articles_lda
## A LDA_VEM topic model with 6 topics.
Fitting a Structural Topic Model
articles_stm <- stm(documents = docs,
                    data = meta,
                    vocab = vocab,
                    K = 6,
                    max.em.its = 5,  # cap EM iterations to keep runtime manageable
                    verbose = FALSE)
articles_stm
## A topic model with 6 topics, 1148 documents and a 3014 word dictionary.
As noted earlier, the stm package has a number of handy features. One of these is the plot.STM() function for viewing the most probable words assigned to each topic.
By default, it only shows the first 3 terms, so let's increase that to 5 to help with interpretation:
plot.STM(articles_stm, n = 5)
To check whether K = 6 is a reasonable choice, I also searched a range of candidate values for K with the ldatuning package:
k_metrics <- FindTopicsNumber(
  articles_dtm,
  topics = seq(10, 75, by = 5),
  metrics = "Griffiths2004",
  method = "Gibbs",
  control = list(),
  mc.cores = NA,
  return_models = FALSE,
  verbose = FALSE,
  libpath = NULL
)
FindTopicsNumber_plot(k_metrics)
I also used the LDAvis topic browser to explore the distribution of words across the emergent topics.
toLDAvis(mod = articles_stm, docs = docs)
There appears to be overlap between topics 1 and 5, as well as between topics 2 and 3. This can potentially indicate that selecting K = 6 was not an optimal choice.
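One way to probe such overlap beyond visual inspection (not done in this analysis) is the stm package's topic-correlation estimate; a minimal sketch:
# sketch: estimate between-topic correlations to corroborate the LDAvis overlap
topic_corr <- topicCorr(articles_stm)
plot(topic_corr)  # draws a graph connecting positively correlated topics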
In the explore phase, I examined β, the per-topic word probabilities, to see which words were most strongly associated with each identified topic. This stage was very useful for making sense of the topics, as my intention was to connect them to the pre-provided themes from the source dataset. I therefore explored the top 5 words assigned to each topic and made further interpretations based on them.
terms(articles_lda, 5)
## Topic 1 Topic 2 Topic 3 Topic 4 Topic 5
## [1,] "1" "estimation" "optimal" "linear" "networks"
## [2,] "approach" "based" "based" "data" "based"
## [3,] "analysis" "effect" "models" "quantum" "network"
## [4,] "equations" "model" "model" "functions" "time"
## [5,] "2" "random" "inference" "topological" "learning"
## Topic 6
## [1,] "learning"
## [2,] "deep"
## [3,] "neural"
## [4,] "networks"
## [5,] "classification"
tidy_lda <- tidy(articles_lda)
tidy_lda
## # A tibble: 21,396 x 3
## topic term beta
## <int> <chr> <dbl>
## 1 1 dependence 3.60e- 73
## 2 2 dependence 4.29e- 3
## 3 3 dependence 7.46e- 54
## 4 4 dependence 2.11e- 5
## 5 5 dependence 4.38e- 21
## 6 6 dependence 1.05e- 22
## 7 1 dissecting 2.10e-295
## 8 2 dissecting 7.18e- 4
## 9 3 dissecting 1.88e-295
## 10 4 dissecting 1.03e-296
## # ... with 21,386 more rows
top_terms <- tidy_lda %>%
  group_by(topic) %>%
  slice_max(beta, n = 5, with_ties = FALSE) %>%
  ungroup() %>%
  arrange(topic, -beta)
top_terms %>%
  mutate(term = reorder_within(term, beta, topic)) %>%
  ggplot(aes(beta, term, fill = as.factor(topic))) +
  geom_col(show.legend = FALSE) +
  scale_y_reordered() +
  labs(title = "Top 5 terms in each LDA topic",
       x = expression(beta), y = NULL) +
  facet_wrap(~ topic, ncol = 4, scales = "free")
td_beta <- tidy(articles_lda)
td_gamma <- tidy(articles_lda, matrix = "gamma")
td_gamma
## # A tibble: 6,888 x 3
## document topic gamma
## <chr> <int> <dbl>
## 1 20979 1 0.00888
## 2 20995 1 0.00888
## 3 21005 1 0.788
## 4 21012 1 0.880
## 5 21018 1 0.00544
## 6 21028 1 0.0105
## 7 21034 1 0.970
## 8 21039 1 0.00603
## 9 21040 1 0.0105
## 10 21063 1 0.00675
## # ... with 6,878 more rows
top_terms <- td_beta %>%
  arrange(beta) %>%
  group_by(topic) %>%
  top_n(7, beta) %>%
  arrange(-beta) %>%
  select(topic, term) %>%
  summarise(terms = list(term)) %>%
  mutate(terms = map(terms, paste, collapse = ", ")) %>%
  unnest(cols = c(terms))  # name the column explicitly to avoid the deprecation warning
gamma_terms <- td_gamma %>%
  group_by(topic) %>%
  summarise(gamma = mean(gamma)) %>%
  arrange(desc(gamma)) %>%
  left_join(top_terms, by = "topic") %>%
  mutate(topic = paste0("Topic ", topic),
         topic = reorder(topic, gamma))
gamma_terms %>%
  select(topic, gamma, terms) %>%
  kable(digits = 3,
        col.names = c("Topic", "Expected topic proportion", "Top 7 terms"))
| Topic | Expected topic proportion | Top 7 terms |
|---|---|---|
| Topic 5 | 0.180 | networks, based, network, time, learning, optimization, data |
| Topic 6 | 0.173 | learning, deep, neural, networks, classification, machine, matter |
| Topic 4 | 0.168 | linear, data, quantum, functions, topological, spin, statistical |
| Topic 2 | 0.164 | estimation, based, effect, model, random, layer, density |
| Topic 3 | 0.161 | optimal, based, models, model, inference, gaussian, multiple |
| Topic 1 | 0.154 | 1, approach, analysis, equations, 2, solutions, performance |
This independent analysis had the primary intention of examining the efficiency of topic modeling in classifying research articles based on their titles. The findings reveal that some topics were straightforward to interpret, while others required connections to context to make sense of them. Topic 1 surfaced terms about approaches, analyses and equations, which can be interpreted as a Mathematics theme; this is in line with one of the pre-provided themes of the articles. Topic 3 contained words such as "models" and "inference", which align with Statistics, another of the pre-provided themes from the source corpus. Topic 4 was also straightforward, with words such as "linear", "quantum" and "topological" that, with context, can be linked to Physics, again one of the source themes. Topic 6, dominated by "learning", "deep" and "neural", maps cleanly onto Computer Science. Apart from these four topics that were direct and insightful to interpret, I found Topic 2 slightly ambiguous. Here, sense-making involved connecting to context, and I would tentatively read it as a classification of topics linked to Quantitative Finance. However, based on the LDAvis explorer, I noticed that Topic 2 overlapped with Topic 3, and judging by the emergent words it can be challenging to make interpretations that clearly distinguish these two topics. This issue of crosscutting themes was also observed in a similar study by Muchene and Safari (2021) and is to be expected in multi-disciplinary collections.
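A more systematic version of this comparison (not performed in this analysis) would assign each document its most probable topic and cross-tabulate against the source discipline. A minimal sketch, assuming the discipline labels were available in a hypothetical labeled_data tibble with a label column (the read_csv() call above loaded only id, title and abstract):
# hypothetical sketch: cross-tabulate modal topics against source labels
doc_topics <- td_gamma %>%
  group_by(document) %>%
  slice_max(gamma, n = 1, with_ties = FALSE) %>%  # modal topic per title
  ungroup()
doc_topics %>%
  left_join(labeled_data, by = c("document" = "id")) %>%  # labeled_data is assumed
  count(topic, label) %>%
  pivot_wider(names_from = label, values_from = n, values_fill = 0)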
To conclude, based on the findings, topic modeling shows great potential for simplifying the classification of journal articles. LDA, as an unsupervised algorithm, performed well in synthesizing the topics according to the prevalent themes: four of the six emergent topics aligned with the source themes. I noticed that in these latent topics, domain-specific terms such as "quantum" or "neural" helped in classifying the text. The other two topics (5 and 2) needed further contextual sense-making based on familiarity with the corpus.
Limitations and Further Work
The main limitation on my end was the computational power of my machine; this is the main reason for mining topics from the research titles instead of the abstracts. I believe the modeling process would have revealed more robust results if a more context-rich corpus had been used.
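As a pointer for that further work, the pipeline above could be re-run on the abstracts with a one-line change to the tokenization step; a sketch:
# sketch of further work: tokenize abstracts instead of titles
abstracts_tidy <- journalarticles_data %>%
  unnest_tokens(output = word, input = abstract) %>%
  anti_join(stop_words, by = "word")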
References
Chaitanya, V., & Singh, P. K. (2017, November). Research articles suggestion using topic modelling. In 2017 IEEE 4th International Conference on Soft Computing & Machine Intelligence (ISCMI) (pp. 178-182). IEEE.
Krumm, A., Means, B., & Bienkowski, M. (2018). Learning analytics goes to school: A collaborative approach to improving education. Routledge.
Muchene, L., & Safari, W. (2021). Two-stage topic modelling of scientific publications: A case study of University of Nairobi, Kenya. PloS one, 16(1), e0243208.