The main objective of this independent analysis is to investigate how text network analysis can support the identification of keyword networks in collections of interdisciplinary scientific documents such as journal articles. The motivation comes from empirical studies that have used text network analysis to identify co-occurring words and to analyze research trends in academic papers. For instance, Li et al. (2016) and Park (2019) used text network analysis to examine co-keyword networks and keywords co-occurrence networks in academic articles in nursing and physics.
Most of these text network studies drew their papers and articles from a single discipline. For instance, Park (2019) focused on nursing, which made the results specific to that area. In this analysis, I take a different approach and further explore the interdisciplinary dataset from Kaggle that I used in the previous analysis. The dataset hosts a collection of journal abstracts from six key areas, including Computer Science, Physics, Mathematics, Quantitative Biology and Quantitative Finance. I will investigate whether text network analysis can provide insight into the keyword networks that are prevalent across the six represented disciplines.
The main research question guiding this analysis is: How can bigrams support the understanding of trending keywords in a collection of interdisciplinary academic articles?
In this study, the dataset of interest is a collection of abstracts extracted from a corpus of journal articles from Kaggle. For the purpose of this study, I randomly selected a total of 1148 abstracts for further analysis.
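The random selection described above can be sketched as follows. This is a minimal illustration only: the seed, the object names `all_abstracts` and `sampled_abstracts`, and the placeholder text are assumptions, and the code is not the procedure actually used to build the dataset.

```r
library(dplyr)
library(tibble)

# Hypothetical stand-in for the full Kaggle corpus (placeholder text only)
all_abstracts <- tibble(abstract = paste("placeholder abstract", 1:5000))

# Fixing a seed (assumed value) makes the random draw reproducible
set.seed(42)

# Draw 1148 abstracts without replacement
sampled_abstracts <- all_abstracts %>%
  slice_sample(n = 1148)

nrow(sampled_abstracts)
```

Recording the seed alongside the sample size would allow the same subset to be drawn again later.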
library(dplyr)
library(tidytext)
library(tidyverse)
library(tidyr)
library(ggplot2)
library(igraph)
library(ggraph)
The first step in the wrangling process is to import the dataset into the R environment.
journalabstracts <- read_csv("data/journalabstracts.csv")
## Rows: 1148 Columns: 1
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): abstract
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
The dataset has a total of 1148 abstracts which will be the basis of text network analysis in this study.
Next, we will use some familiar tidytext functions for
tokenizing text. This time, however, we will tokenize the text into
bigrams instead of the unigrams (single words) used in the sentiment
analysis unit. The same approach also works for trigrams and longer n-grams.
# tokenizing text by bigrams
ja_bigrams <- journalabstracts %>%
  unnest_tokens(bigram, abstract, token = "ngrams", n = 2)
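As noted above, the same tokenizer can produce trigrams by changing `n`. The sketch below illustrates this on two made-up sentences; `toy_abstracts` and its text are hypothetical and do not come from the dataset.

```r
library(dplyr)
library(tibble)
library(tidytext)

# Hypothetical toy text standing in for the abstracts
toy_abstracts <- tibble(abstract = c(
  "deep neural networks learn features",
  "we train deep neural networks"
))

# Same call as for bigrams, but with n = 3 to obtain trigrams
toy_trigrams <- toy_abstracts %>%
  unnest_tokens(trigram, abstract, token = "ngrams", n = 3)

# Count how often each trigram occurs
trigram_counts <- toy_trigrams %>%
  count(trigram, sort = TRUE)

trigram_counts
```

Each five-word sentence yields three trigrams, and "deep neural networks" is the only trigram shared by both sentences.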
The tokenization resulted in 180,286 bigrams. The next step is to count the bigrams and identify the most common ones.
#count bigrams
ja_bigrams %>%
count(bigram, sort = TRUE)
## # A tibble: 97,280 × 2
## bigram n
## <chr> <int>
## 1 of the 1897
## 2 in the 865
## 3 to the 522
## 4 on the 496
## 5 in this 399
## 6 for the 398
## 7 that the 397
## 8 and the 381
## 9 of a 380
## 10 can be 351
## # … with 97,270 more rows
The most common bigrams are combinations of stop words, which add noise to the data. Removing stop words is therefore an important step in paring down the dataset.
# separating bigrams in order to remove stop words
bigrams_separated <- ja_bigrams %>%
  separate(bigram, c("word1", "word2"), sep = " ")
# keeping only bigrams in which neither word is a stop word
bigrams_filtered <- bigrams_separated %>%
  filter(!word1 %in% stop_words$word) %>%
  filter(!word2 %in% stop_words$word)
bigram_counts <- bigrams_filtered %>%
count(word1, word2, sort = TRUE)
bigram_counts
## # A tibble: 35,794 × 3
## word1 word2 n
## <chr> <chr> <int>
## 1 neural networks 75
## 2 neural network 74
## 3 machine learning 72
## 4 deep learning 48
## 5 real world 48
## 6 time series 37
## 7 dark matter 36
## 8 deep neural 35
## 9 real time 31
## 10 low rank 30
## # … with 35,784 more rows
#reuniting filtered words to reform bigrams
bigrams_united <- bigrams_filtered %>%
unite(bigram, word1, word2, sep = " ")
bigrams_united
## # A tibble: 45,797 × 1
## bigram
## <chr>
## 1 universal real
## 2 real time
## 3 time prediction
## 4 band limited
## 5 limited signals
## 6 filter consists
## 7 multiple time
## 8 time delayed
## 9 delayed feedback
## 10 feedback terms
## # … with 45,787 more rows
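# plotting the 20 most frequent bigrams
bigrams_united %>%
  count(bigram, sort = TRUE) %>%
  top_n(20) %>%
  # map the precomputed count n to bar length; geom_bar() would count rows instead
  ggplot(aes(x = reorder(bigram, n), y = n)) +
  geom_col(fill = "#de5833") +
  theme_minimal() +
  coord_flip() +
  labs(title = "Top Bigrams of Interdisciplinary Journal Abstracts", x = "bigram")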
## Selecting by n
bigram_graph <- bigram_counts %>%
graph_from_data_frame()
bigram_graph
## IGRAPH 74446e6 DN-- 11575 35794 --
## + attr: name (v/c), n (e/n)
## + edges from 74446e6 (vertex names):
## [1] neural ->networks neural ->network machine ->learning
## [4] deep ->learning real ->world time ->series
## [7] dark ->matter deep ->neural real ->time
## [10] low ->rank magnetic ->field monte ->carlo
## [13] proposed ->method convolutional->neural 1 ->2
## [16] data ->sets experimental ->results reinforcement->learning
## [19] wide ->range power ->law proposed ->approach
## [22] frac ->1 lower ->bound gradient ->descent
## + ... omitted several edges
bigram_graph_filtered <- bigram_counts %>%
filter(n > 10) %>%
graph_from_data_frame()
bigram_graph_filtered
## IGRAPH 5ee3fb5 DN-- 99 69 --
## + attr: name (v/c), n (e/n)
## + edges from 5ee3fb5 (vertex names):
## [1] neural ->networks neural ->network machine ->learning
## [4] deep ->learning real ->world time ->series
## [7] dark ->matter deep ->neural real ->time
## [10] low ->rank magnetic ->field monte ->carlo
## [13] proposed ->method convolutional->neural 1 ->2
## [16] data ->sets experimental ->results reinforcement->learning
## [19] wide ->range power ->law proposed ->approach
## [22] frac ->1 lower ->bound gradient ->descent
## + ... omitted several edges
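The effect of the frequency threshold on the size of the graph can be illustrated on a small scale. The sketch below uses a made-up count table (`toy_counts`, with hypothetical values and a threshold of 2 instead of 10) and reports how many words and bigrams remain after filtering.

```r
library(dplyr)
library(tibble)
library(igraph)

# Hypothetical bigram counts; the values are illustrative only
toy_counts <- tribble(
  ~word1,    ~word2,     ~n,
  "neural",  "networks", 5,
  "deep",    "neural",   4,
  "machine", "learning", 3,
  "rare",    "pair",     1
)

# Keep bigrams above the (assumed) threshold, then build a directed graph;
# the remaining column n becomes an edge attribute
toy_graph <- toy_counts %>%
  filter(n > 2) %>%
  graph_from_data_frame()

vcount(toy_graph)  # number of distinct words kept as nodes
ecount(toy_graph)  # number of bigrams kept as edges
```

Raising the threshold trades coverage for readability: fewer nodes and edges are drawn, but rare pairings drop out of the network.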
set.seed(100)
a <- grid::arrow(type = "closed", length = unit(.2, "inches"))
ggraph(bigram_graph_filtered, layout = "fr") +
geom_edge_link(aes(edge_alpha = n), show.legend = FALSE,
arrow = a, end_cap = circle(.07, 'inches')) +
geom_node_point(color = "#de5833", size = 2) +
geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
theme_void()
The main purpose of this independent analysis was to investigate the
use of text networks in identifying frequent interdisciplinary terms
in a collection of scientific journal abstracts. The results indicate
that most of the bigram networks involve analytics-based terminology
that cuts across disciplines. Because of the large volume of bigrams
(45,797 after stop-word removal), I decided to visualize only the
bigrams that occurred more than 10 times. Relationships appear around
terms such as “machine learning”, “algorithms” and “experiments”, which
could indicate the prevalent techniques and methodologies used for
in-depth analysis. Another cluster contains bigrams such as “real world”,
“real time” and “time series”, which could indicate the nature of the
datasets that were used. Furthermore, there are terms such as
“magnetic field”, “monte carlo” and “orbit”, which are more closely
related to Physics.
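The clusters described above were read off the network plot. One possible way to list such groups programmatically is to extract the connected components of the graph. The sketch below is an assumption about how that could be done, not part of the original analysis, and `toy_edges` is a hypothetical edge list built from a few of the bigrams mentioned above.

```r
library(igraph)

# Hypothetical edge list using a few bigrams mentioned in the discussion
toy_edges <- data.frame(
  word1 = c("real", "real", "time", "machine", "deep"),
  word2 = c("world", "time", "series", "learning", "learning")
)

toy_graph <- graph_from_data_frame(toy_edges)

# Weakly connected components ignore edge direction when grouping words
comp <- components(toy_graph, mode = "weak")

# Attach vertex names to the component labels and list the words per group
membership <- setNames(comp$membership, V(toy_graph)$name)
split(names(membership), membership)
```

Here the words linked through "real" and "time" fall into one group and those linked through "learning" into another, which mirrors the kind of clusters visible in the plot.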
In conclusion, text network analysis shows considerable potential for
tracking research trends in journal articles, and it can also support
a holistic understanding of key terms and their relationships.
Li, H., An, H., Wang, Y., Huang, J., & Gao, X. (2016). Evolutionary features of academic articles co-keyword network and keywords co-occurrence network: Based on two-mode affiliation network. Physica A: Statistical Mechanics and its Applications, 450, 657-669.