1 Introduction

Evaluating and interpreting scientific output helps to identify prominent authors, active departments and hot topics in a specific field. This kind of analysis is called scientometrics or bibliometrics. Scientometrics refers to “all quantitative aspects science and scientific research” (Sengupta 1992), whereas bibliometrics refers to “the application of mathematics and statistical methods to books and other forms of written communication” (Pritchard 1969). Published documents can be analyzed statistically and visualized with the R package bibliometrix, created and developed by Massimo Aria (https://masimoaria.com) and Corrado Coccurullo (https://www.corradococcurullo.com).

Our purpose is to investigate how RStudio is applied in scientific papers indexed in the PubMed database. To do so, we used the bibliometrix package in RStudio.

2 Call required packages

First, we install and load all the packages required for the analysis.

install.packages("bibliometrix")
install.packages("kableExtra")
install.packages("dplyr")
install.packages("ggplot2")
install.packages("reshape2")
install.packages("pubmedR")

options(scipen = 999)
library(bibliometrix)
library(kableExtra)
library(dplyr)
library(ggplot2)
library(reshape2)
library(pubmedR)

3 Search strategy

The search strategy is based on a predefined search text.

An API key makes searches faster and more reliable. Nevertheless, NULL can be supplied instead of a specific key.

Because “RStudio” and “R programming language” are often used interchangeably, both terms are included in the search strategy. Running the search returned 628 documents, including articles, book chapters, conference papers and other document types.
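The search text itself is not shown in this report, so the following is only a minimal sketch of how such a query could be assembled. The two search terms and the `[Title/Abstract]` field tag are assumptions for illustration, not the query actually used here. The count request is wrapped in a guard because it needs the pubmedR package and an internet connection.

```r
# Hypothetical search terms: the original search text is not given in this report
terms <- c("rstudio", "r programming language")

# Combine the terms with OR, restricting each one to titles and abstracts
query <- paste0(
  "(",
  paste(sprintf('"%s"[Title/Abstract]', terms), collapse = " OR "),
  ")"
)

# NULL works without a key; a personal NCBI API key allows faster requests
api_key <- NULL

# Ask PubMed how many records match (only in an interactive session with pubmedR installed)
if (interactive() && requireNamespace("pubmedR", quietly = TRUE)) {
  res <- pubmedR::pmQueryTotalCount(query = query, api_key = api_key)
  res$total_count
}
```

The objects `query` and `res` created this way play the same role as the `query` and `res$total_count` used in the download step.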

This data set must now be converted to a data frame for statistical analysis. The following commands do this.

D <- pmApiRequest(query = query, limit = res$total_count, api_key = NULL)

## Documents  200  of  628 
## Documents  400  of  628 
## Documents  600  of  628 
## Documents  628  of  628

M <- pmApi2df(D)

## ================================================================================

M <- convert2df(D, dbsource = "pubmed", format = "api")

## 
## Converting your pubmed collection into a bibliographic dataframe
## 
## ================================================================================
## Done!

The following commands give an overview of the data. The summary can be gathered in a table suitable for HTML output; attributes such as cell position and alignment are set through the arguments of the kable function.

results <- biblioAnalysis(M)

Sometimes researchers prefer to restrict the analysis to a specific document type, such as journal articles only. In addition, since RStudio was first released in 2011, the search is limited to documents published from 2011 onward.

M <- filter(M, DT == "JOURNAL ARTICLE", PY >= 2011)
results <- biblioAnalysis(M)

a <- summary(results)

knitr::kable(a$MainInformationDF, caption = "Main information of articles",align = "llccl", format = "html") %>% 
    kable_classic(full_width = F, position = "center")

Main information of articles
Description	Results
MAIN INFORMATION ABOUT DATA
Timespan	2011:2022
Sources (Journals, Books, etc)	402
Documents	587
Annual Growth Rate %	34.16
Document Average Age	2.61
Average citations per doc	0
Average citations per year per doc	0
References	1
DOCUMENT TYPES
journal article	587
DOCUMENT CONTENTS
Keywords Plus (ID)	1382
Author’s Keywords (DE)	1887
AUTHORS
Authors	3058
Author Appearances	3460
Authors of single-authored docs	22
AUTHORS COLLABORATION
Single-authored docs	23
Documents per Author	0.192
Co-Authors per Doc	5.89
International co-authorships %	0

4 General information about scientific documents

Here, we look at some tables and plots extracted from the data set produced by our search strategy.

4.1 Scientific documents production year by year

knitr::kable(a$AnnualProduction, caption = 
           "Annual Production of Scientific Documents",
           align = "cc", format = "html") %>%
           kable_classic(full_width = F, position = "center")

Annual Production of Scientific Documents
Year	Articles
2011	3
2012	6
2013	4
2014	21
2015	19
2016	24
2017	35
2018	34
2019	72
2020	139
2021	154
2022	76

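Based on this table, the number of published documents has increased sharply in recent years, from 72 in 2019 to 139 in 2020 and 154 in 2021, possibly because of the COVID-19 pandemic. The count for 2022 (76) covers only part of the year, since this report is dated 2022-09-05.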
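As a rough check, the annual growth rate reported in the main information table (34.16 %) can be recomputed from the yearly counts. The sketch below assumes the rate is a compound annual growth rate between the first and the last year; the variable names are illustrative only.

```r
# Yearly article counts copied from the annual production table
production <- data.frame(
  year     = 2011:2022,
  articles = c(3, 6, 4, 21, 19, 24, 35, 34, 72, 139, 154, 76)
)

# Number of year-to-year steps between the first and the last year
n_steps <- nrow(production) - 1

# Compound annual growth rate, in percent
first_count <- production$articles[1]
last_count  <- production$articles[nrow(production)]
growth <- ((last_count / first_count)^(1 / n_steps) - 1) * 100

round(growth, 2)
```

Because the rate depends only on the first and last years, the partial count for 2022 pulls it down considerably.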

4.2 Top 10 Authors of PubMed papers analyzed by RStudio

Let us look at the top 10 authors, ranked both by number of articles and by fractionalized number of articles.

knitr::kable(a$MostProdAuthors, caption = "Top 10 Authors", align = "lclc", format = "html") %>% kable_classic(full_width = F, position = "center")

Top 10 Authors
Authors	Articles	Authors	Articles Fractionalized
WANG Z	9	OH KK	3.25
LIU Y	8	ADNAN M	2.25
OH KK	8	CHO DH	2.25
WANG Y	8	HU K	2.25
ADNAN M	7	TENAN MS	1.50
CHO DH	7	WANG Y	1.40
WANG C	7	LIU Y	1.20
WANG H	7	YANG J	1.17
XU Y	7	WANG Z	1.08
ZHANG X	7	WANG X	1.04
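The difference between the two rankings comes from how co-authored papers are counted. The sketch below uses a small made-up set of author lists (not taken from this data set) and assumes the usual fractional counting rule, in which a paper with k authors gives each of them a credit of 1/k.

```r
# Hypothetical author lists for three documents
docs <- list(
  c("A", "B"),
  c("A"),
  c("A", "B", "C", "D")
)

# One entry per author appearance, with the matching fractional credit 1/k
authors <- unlist(docs)
credit  <- unlist(lapply(docs, function(a) rep(1 / length(a), length(a))))

# Full counting: every appearance counts as one article
articles <- table(authors)

# Fractional counting: credits are summed per author
articles_fractionalized <- tapply(credit, authors, sum)

articles
articles_fractionalized
```

Under this rule the fractional credits add up to the number of documents, so an author with many large-team papers can rank high by articles but lower by fractionalized articles.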

4.3 Top 10 most cited papers of PubMed papers analyzed by RStudio

knitr::kable(a$MostCitedPapers[,1:2], caption = "10 Most Cited Papers",  
             align = "ll",format = "html") %>%  
    kable_classic(full_width = F, position = "center")

10 Most Cited Papers
Paper	DOI
HAILE TG, 2022, INT HEALTH	10.1093/inthealth/ihac060
LORTIE CJ, 2022, ECOL EVOL	10.1002/ece3.9245
ZHANG D, 2022, FRONT ONCOL	10.3389/fonc.2022.978427
MA K, 2022, HELIYON	10.1016/j.heliyon.2022.e10298
ISLAM MA, 2022, ANTIBIOTICS (BASEL)	10.3390/antibiotics11081012
RIPON RK, 2022, PLOS ONE	10.1371/journal.pone.0272905
MA K, 2022, J MOL NEUROSCI	10.1007/s12031-022-02060-4
PERGIALIOTIS V, 2022, CURR ONCOL	10.3390/curroncol29080454
LETH MF, 2022, ACTA ANAESTHESIOL SCAND	10.1111/aas.14140
HYZY M, 2022, JMIR MHEALTH UHEALTH	10.2196/37290

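The table above lists the papers returned as the 10 most cited. Note that the average number of citations per document in the main information table is 0, so this PubMed data set carries no usable citation counts and the ranking is not informative.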

4.4 Top 10 journals of PubMed papers analyzed by RStudio

Here, we can see the top 10 journals by number of published documents. PLOS ONE (21 articles) and BIOINFORMATICS (19 articles) are the most prominent in this field.

knitr::kable(a$MostRelSources, caption = "Top 10 Journals", align =       
        "lc", format = "html") %>% kable_classic(full_width = F, position = "center")

Top 10 Journals
Sources	Articles
PLOS ONE	21
BIOINFORMATICS (OXFORD ENGLAND)	19
METHODS IN MOLECULAR BIOLOGY (CLIFTON N.J.)	11
BMC BIOINFORMATICS	8
F1000RESEARCH	8
BIOMED RESEARCH INTERNATIONAL	7
INTERNATIONAL JOURNAL OF ENVIRONMENTAL RESEARCH AND PUBLIC HEALTH	7
CUREUS	6
DATA IN BRIEF	6
FRONTIERS IN GENETICS	6

4.5 Top 10 keywords: DE and ID of PubMed papers analyzed by RStudio

Here, we can see the top 10 keywords by the number of documents in which they appear. Note that there are two types of keywords, which we examine separately: DE denotes keywords extracted from the articles themselves (author keywords), and ID denotes Keywords Plus, described here as keywords extracted from the references of the articles.

knitr::kable(a$MostRelKeywords, caption = "Top 10 Keywords", align = "lclc", format = "html") %>%
     add_footnote(c("DE: Keywords Extracted from Articles",
                    "ID: Keywords Extracted from References of Articles"), notation = "alphabet") %>%
     kable_classic(position = "center")

Top 10 Keywords
Author Keywords (DE)	Articles	Keywords-Plus (ID)	Articles
META-ANALYSIS	19	HUMANS	307
COVID-19	18	FEMALE	89
R PROGRAMMING LANGUAGE	18	MALE	84
R	17	SOFTWARE	82
RSTUDIO	15	ADULT	53
PROGNOSIS	13	MIDDLE AGED	47
MACHINE LEARNING	11	AGED	41
BIOINFORMATICS	9	COMPUTATIONAL BIOLOGY	41
CANCER	9	ANIMALS	37
BIBLIOMETRIC	7	RETROSPECTIVE STUDIES	36
^a DE: Keywords Extracted from Articles
^b ID: Keywords Extracted from References of Articles
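Keyword frequencies of this kind can also be counted by hand. The sketch below assumes that each record stores its keywords in a single string separated by semicolons, as in the DE column of a bibliometrix data frame; the four example records are made up.

```r
# Hypothetical DE values for four records (NA means the record has no keywords)
de <- c("COVID-19;META-ANALYSIS", "R;COVID-19", NA, "META-ANALYSIS; RSTUDIO")

# Split on the separator, drop missing records and trim stray spaces
keywords <- trimws(unlist(strsplit(de[!is.na(de)], ";", fixed = TRUE)))

# Count occurrences and sort from most to least frequent
keyword_freq <- sort(table(keywords), decreasing = TRUE)

head(keyword_freq, 10)
```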

4.6 Top 10 authors and their timeline production

res <- authorProdOverTime(M, k=10)

knitr::kable(res$dfAU[1:3], caption = "Annual production of the top 10 authors", format = "html") %>% kable_classic(position = "center")

Annual production of the top 10 authors
Author	year	freq
ADNAN M	2020	4
ADNAN M	2021	1
ADNAN M	2022	2
CHO DH	2020	4
CHO DH	2021	1
CHO DH	2022	2
LIU Y	2017	1
LIU Y	2020	5
LIU Y	2021	2
OH KK	2020	4
OH KK	2021	2
OH KK	2022	2
WANG C	2015	1
WANG C	2019	2
WANG C	2020	1
WANG C	2021	3
WANG H	2014	1
WANG H	2019	1
WANG H	2020	1
WANG H	2021	3
WANG H	2022	1
WANG Y	2017	1
WANG Y	2019	1
WANG Y	2020	3
WANG Y	2021	2
WANG Y	2022	1
WANG Z	2019	1
WANG Z	2020	2
WANG Z	2021	4
WANG Z	2022	2
XU Y	2016	1
XU Y	2017	1
XU Y	2019	1
XU Y	2020	2
XU Y	2021	2
ZHANG X	2020	4
ZHANG X	2021	2
ZHANG X	2022	1

4.7 Production timeline of the top journal of PubMed papers analyzed by RStudio

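# Yearly (non-cumulative) output of the single most productive journal
topSO <- sourceGrowth(M, top = 1, cdf = FALSE)
DF <- melt(topSO, id = 'Year')
ggplot(DF, aes(Year, value, group = variable, color = variable)) + geom_line()

# The same data for the three most productive journals
topSO <- sourceGrowth(M, top = 3, cdf = FALSE)
DF <- melt(topSO, id = 'Year')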

4.8 Some Information about Top 10 Authors

DF=dominance(results)
knitr::kable(DF, caption = "Some Information about Top 10 Authors", digits = 3, align = "lccccccc", format = "html") %>%
    kable_classic(position = "center")

Some Information about Top 10 Authors
Author	Dominance Factor	Tot Articles	Single-Authored	Multi-Authored	First-Authored	Rank by Articles	Rank by DF
OH KK	1.000	8	1	7	7	2	1
WANG X	0.333	6	0	6	2	9	2
WANG C	0.286	7	0	7	2	5	3
WANG H	0.286	7	0	7	2	5	3
LI H	0.200	5	0	5	1	10	5
XU Y	0.143	7	0	7	1	5	6
ZHANG X	0.143	7	0	7	1	5	6
LIU Y	0.125	8	0	8	1	2	8
WANG Y	0.125	8	0	8	1	2	8
WANG Z	0.111	9	0	9	1	1	10

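The dominance factor is the ratio of an author's first-authored multi-authored papers to all of that author's multi-authored papers. For example, OH KK is first author on all 7 multi-authored papers (7/7 = 1.000), whereas WANG Z is first author on 1 of 9 (1/9 = 0.111).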
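The calculation can be reproduced from the table columns. The sketch below copies three rows of the table and assumes the First-Authored column counts first-authored papers among the multi-authored ones, which is consistent with the reported values.

```r
# Three rows copied from the table above
authors <- data.frame(
  author         = c("OH KK", "WANG X", "WANG Z"),
  multi_authored = c(7, 6, 9),
  first_authored = c(7, 2, 1)
)

# Dominance factor: first-authored multi-authored papers / all multi-authored papers
authors$dominance_factor <- authors$first_authored / authors$multi_authored

# Rounded to three digits, these agree with the 1.000, 0.333 and 0.111 in the table
round(authors$dominance_factor, 3)
```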

4.9 Top countries based on frequency of publications in their journals

knitr::kable(head(sort(table(M$SO_CO), decreasing = TRUE), 10), caption = "Top Countries based on Frequency of published articles in Journals", col.names = c("Country", "Frequency"), align = "lc", format = "html") %>%
        kable_classic(full_width = F, position = "center" )

Top Countries based on Frequency of published articles in Journals
Country	Frequency
UNITED STATES	194
ENGLAND	165
SWITZERLAND	76
NETHERLANDS	42
GERMANY	14
CANADA	13
CHINA	10
BRAZIL	7
NEW ZEALAND	7
GREECE	6

As can be seen, the United States (194) and England (165) are the two leading countries by number of articles published in their journals.

5 What does Lotka’s law tell us about this data set?

L=lotka(results)
lotkaTable=cbind(L$AuthorProd[,1],L$AuthorProd[,2],L$AuthorProd[,3],L$fitted)
knitr::kable(lotkaTable, caption = "Frequency Of Authors Based on Lotka's Law", digits = 3, align = "cccc", format = "html",col.names = c("Number of article", "Number of authors", "Frequency based on data", "Frequency based on Lotka's law")) %>%
    kable_classic(full_width = F, position = "center")

Frequency Of Authors Based on Lotka’s Law
Number of article	Number of authors	Frequency based on data	Frequency based on Lotka’s law
1	2809	0.919	0.644
2	179	0.059	0.063
3	36	0.012	0.016
4	13	0.004	0.006
5	8	0.003	0.003
6	3	0.001	0.002
7	6	0.002	0.001
8	3	0.001	0.001
9	1	0.000	0.000

The p-value of the two-sample Kolmogorov-Smirnov test between the observed frequencies and the frequencies expected under Lotka’s law is 0.0366311. At the 0.05 significance level, this indicates that our data do not follow Lotka’s law.
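To make the fitted column easier to interpret, the sketch below re-estimates a Lotka-type curve directly from the table, assuming the form f(n) = C / n^beta fitted by linear regression on the log-log scale. This is an illustration under that assumption and not necessarily the exact procedure used inside lotka().

```r
# Number of articles and number of authors, copied from the table above
n_articles <- 1:9
n_authors  <- c(2809, 179, 36, 13, 8, 3, 6, 3, 1)

# Observed share of authors with n articles
observed <- n_authors / sum(n_authors)

# Fit log10(f) = log10(C) - beta * log10(n)
fit  <- lm(log10(observed) ~ log10(n_articles))
beta <- -unname(coef(fit)[2])
C    <- 10^unname(coef(fit)[1])

# Frequencies implied by the fitted curve
fitted_freq <- C / n_articles^beta

round(cbind(n_articles, observed, fitted_freq), 3)
```

The estimated exponent is well above the value of 2 in the classical inverse-square form of the law, which reflects the very large share (about 92 %) of authors with a single article.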

6 Collaboration networks for authors

The collaboration network of authors is plotted below. Similar networks can be plotted for keywords, universities and countries.

NetMatrix <- biblioNetwork(M, analysis = "collaboration", 
                         network = "authors", sep = ";")
net <- networkPlot(NetMatrix, n = 10, type = "auto", Title = "collaboration Network",labelsize=1, halo = TRUE)
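Conceptually, an author collaboration network can be derived from a document-by-author incidence matrix A as the product of its transpose with itself. The sketch below builds such a matrix in base R for three made-up documents; it illustrates the idea and does not reproduce the internals of biblioNetwork.

```r
# Hypothetical author lists for three documents
docs <- list(
  c("A", "B"),
  c("A", "C"),
  c("A", "B", "C")
)

# Document x author incidence matrix: 1 if the author signed the document
authors <- sort(unique(unlist(docs)))
A <- t(sapply(docs, function(a) as.integer(authors %in% a)))
colnames(A) <- authors

# Author x author matrix t(A) %*% A:
# off-diagonal cells count co-authored documents, the diagonal counts each author's documents
collab <- crossprod(A)

collab
```

In the plotted network, authors are the nodes and the off-diagonal counts are the edge weights.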

7 Thematic map

The thematic map is plotted from the author keywords (DE) as follows:

remove.terms.1word = c("aged","map","allergy","demand","rest","workflow","data collection","r","rstudio","data analysis","conservation","review","functional",
    "clinical","identification","data","analysis","network","systematic","r programming","r package","maternal","reproducibility","r language","methods","treatment","r programming language","sars-cov-2","retention","calcium","statistics","open source","quality","methodology","complications","statistical analysis","prognosis","algorithms","software")

synonyms1 <- c("covid-19;coronavirus","gene; genes", "prediction; predicting", "modeling; modelling; resting","emotion; emotional", "adhd; hyperactivity",
      "differentially expressed genes;differentially expressed")
tm1 = thematicMap(M, field = "DE",n.labels = 2, ngrams = 1, remove.terms = remove.terms.1word,synonyms = synonyms1)
plot(tm1$map)

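A thematic map is a plot divided into four quadrants: niche themes, motor themes, basic themes, and emerging or declining themes. The horizontal axis measures the centrality (relevance) of a theme and the vertical axis measures its density (degree of development). For more details, refer to Zhang et al. (2022).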

Motor Themes: Quadrant I, the upper-right quadrant, contains themes that are well developed and form important pillars that shape the field of research.

Niche Themes: Quadrant II, the upper-left quadrant, contains highly developed but isolated themes.

Emerging or declining Themes: Quadrant III, the lower-left quadrant, contains weakly developed themes that are marginal to the research field.

Basic Themes: Quadrant IV, the lower-right quadrant, contains themes that are less developed but important to the field of study.
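The quadrant definitions above can be expressed as a simple rule on centrality and density. In the sketch below, the theme names and values are made up and the quadrant boundaries are placed at the medians; the cut-offs actually used by thematicMap may differ.

```r
# Hypothetical themes with made-up centrality (x axis) and density (y axis) values
themes <- data.frame(
  theme      = c("theme 1", "theme 2", "theme 3", "theme 4"),
  centrality = c(0.9, 0.2, 0.1, 0.8),
  density    = c(0.8, 0.9, 0.2, 0.1)
)

# Assumed quadrant boundaries: the median of each axis
cx <- median(themes$centrality)
cy <- median(themes$density)

# Assign each theme to a quadrant
themes$quadrant <- with(themes, ifelse(
  centrality >= cx & density >= cy, "Motor",
  ifelse(centrality < cx & density >= cy, "Niche",
  ifelse(centrality < cx & density < cy, "Emerging or declining", "Basic"))
))

themes
```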

Among the motor themes, diseases such as obesity, COVID-19, schizophrenia, cancer and tuberculosis are well developed and frequently analyzed with RStudio. The same holds for statistical and analytical topics such as machine learning, PCA (principal component analysis), bibliometrics and bioinformatic analysis.

Among the basic themes, topics such as type 2 diabetes mellitus, stroke and differentially expressed genes deserve more analysis with RStudio than they currently receive. Meta-analysis, systematic review, network analysis and computational analysis are methods recommended for wider use.

Based on this map, the topics in the niche themes quadrant, such as natural language processing, text mining, pan-cancer and behavioral science, have been discussed intensively in the PubMed database but remain isolated. The themes in quadrant III, for example visualization and shiny, are emerging or declining themes.

Explanation: words that we do not want to appear in the map (remove.terms.1word) and groups of synonymous words (synonyms1) are defined before the map is built.

8 Associations among our information

Here, we can see the associations among authors, author keywords (DE) and journals.

threeFieldsPlot(M)

This plot shows how keywords, authors and journals are related to each other.

9 Thematic Evolution Plot

Here, we can see the evolution of topics in the field of RStudio applications, based on DE (author keywords) and TI (titles).

The plots show how themes have evolved over the years.

years=c(2019)

nexus <- thematicEvolution(M,field="DE",years=years,n=100,
          minFreq=3, ngrams = 1,remove.terms = remove.terms.1word,
          synonyms = synonyms1)
plotThematicEvolution(nexus$Nodes,nexus$Edges)

nexus <- thematicEvolution(M,field="TI",years=years,n=100,
          minFreq=3, ngrams = 2,remove.terms = remove.terms.1word,
          synonyms = synonyms1)

## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 1 rows [505].

plotThematicEvolution(nexus$Nodes,nexus$Edges)
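The `years` argument holds the cutting points that divide the collection into time slices. The sketch below shows one way such a split can be done in base R, assuming that a cutting year belongs to the earlier slice; the publication years are made up, and the exact boundary handling inside thematicEvolution is not verified here.

```r
# Hypothetical publication years for five documents
py <- c(2011, 2015, 2019, 2020, 2022)

# Cutting point used in this report
cut_years <- c(2019)

# Slice 1 holds years up to and including 2019, slice 2 holds the later years
slice <- findInterval(py, cut_years + 1) + 1

data.frame(py, slice)
```

With a single cutting point of 2019, the evolution plot compares two periods: 2011 to 2019 and 2020 to 2022.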

References:

Pritchard, A. 1969. “Statistical Bibliography or Bibliometrics.” Journal of Documentation 25 (4): 348–49.

Sengupta, I. N. 1992. “Bibliometrics, Informetrics, Scientometrics and Librametrics: An Overview.” Libri 42 (2): 75–98. https://doi.org/10.1515/libr.1992.42.2.75.

Zhang, Mingjie, Xiaoxue Wang, Xueting Chen, Zixuan Song, Yuting Wang, Yangzi Zhou, and Dandan Zhang. 2022. “A Scientometric Analysis and Visualization Discovery of Enhanced Recovery After Surgery.” Frontiers in Surgery 9. https://www.frontiersin.org/articles/10.3389/fsurg.2022.894083.

Scientometrics Analysis of RStudio applications in PubMed Database

Reza AA Khoei

2022-09-05