1 Introduction

Evaluation and interpretation of scientific productions can be so helpful in determination of prominent authors, active departments and hot topics in a specific field. This process is called scientometrics and bibliometrix. Scientometrics refers to “all quantitative aspects science and scientific research” (Sengupta 1992). On the other hand, Bibliometrics refers to “the application of mathematics and statistical methods to books and other forms of written communication” (Pritchard 1969). Visualization and Statistical methods of these published documents can be analyzed using R bibliometrix package. This package is created and developed by [Massimo Aria] (https://masimoaria.com) and [Corrado Coccurullo] (https://www.corradococcurullo.com).

Our purpose is the investigation of RStudio applications of published scientific papers of PubMed database. In order to, we used bibliometrix package in Rstudio.

2 Call required packages

First of all we should install and call all of required packages for our analysis.

install.packages("bibliometrix")
install.packages("kableExtra")
install.packages("dplyr")
install.packages("ggplot2")
install.packages("reshape2")
install.packages("pubmedR")
options(scipen = 999)
library(bibliometrix)
library(kableExtra)
library(dplyr)
library(ggplot2)
library(reshape2)
library(pubmedR)

3 Search strategy

The search strategy should be considered based on a predefined search text

An API key is required to better and faster searches.Nevertheless, NULL can be put instead of a specific value.

As, RStudio and R programming language are interchangeably used, then both of them are considered in search strategy. After determining the search strategy we searched and finally found 628 documents based on searched terms including articles, book chapters, conference papers and etc.

Now, this data set is necessary to be converted to data frame for statistical analyses. In order to, the following commands are used.

D <- pmApiRequest(query = query, res$total_count, api_key = NULL)
## Documents  200  of  628 
## Documents  400  of  628 
## Documents  600  of  628 
## Documents  628  of  628
M <- pmApi2df(D)
## ================================================================================
M <- convert2df(D, dbsource = "pubmed", format = "api")
## 
## Converting your pubmed collection into a bibliographic dataframe
## 
## ================================================================================
## Done!

Now we use the following commands to get an overview of the data. This information can be gattered in a table suitable for html files. Some attributes like cell positions, cell alignment and so on can be set with different arguments of kable function.

results <- biblioAnalysis(M)

Sometimes, researchers may prefer TO do their analysis in a specific type of document, as only articles. On the other hand, since Rstudio company have been found in 2011, searches are limited to after 2011.

M <- filter(M, M$DT == "JOURNAL ARTICLE" & M$PY >= 2011)
results <- biblioAnalysis(M)
a <- summary(results)
knitr::kable(a$MainInformationDF, caption = "Main information of articles",align = "llccl", format = "html") %>% 
    kable_classic(full_width = F, position = "center")
Main information of articles
Description Results
MAIN INFORMATION ABOUT DATA
Timespan 2011:2022
Sources (Journals, Books, etc) 402
Documents 587
Annual Growth Rate % 34.16
Document Average Age 2.61
Average citations per doc 0
Average citations per year per doc 0
References 1
DOCUMENT TYPES
journal article 587
DOCUMENT CONTENTS
Keywords Plus (ID) 1382
Author’s Keywords (DE) 1887
AUTHORS
Authors 3058
Author Appearances 3460
Authors of single-authored docs 22
AUTHORS COLLABORATION
Single-authored docs 23
Documents per Author 0.192
Co-Authors per Doc 5.89
International co-authorships % 0

4 General information about scientific documents

Here, we can look at some tables and plots which are distracted from data set based on our search strategy.

4.1 Scientific documents production year by year

knitr::kable(a$AnnualProduction, caption = 
           "Annualy Production for scientific Documents",
           align = "cc", format = "html") %>%
           kable_classic(full_width = F, position = "center")
Annualy Production for scientific Documents
Year Articles
2011 3
2012 6
2013 4
2014 21
2015 19
2016 24
2017 35
2018 34
2019 72
2020 139
2021 154
2022 76

Based on this table, it seems that number of published documents has increased in recent years. Maybe because of Covid-19 pandemi.

4.2 Top 10 Authors of PubMed papers analyzed by Rstudio

Lets take a look at top 10 authors and some indexes like number of articles and articles fractionalized.

knitr::kable(a$MostProdAuthors, caption = "Top 10 Authors", align = "lclc", format = "html") %>% kable_classic(full_width = F, position = "center")
Top 10 Authors
Authors Articles Authors Articles Fractionalized
WANG Z 9 OH KK 3.25
LIU Y 8 ADNAN M 2.25
OH KK 8 CHO DH 2.25
WANG Y 8 HU K 2.25
ADNAN M 7 TENAN MS 1.50
CHO DH 7 WANG Y 1.40
WANG C 7 LIU Y 1.20
WANG H 7 YANG J 1.17
XU Y 7 WANG Z 1.08
ZHANG X 7 WANG X 1.04

4.3 Top 10 most cited papers of PubMed papers analyzed by Rstudio

knitr::kable(a$MostCitedPapers[,1:2], caption = "10 Most Cited Papers",  
             align = "ll",format = "html") %>%  
    kable_classic(full_width = F, position = "center")
10 Most Cited Papers
Paper DOI
HAILE TG, 2022, INT HEALTH 10.1093/inthealth/ihac060
LORTIE CJ, 2022, ECOL EVOL 10.1002/ece3.9245
ZHANG D, 2022, FRONT ONCOL 10.3389/fonc.2022.978427
MA K, 2022, HELIYON 10.1016/j.heliyon.2022.e10298
ISLAM MA, 2022, ANTIBIOTICS (BASEL) 10.3390/antibiotics11081012
RIPON RK, 2022, PLOS ONE 10.1371/journal.pone.0272905
MA K, 2022, J MOL NEUROSCI 10.1007/s12031-022-02060-4
PERGIALIOTIS V, 2022, CURR ONCOL 10.3390/curroncol29080454
LETH MF, 2022, ACTA ANAESTHESIOL SCAND 10.1111/aas.14140
HYZY M, 2022, JMIR MHEALTH UHEALTH 10.2196/37290

10 most cited papers are seen in the following table.

4.4 Top 10 journals in of PubMed papers analyzed by Rstudio

Here, we can see top 10 journals based on their frequency of published documents. Apparently, PLOS ONE and BIOINFORMATICS are prominent in this field.

knitr::kable(a$MostRelSources, caption = "Top 10 Journals", align =       
        "lc", format = "html") %>% kable_classic(full_width = F, position = "center")
Top 10 Journals
Sources Articles
PLOS ONE 21
BIOINFORMATICS (OXFORD ENGLAND) 19
METHODS IN MOLECULAR BIOLOGY (CLIFTON N.J.) 11
BMC BIOINFORMATICS 8
F1000RESEARCH 8
BIOMED RESEARCH INTERNATIONAL 7
INTERNATIONAL JOURNAL OF ENVIRONMENTAL RESEARCH AND PUBLIC HEALTH 7
CUREUS 6
DATA IN BRIEF 6
FRONTIERS IN GENETICS 6

4.5 Top 10 keywords: DE and ID of PubMed papers analyzed by Rstudio

Here, we can see top 10 keywords based on their frequency of published documents. important We should notice that there are two types of keywords which we investigate them separately. DEs are keywords extracted from article. IDs are keywords of references of articles.

knitr::kable(a$MostRelKeywords, caption = "Top 10 Keywords", align = "lclc", format = "html") %>%
     add_footnote(c("DE: Keywords Extracted from Articles","ID: Keywords
     Extracted from References of Articles"), notation="alphabet") %>%  
     kable_classic(position = "center")
Top 10 Keywords
Author Keywords (DE) Articles Keywords-Plus (ID) Articles
META-ANALYSIS 19 HUMANS 307
COVID-19 18 FEMALE 89
R PROGRAMMING LANGUAGE 18 MALE 84
R 17 SOFTWARE 82
RSTUDIO 15 ADULT 53
PROGNOSIS 13 MIDDLE AGED 47
MACHINE LEARNING 11 AGED 41
BIOINFORMATICS 9 COMPUTATIONAL BIOLOGY 41
CANCER 9 ANIMALS 37
BIBLIOMETRIC 7 RETROSPECTIVE STUDIES 36
a DE: Keywords Extracted from Articles
b ID: Keywords
Extracted from References of Articles

4.6 Top 10 authors and their timeline production

res <- authorProdOverTime(M, k=10)

knitr::kable(res$dfAU[1:3], caption = "Top 10 authors and their timeline,as well annually production, total citations, total citations per year", format = "html") %>% kable_classic(position = "center")
Top 10 authors and their timeline,as well annually production, total citations, total citations per year
Author year freq
ADNAN M 2020 4
ADNAN M 2021 1
ADNAN M 2022 2
CHO DH 2020 4
CHO DH 2021 1
CHO DH 2022 2
LIU Y 2017 1
LIU Y 2020 5
LIU Y 2021 2
OH KK 2020 4
OH KK 2021 2
OH KK 2022 2
WANG C 2015 1
WANG C 2019 2
WANG C 2020 1
WANG C 2021 3
WANG H 2014 1
WANG H 2019 1
WANG H 2020 1
WANG H 2021 3
WANG H 2022 1
WANG Y 2017 1
WANG Y 2019 1
WANG Y 2020 3
WANG Y 2021 2
WANG Y 2022 1
WANG Z 2019 1
WANG Z 2020 2
WANG Z 2021 4
WANG Z 2022 2
XU Y 2016 1
XU Y 2017 1
XU Y 2019 1
XU Y 2020 2
XU Y 2021 2
ZHANG X 2020 4
ZHANG X 2021 2
ZHANG X 2022 1

4.7 Timeline production of best journal of PubMed papers analyzed by Rstudio

topSO=sourceGrowth(M, top=1, cdf=FALSE)
DF=melt(topSO, id='Year')
ggplot(DF,aes(Year,value, group=variable, color=variable))+geom_line()

topSO=sourceGrowth(M, top=3, cdf=FALSE)
DF=melt(topSO, id='Year')

4.8 Some Information about Top 10 Authors

DF=dominance(results)
knitr::kable(DF, caption = "Some Information about Top 10 Authors", digits = 3, align = "lccccccc", format = "html") %>%
    kable_classic(position = "center")       
Some Information about Top 10 Authors
Author Dominance Factor Tot Articles Single-Authored Multi-Authored First-Authored Rank by Articles Rank by DF
OH KK 1.000 8 1 7 7 2 1
WANG X 0.333 6 0 6 2 9 2
WANG C 0.286 7 0 7 2 5 3
WANG H 0.286 7 0 7 2 5 3
LI H 0.200 5 0 5 1 10 5
XU Y 0.143 7 0 7 1 5 6
ZHANG X 0.143 7 0 7 1 5 6
LIU Y 0.125 8 0 8 1 2 8
WANG Y 0.125 8 0 8 1 2 8
WANG Z 0.111 9 0 9 1 1 10

Dominance factor indicates the ratio of first authored papers to total of articles for top 10 authors.

4.9 Top countries based on frequency of publications in their journals

knitr::kable(head(sort(table(M$SO_CO),decreasing=TRUE),10), caption = "Top Countries based on Frequency of published articles in Journals", col.names =         c("Country", "Frequency"), align = "lc", format = "html") %>%
        kable_classic(full_width = F, position = "center" )
Top Countries based on Frequency of published articles in Journals
Country Frequency
UNITED STATES 194
ENGLAND 165
SWITZERLAND 76
NETHERLANDS 42
GERMANY 14
CANADA 13
CHINA 10
BRAZIL 7
NEW ZEALAND 7
GREECE 6

As can be seen, United states and England are two prominent countries based on publishing articles.

5 What does say Lotka’s Law us about these data set?

L=lotka(results)
lotkaTable=cbind(L$AuthorProd[,1],L$AuthorProd[,2],L$AuthorProd[,3],L$fitted)
knitr::kable(lotkaTable, caption = "Frequency Of Authors Based on Lotka's Law", digits = 3, align = "cccc", format = "html",col.names = c("Number of article", "Number of authors", "Frequency based on data", "Frequency based on Lotka's law")) %>%
    kable_classic(full_width = F, position = "center")
Frequency Of Authors Based on Lotka’s Law
Number of article Number of authors Frequency based on data Frequency based on Lotka’s law
1 2809 0.919 0.644
2 179 0.059 0.063
3 36 0.012 0.016
4 13 0.004 0.006
5 8 0.003 0.003
6 3 0.001 0.002
7 6 0.002 0.001
8 3 0.001 0.001
9 1 0.000 0.000

Pvalue of two-sample Kolmogorov-Smirnov test between the frequency based on data and the Lotka’s Law is 0.0366311. In significance level of 0.05, this value says us that our data do not follow Lotka’s law.

6 Collaboration networks for authors

Collaboration network of authors are plotted. As well, the network can be plotted for keywords, universities and countries.

NetMatrix <- biblioNetwork(M, analysis = "collaboration", 
                         network = "authors", sep = ";")
net <- networkPlot(NetMatrix, n = 10, type = "auto", Title = "collaboration Network",labelsize=1, halo = TRUE) 

7 Thematic map

Thematic Maps are plotted based on (keywords) DE AS follows:

remove.terms.1word = c("aged","map","allergy","demand","rest","workflow","data collection","r","rstudio","data analysis","conservation","review","functional",
    "clinical","identification","data","analysis","network","systematic","r programming","r package","maternal","reproducibility","r language","methods","treatment","r programming language","sars-cov-2","retention","calcium","statistics","open source","quality","methodology","complications","statistical analysis","prognosis","algorithms","software")

synonyms1 <- c("covid-19;coronavirus","gene; genes", "prediction; predicting", "modeling; modelling; resting","emotion; emotional", "adhd; hyperactivity",
      "differentially expressed genes;differentially expressed")
tm1 = thematicMap(M, field = "DE",n.labels = 2, ngrams = 1, remove.terms = remove.terms.1word,synonyms = synonyms1)
plot(tm1$map)

Thematic map is a plot which has been divided to four quadrant: Niche Themes, Motor Themes, Basic Themes and Emerging or declining Themes. For more details refer to (Zhang et al. 2022).

Motor Themes: Quadrant I, located in the upper-right quadrant, named motor
themes, suggested that the themes of the quadrant have developed
and formed important pillars that shape the field of research.

Niche Themes: Quadrant II, located in the upper left quadrant, named niche themes, reflected highly developed but isolated themes.

Emerging or declining Themes: Quadrant III, located in the lower-left
quadrant and named emerging or declining themes, suggested weak development and marginalization of the research field.

Basic Themes: Quadrant IV, located in the lower-right quadrant, was named as basic themes. Although these topics are less developed, they are important to the field of study.

Some diseases (motor themes), like obesity, covid_19, schizophernia, cancer, tuberculosis are discussed well and highly developed and analyzed by Rstudio. on the other side, some statistical and analytical topics such as machine learning, pca (principal component analysis), bibliometrics and bioinformatic analysis.

some diseases (basic themes) like type 2 diabetes mellitus, stroke, differentially expressed genes need to be considered and analyzed more than the present by rstudio, as well as meta analysis, systematic review, network analysis and computational analysis are methods which is reccommedned to use.

Based on this map, there are some themes which have been over discussed (topics covered by niche themes quadrant) in PubMed database. Topics such as, natural language processing, text mining, Pan-Cancer, behavioral science etc. As well, themes in quadrant lll, for example visualization and shiny are of declining themes.

Explanation: some words which we don’t want to be included in the map, as well synonym words are predefined.

8 Associations among our information

Here, we can see association among Authors, DEs and Journals.

threeFieldsPlot(M)

This plot shows how keywords, authors and journals are related to each other.

9 Thematic Evolution Plot

Here, We can See Evolution of Topics in RStudio applications field based on DE and TI.

This plot shows themes which have been evolutted during the years.

years=c(2019)

nexus <- thematicEvolution(M,field="DE",years=years,n=100,
          minFreq=3, ngrams = 1,remove.terms = remove.terms.1word,
          synonyms = synonyms1)
plotThematicEvolution(nexus$Nodes,nexus$Edges)
nexus <- thematicEvolution(M,field="TI",years=years,n=100,
          minFreq=3, ngrams = 2,remove.terms = remove.terms.1word,
          synonyms = synonyms1)
## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 1 rows [505].
plotThematicEvolution(nexus$Nodes,nexus$Edges)

References:

Pritchard, A. 1969. “Statistical Bibliography or Bibliometrics.” Undefined. https://www.semanticscholar.org/me/library/all.
Sengupta, I. N. 1992. “Bibliometrics, Informetrics, Scientometrics and Librametrics: An Overview” 42 (2): 75–98. https://doi.org/10.1515/libr.1992.42.2.75.
Zhang, Mingjie, Xiaoxue Wang, Xueting Chen, Zixuan Song, Yuting Wang, Yangzi Zhou, and Dandan Zhang. 2022. “A Scientometric Analysis and Visualization Discovery of Enhanced Recovery After Surgery.” Frontiers in Surgery 9. https://www.frontiersin.org/articles/10.3389/fsurg.2022.894083.