Ioannidis, John P.A. (2024), “August 2024 data-update for "Updated science-wide author databases of standardized citation indicators"”, Elsevier Data Repository, V7, doi: 10.17632/btchxktzyw.7
Data Dictionary
“authfull”: Full name of the author.
“inst_name”: The name of the institution the author is affiliated with.
“cntry”: The country where the author’s institution is located.
“np6023”: Number of papers published by the author from 1960 to 2023.
“firstyr”: The first year the author published a paper.
“lastyr”: The most recent year the author published a paper.
“rank (ns)”: Author’s rank by composite score (“c (ns)”), with self-citations excluded. In this and the following columns, “(ns)” means that self-citations are excluded.
“nc2323 (ns)”: Number of citations received in 2023 (self-citations excluded).
“h23 (ns)”: The author’s h-index as of 2023, a measure that combines productivity and citation impact (self-citations excluded).
“hm23 (ns)”: The author’s hm-index as of 2023, similar to the h-index but adjusted for co-authorship (self-citations excluded).
“nps (ns)”: Number of single-authored papers (self-citations excluded).
“ncs (ns)”: Number of citations to single-authored papers (self-citations excluded).
“cpsf (ns)”: Number of single- and first-authored papers (self-citations excluded).
“ncsf (ns)”: Number of citations to single- and first-authored papers (self-citations excluded).
“npsfl (ns)”: Number of single-, first-, and last-authored papers (self-citations excluded).
“ncsfl (ns)”: Number of citations to single-, first-, and last-authored papers (self-citations excluded).
“c (ns)”: Composite citation score (self-citations excluded).
“npciting (ns)”: Number of distinct citing papers (self-citations excluded).
“cprat (ns)”: Ratio of total citations to distinct citing papers (self-citations excluded).
“np6023 cited2323 (ns)”: Number of papers from 1960 to 2023 that were cited at least once in 2023 (self-citations excluded).
“self%”: Percentage of self-citations.
“rank”: Author’s rank by composite score (“c”), with self-citations included.
“nc2323”: Number of citations received in 2023.
“h23”: Author’s h-index as of 2023.
“hm23”: Author’s hm-index as of 2023.
“nps”: Number of single-authored papers.
“ncs”: Number of citations to single-authored papers.
“cpsf”: Number of single- and first-authored papers.
“ncsf”: Number of citations to single- and first-authored papers.
“npsfl”: Number of single-, first-, and last-authored papers.
“ncsfl”: Number of citations to single-, first-, and last-authored papers.
“c”: Composite citation score.
“npciting”: Number of distinct citing papers.
“cprat”: Ratio of total citations to distinct citing papers.
“np6023 cited2323”: Number of papers from 1960 to 2023 that were cited at least once in 2023.
“np6023_rw”: Number of the author’s papers from 1960 to 2023 that have been retracted.
“nc2323_to_rw”: Number of 2023 citations that go to retracted papers.
“nc2323_rw”: Number of 2023 citations that come from retracted papers.
“sm-subfield-1”: Primary subfield of research for the author.
“sm-subfield-1-frac”: Fraction of work the author has published in the primary subfield.
“sm-subfield-2”: Secondary subfield of research for the author.
“sm-subfield-2-frac”: Fraction of work published in the secondary subfield.
“sm-field”: The broad field of research the author works in.
“sm-field-frac”: Fraction of work published in the broad field.
“rank sm-subfield-1”: Rank by composite score within the primary subfield.
“rank sm-subfield-1 (ns)”: Rank by composite score within the primary subfield (self-citations excluded).
“sm-subfield-1 count”: Total number of authors in the primary subfield.
Data preparation
Normalize the names
Clean the dataset
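The cleaning step turns the self-citation column, which arrives as text with a percent sign, into a number. A minimal base-R sketch of the same idea follows; the vector `raw` is hypothetical and its values are made up, not taken from the dataset.

```r
# Hypothetical raw values, as they might appear in the "self%" column
raw <- c("12.5%", " 3.0 %", "0%", NA)

# Drop the percent sign, trim surrounding whitespace, then convert to numeric
self_percent <- as.numeric(trimws(sub("%", "", raw, fixed = TRUE)))

print(self_percent)
```

Missing values stay `NA`, which is why the later summaries pass `na.rm = TRUE`.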
Dimensions
Rows: 1885
Columns: 48
Institutions
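According to the source code, the figure in this section first lumps institutions that account for roughly 0.5% of authors or less into an "Other" bucket (via `fct_lump_prop`) and then drops that bucket. The base-R sketch below illustrates the lumping idea on invented data. The helper `lump_rare` and the institution labels are hypothetical, and this is not a reimplementation of `fct_lump_prop`, whose handling of ties and edge cases may differ.

```r
# Hypothetical helper: replace labels whose share is at or below `prop` with `other`
lump_rare <- function(x, prop = 0.005, other = "Other") {
  share <- table(x) / length(x)          # share of each label
  rare <- names(share)[share <= prop]    # labels at or below the threshold
  ifelse(x %in% rare, other, x)
}

# Invented data: two large institutions and three that appear once each
inst <- c(rep("Inst A", 600), rep("Inst B", 397), "Inst C", "Inst D", "Inst E")

lumped <- lump_rare(inst)
print(table(lumped))
```

With 1885 authors, a 0.5% threshold corresponds to roughly nine or ten authors per institution.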
Countries
Only active authors
Active authors
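Per the source code, "active" here means a first publication in 1994 or later and a most recent publication in 2024 or later. A small sketch of that filter on invented rows (the `authors` data frame is hypothetical):

```r
# Invented example rows; only the column names match the dataset
authors <- data.frame(
  authfull = c("Author A", "Author B", "Author C"),
  firstyr  = c(1988, 1999, 2005),
  lastyr   = c(2024, 2024, 2021)
)

# Keep authors who started in 1994 or later and published in 2024 or later
active <- subset(authors, firstyr >= 1994 & lastyr >= 2024)

print(active$authfull)
```

Author A is dropped for starting before 1994, and Author C for having no publication in 2024.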
Institutions that cite themselves
# A tibble: 6 × 2
inst_name avg_self_citation
<chr> <dbl>
1 University of Pennsylvania Perelman School of Medicine 70.1
2 National Research Centre 60.9
3 Università degli Studi di Salerno 59.4
4 Dipartimento di Medicina, Chirurgia e Odontoiatria “Scuola … 55.0
5 Hôpital Henri Mondor 51.6
6 East Carolina University 51.0
# A tibble: 1 × 2
mean sd
<dbl> <dbl>
1 9.47 7.83
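The "more than 2 SD" cutoff used to flag unusually high self-citation follows from these two numbers. A quick check with the rounded values printed above (the analysis code filters at 25.14, presumably derived from the unrounded mean and SD):

```r
mean_self <- 9.47   # rounded mean self-citation percentage from the table above
sd_self   <- 7.83   # rounded standard deviation from the table above

# Mean plus two standard deviations
cutoff <- mean_self + 2 * sd_self

print(cutoff)
```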
Correlation between productivity and self-citation
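The plot under this heading compares self-citation percentage with paper count. One way to put a number on such a relationship is a rank correlation. The sketch below uses synthetic data: all values are invented, the positive trend is built into the toy data and says nothing about the real dataset, and Spearman's rho is an assumption here, not a method used in the original analysis.

```r
set.seed(1)

# Synthetic stand-ins for the two plotted columns
n <- 200
self_percent <- runif(n, min = 0, max = 40)
np6023 <- round(50 + 3 * self_percent + rnorm(n, sd = 40))

# Spearman's rank correlation, robust to skewed paper counts
rho <- cor(self_percent, np6023, method = "spearman")

print(rho)
```

On the real data the same call would take `df_active$self_percent` and `df_active$np6023`, with `use = "complete.obs"` if either column contains missing values.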
Impact per institution
# A tibble: 429 × 2
inst_name mean_h_index
<chr> <dbl>
1 Vietnam National University, Hanoi 39
2 Università di Pisa 21
3 La Trobe University 20
4 Queen Mary University of London 19
5 Ludwig-Maximilians-Universität München 18.3
6 King Faisal University 18
7 National Taiwan University 18
8 Obafemi Awolowo University 18
9 Universidade Estadual de Maringá 18
10 The University of Queensland 17.6
# ℹ 419 more rows
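A mean h-index computed over one or two authors can dominate a ranking like this one. The sketch below reports the group size next to the mean and requires a minimum number of authors per institution. The data frame `toy`, the institution labels, and the `min_n` threshold of 3 are all hypothetical choices for illustration.

```r
# Invented data: one institution with a single high-scoring author
toy <- data.frame(
  inst_name = c("Inst A", "Inst B", "Inst B", "Inst B", "Inst C", "Inst C", "Inst C"),
  h23       = c(39, 20, 18, 22, 10, 12, 14)
)

# Mean h-index and number of authors per institution
mean_h <- tapply(toy$h23, toy$inst_name, mean)
n_auth <- tapply(toy$h23, toy$inst_name, length)
summary_tbl <- data.frame(
  inst_name    = names(mean_h),
  mean_h_index = as.numeric(mean_h),
  n            = as.integer(n_auth)
)

# Keep institutions with at least `min_n` authors, then rank by mean h-index
min_n <- 3
ranked <- summary_tbl[summary_tbl$n >= min_n, ]
ranked <- ranked[order(-ranked$mean_h_index), ]

print(ranked)
```

In this toy example "Inst A" would top an unfiltered ranking on the strength of a single author, and the size filter removes it.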
Source Code
---
title: "SCOPUS 2024"
author: "Sergio Uribe - sergio.uribe@rsu.lv"
date: 2024-09-19
date-modified: last-modified
language: 
  title-block-published: "CREATED"
  title-block-modified: "UPDATED"
format: 
  html: 
    toc: true
    toc-expand: 3
    code-fold: true
    code-tools: true
editor: visual
execute: 
  echo: false
  cache: true
  warning: false
  message: false
---

```{r}
# Load required libraries with pacman; installs them if not already installed
pacman::p_load(tidyverse, # tools for data science
               # visdat, # NAs
               janitor, # for data cleaning and tables
               here, # for reproducible research
               gtsummary, # for tables
               countrycode, # to normalize country data
               # easystats, # check https://easystats.github.io/easystats/
               scales, 
               lubridate
               )
```

## Dataset

Ioannidis, John P.A. (2024), “August 2024 data-update for "Updated science-wide author databases of standardized citation indicators"”, Elsevier Data Repository, V7, doi: 10.17632/btchxktzyw.7

```{r}
df <- read_csv("https://docs.google.com/spreadsheets/d/e/2PACX-1vRUaeUNVQaq1A_uiKc9vKJTf3KdgsMrGkSaPWZBJYNsgBWs4Jv1i_20TEQqSqsT40OhUmmwXCjK9Bvc/pub?gid=1127167516&single=true&output=csv")
```

# Data Dictionary

- **"authfull"**: Full name of the author.
- **"inst_name"**: The name of the institution the author is affiliated with.
- **"cntry"**: The country where the author's institution is located.
- **"np6023"**: Number of papers published by the author from 1960 to 2023.
- **"firstyr"**: The first year the author published a paper.
- **"lastyr"**: The most recent year the author published a paper.
- **"rank (ns)"**: Author's rank by composite score ("c (ns)"), with self-citations excluded. In this and the following columns, "(ns)" means that self-citations are excluded.
- **"nc2323 (ns)"**: Number of citations received in 2023 (self-citations excluded).
- **"h23 (ns)"**: The author’s **h-index** as of 2023, a measure that combines productivity and citation impact (self-citations excluded).
- **"hm23 (ns)"**: The author’s **hm-index** as of 2023, similar to the h-index but adjusted for co-authorship (self-citations excluded).
- **"nps (ns)"**: Number of single-authored papers (self-citations excluded).
- **"ncs (ns)"**: Number of citations to single-authored papers (self-citations excluded).
- **"cpsf (ns)"**: Number of single- and first-authored papers (self-citations excluded).
- **"ncsf (ns)"**: Number of citations to single- and first-authored papers (self-citations excluded).
- **"npsfl (ns)"**: Number of single-, first-, and last-authored papers (self-citations excluded).
- **"ncsfl (ns)"**: Number of citations to single-, first-, and last-authored papers (self-citations excluded).
- **"c (ns)"**: Composite citation score (self-citations excluded).
- **"npciting (ns)"**: Number of distinct citing papers (self-citations excluded).
- **"cprat (ns)"**: Ratio of total citations to distinct citing papers (self-citations excluded).
- **"np6023 cited2323 (ns)"**: Number of papers from 1960 to 2023 that were cited at least once in 2023 (self-citations excluded).
- **"self%"**: Percentage of self-citations.
- **"rank"**: Author's rank by composite score ("c"), with self-citations included.
- **"nc2323"**: Number of citations received in 2023.
- **"h23"**: Author’s h-index as of 2023.
- **"hm23"**: Author’s hm-index as of 2023.
- **"nps"**: Number of single-authored papers.
- **"ncs"**: Number of citations to single-authored papers.
- **"cpsf"**: Number of single- and first-authored papers.
- **"ncsf"**: Number of citations to single- and first-authored papers.
- **"npsfl"**: Number of single-, first-, and last-authored papers.
- **"ncsfl"**: Number of citations to single-, first-, and last-authored papers.
- **"c"**: Composite citation score.
- **"npciting"**: Number of distinct citing papers.
- **"cprat"**: Ratio of total citations to distinct citing papers.
- **"np6023 cited2323"**: Number of papers from 1960 to 2023 that were cited at least once in 2023.
- **"np6023_rw"**: Number of the author's papers from 1960 to 2023 that have been retracted.
- **"nc2323_to_rw"**: Number of 2023 citations that go to retracted papers.
- **"nc2323_rw"**: Number of 2023 citations that come from retracted papers.
- **"sm-subfield-1"**: Primary subfield of research for the author.
- **"sm-subfield-1-frac"**: Fraction of work the author has published in the primary subfield.
- **"sm-subfield-2"**: Secondary subfield of research for the author.
- **"sm-subfield-2-frac"**: Fraction of work published in the secondary subfield.
- **"sm-field"**: The broad field of research the author works in.
- **"sm-field-frac"**: Fraction of work published in the broad field.
- **"rank sm-subfield-1"**: Rank by composite score within the primary subfield.
- **"rank sm-subfield-1 (ns)"**: Rank by composite score within the primary subfield (self-citations excluded).
- **"sm-subfield-1 count"**: Total number of authors in the primary subfield.

# Data preparation

Normalize the names

```{r}
df <- df |> 
  janitor::clean_names()
```

Clean the dataset

```{r}
# Strip the percent sign from the self-citation column and convert it to numeric
df <- df %>%
  mutate(self_percent = str_trim(str_remove(self_percent, "%"))) %>%
  mutate(self_percent = as.numeric(self_percent))
```

## Dimensions

```{r}
# glimpse(df)
dim(df) |> 
  knitr::kable()
```

## Institutions

```{r}
# Lump institutions below 0.5% of authors into "Other", then drop generic labels
df |> 
  mutate(inst_name = fct_lump_prop(inst_name, prop = .005)) |> 
  filter(inst_name != "School of Dentistry") |> 
  filter(inst_name != "Other") |> 
  group_by(inst_name) |> 
  count() |> 
  arrange(desc(n)) |> 
  ggplot(aes(x = fct_reorder(inst_name, n), y = n)) + 
  geom_col() + 
  # scale_y_log10() + 
  geom_text(aes(label = n), color = "white", hjust = 1.2) + 
  coord_flip() + 
  labs(title = "Top Dental Institutions by number of scientists in Scopus Index", 
       caption = "Source: Ioannidis, John P.A. (2024), “August 2024 data-update for \"Updated science-wide author databases\nof standardized citation indicators\"”, Elsevier Data Repository, V7, doi: 10.17632/btchxktzyw.7\n Figure by Sergio.Uribe@rsu.lv", 
       x = "", 
       y = "N of Scientists")
```

## Countries

```{r}
# Lump countries below 0.5% of authors into "Other", then drop that bucket
df |> 
  mutate(cntry = fct_lump_prop(cntry, prop = .005)) |> 
  filter(cntry != "Other") |> 
  group_by(cntry) |> 
  count() |> 
  arrange(desc(n)) |> 
  ggplot(aes(x = fct_reorder(cntry, n), y = n)) + 
  geom_col() + 
  # scale_y_log10() + 
  geom_text(aes(label = n), color = "white", hjust = 1.2) + 
  coord_flip() + 
  labs(title = "Top countries by number of dental scientists in Scopus Index", 
       caption = "Source: Ioannidis, John P.A. (2024), “August 2024 data-update for \"Updated science-wide author databases\nof standardized citation indicators\"”, Elsevier Data Repository, V7, doi: 10.17632/btchxktzyw.7\n Figure by Sergio.Uribe@rsu.lv", 
       x = "", 
       y = "N of Scientists")
```

# Only active authors

## Active authors

```{r}
# Active = first publication in 1994 or later and most recent publication in 2024 or later
df_active <- df %>%
  filter(firstyr >= 1994 & lastyr >= 2024)
```

```{r}
df_active |> 
  mutate(inst_name = fct_lump_prop(inst_name, prop = .005)) |> 
  filter(inst_name != "School of Dentistry") |> 
  filter(inst_name != "Other") |> 
  group_by(inst_name) |> 
  count() |> 
  arrange(desc(n)) |> 
  ggplot(aes(x = fct_reorder(inst_name, n), y = n)) + 
  geom_col() + 
  # scale_y_log10() + 
  geom_text(aes(label = n), color = "white", hjust = 1.2) + 
  coord_flip() + 
  labs(title = "Top Dental Institutions by number of scientists in Scopus Index,\n first publication in 1994 or later and last in 2024 or later", 
       caption = "Source: Ioannidis, John P.A. (2024), “August 2024 data-update for \"Updated science-wide author databases\nof standardized citation indicators\"”, Elsevier Data Repository, V7, doi: 10.17632/btchxktzyw.7\n Figure by Sergio.Uribe@rsu.lv", 
       x = "", 
       y = "N of Scientists")
```

## Institutions that cite themselves

```{r}
# Calculate the average self-citation percentage per institution
df_active %>%
  group_by(inst_name) %>%
  summarise(avg_self_citation = mean(self_percent, na.rm = TRUE)) %>%
  filter(avg_self_citation > 50) %>%
  arrange(desc(avg_self_citation))
```

```{r}
# 25.14 is the mean plus two standard deviations of self_percent
df_active |> 
  filter(self_percent > 25.14) |> 
  arrange(desc(self_percent)) |> 
  ggplot(aes(x = fct_reorder(authfull, self_percent), y = self_percent)) + 
  geom_jitter(width = 0.2, height = 0) + # Adding some width to the jitter
  coord_flip() + 
  labs(
    title = "Authors with More Than 2 SD Self-Citation Percentage", 
    subtitle = "(mean = 9.5%, SD = 7.8)", 
    x = "Author (Sorted by Self-Citation Percentage)", 
    y = "Self-Citation Percentage"
  ) + 
  theme_minimal() # Apply a clean theme
```

```{r}
# Mean and SD of the self-citation percentage across all authors
df |> 
  summarise(mean = mean(self_percent), 
            sd = sd(self_percent))
```

Correlation between productivity and self-citation

```{r}
df_active |> 
  ggplot(aes(x = self_percent, y = np6023)) + 
  geom_point() + 
  # scale_x_log10() + 
  # scale_y_log10() + 
  geom_smooth()
```

```{r}
# By country
df_active %>%
  group_by(cntry) %>%
  summarise(avg_self_citation = mean(self_percent, na.rm = TRUE)) %>%
  filter(avg_self_citation > 20) %>%
  ggplot(aes(x = fct_reorder(cntry, avg_self_citation), y = avg_self_citation)) + 
  geom_jitter(width = 0.2, height = 0) + 
  coord_flip() + 
  labs(
    title = "Countries with More Than 20% Self-Citations", 
    # subtitle = "(average = 9%, SD = 7.8)", 
    x = "Country (Sorted by Average Self-Citation Percentage)", 
    y = "Average Self-Citation Percentage"
  ) + 
  theme_minimal()
```

```{r}
# One-step calculation and plot for institution
df_active %>%
  group_by(inst_name) %>%
  summarise(avg_self_citation = mean(self_percent, na.rm = TRUE)) %>%
  filter(avg_self_citation > 30) %>%
  ggplot(aes(x = fct_reorder(inst_name, avg_self_citation), y = avg_self_citation)) + 
  geom_jitter(width = 0.2, height = 0) + 
  coord_flip() + 
  labs(
    title = "Institutions with More Than 30% Self-Citations", 
    # subtitle = "(average = 9%, SD = 7.8)", 
    x = "Institution (Sorted by Average Self-Citation Percentage)", 
    y = "Average Self-Citation Percentage"
  ) + 
  theme_minimal()
```

## Impact per institution

```{r}
# Mean h-index per institution, highest first
df_active %>%
  group_by(inst_name) %>%
  summarise(mean_h_index = mean(h23, na.rm = TRUE)) %>%
  arrange(desc(mean_h_index))
```