In the previous part (rpubs.com) we selected 24 journals of Russian origin and investigated how their 2018-2019 content is represented in CrossRef, Scopus and the Lens. We found that the coverage varies by journal, and that the share of publications present in all three databases is close to 75%.
This part is devoted to the author and affiliation information, which is widely used in scientometric analysis.
Scopus is known to extract the affiliation information as it appears in the original publications. This policy makes sense: so many organizations are merged, split, or renamed that Elsevier is hardly able to follow and implement these changes, so they store the original information from the publications. At the same time, Scopus has affiliation profiles that combine the publications with the corresponding affiliation texts. Those profiles can be queried and used for further analysis in the Scopus UI and API. The inherent weakness of the profile approach is that, due to the extremely high variability of the name variants, it is almost impossible to keep the profiles accurate. Elsevier allows the organizations to check and validate the accuracy of their affiliation profiles (outside the subscription), but this activity is not mandatory, so the profiles are cleaned and monitored mostly by the universities that care about their positions in the international rankings (approx. 50 out of 300+ Russian universities). When Scopus fails to recognize the affiliation text, it assigns the publication to a temporary affiliation profile that may comprise as few as 1-2 publications. Such profiles, like space dust, exist in Scopus until the owner (university) asks to re-assign the publications to its official, one and only, affiliation profile.
CrossRef relies on the content provided by the publisher. The service accepts affiliation info, but many publishers neglect this opportunity. We will soon find out to what extent.
The Lens aggregates information from Microsoft Academic, CrossRef and Medline (all of them collect the affiliation info directly from the publications), so it is very interesting to see how accurate the Lens records are. Some affiliations in the Lens are matched to GRID records, which is cool, as GRID (the name stands for Global Research Identifier Database) is a database of research organizations with a CC0 license that “with a little help of all the world enthusiasts” can become as indispensable as Wikipedia. Yet this database has the same problem as the Scopus profiles: if the affiliation text is not standard, GRID can’t help, so such records are either lost for analysis or require extra work.
So let’s check it.
The Scopus dataset was downloaded as a CSV file, which is often simpler to process; the only problem we faced in the first part was merging the funding text fields (see the previous part). But when it comes to affiliation info, things get more difficult: when an author has more than one affiliation, the string in the Scopus CSV looks like this:
“AUTHOR, Space Research Institute, Russian Academy of Sciences, Moscow, 117997, Russian Federation, Faculty of Physics, Moscow State University, Moscow, 119899, Russian Federation, National Research University Higher School of Economics, Moscow, Russian Federation”
AUTHOR, Institute of Physical Chemistry and Electrochemistry, Leibniz Universitat Hannover, Callinstr. 3-3a, Hannover, 30167, Germany, School of Physical Sciences, University of Kent, Canterbury, Kent, CT2 7NH, United Kingdom
There is no special delimiter between the different organizations, so the only way to obtain the unique {Author-Affiliation} pairs from the Scopus CSV is to use regular expressions for the author names and then to split the strings after the country names. As often happens, the procedure is not smooth and requires some workarounds. This time we extract only the author names and leave the remaining affiliation string in concatenated form.
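The splitting after country names is not performed in this part, but the idea can be sketched in a few lines of base R. This is a minimal illustration rather than the procedure actually used: the `countries` vector and the `split_affiliations()` helper are assumptions introduced here, and a real run would need the full list of country names occurring in the data.

```r
# Hypothetical, deliberately short country list (an assumption for this sketch)
countries <- c("Russian Federation", "Germany", "United Kingdom")

# Split a concatenated affiliation string after each known country name.
# The lookbehind keeps the country attached to the organization it closes.
split_affiliations <- function(affline, countries) {
  pattern <- paste0("(?<=", countries, "), ", collapse = "|")
  trimws(strsplit(affline, pattern, perl = TRUE)[[1]])
}

# the second example string from the text, without the author name
affline <- paste0(
  "Institute of Physical Chemistry and Electrochemistry, ",
  "Leibniz Universitat Hannover, Callinstr. 3-3a, Hannover, 30167, Germany, ",
  "School of Physical Sciences, University of Kent, Canterbury, Kent, ",
  "CT2 7NH, United Kingdom")

parts <- split_affiliations(affline, countries)
print(parts)
```

A split of this kind would also fire inside organization names that contain a country name followed by a comma (for example, a name like "State Research Center of Russian Federation, ..."), which is one of the reasons the procedure needs workarounds.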
sc.affils <- read_csv(paste0(dir, "/data/scopus/sc.data.csv"),
col_types = cols(ISSN = col_character())) %>%
select(DOI, "year"=Year, ISSN, "src.title"=`Source title`, eid=EID,
"auth.affils"=`Authors with affiliations`) %>%
mutate(DOI=trimws(tolower(DOI)), ISSN=toupper(ISSN),
src.title=trimws(toupper(src.title))) %>%
unique() %>%
# the author names (followed by the affiliation info) are separated by semicolons
mutate(auth.affils=strsplit(auth.affils, split="; ")) %>%
unnest(auth.affils)
## this is a regular expression for author names in Scopus
formula="^[[:alpha:]\\-\\'\\s\\’\\(\\)]+[[:space:]\\.\\,]{1,2}[[:alpha:]\\.\\-]{1,10}(?=,)"
# this formula is the best I could make up - it processes names like "Bruce L.-I.", "Jesus Maria Gutierez C.F.N."
# Scopus puts extra commas before suffixes like "Manino A, Jr." or "Carl Jenkins, IV."; see the workaround for "Jr." below
sc.affils <- sc.affils %>% mutate(auth.affils=gsub(", Jr.,","Jr.,", auth.affils)) %>%
mutate(auth.affils=gsub(", Jr.","Jr.", auth.affils)) %>%
mutate(auth.affils=gsub(" Jr., ",", Jr.", auth.affils)) %>%
# the next 2 lines are a workaround to extract the author names that are not followed by a comma and text
mutate(auth.affils=paste0(auth.affils, ",")) %>%
mutate(auth.affils=gsub(", ,",",",auth.affils)) %>%
# extracting name
mutate(name = sapply(str_extract(auth.affils,pattern=formula),
function(x) unlist(x, use.names = FALSE))) %>%
# extracting the affiliation text by removing the names
mutate(affline = sapply(str_replace(auth.affils, pattern= name, ""),
function(x) unlist(x, use.names = FALSE))) %>%
# cleaning the commas and spaces (both ends)
mutate(affline = sapply(str_replace_all(affline,
pattern="(^[[:space:][:punct:]]+|[[:space:][:punct:]]+$)", ""),
function(x) trimws(unlist(x, use.names = FALSE)))) %>%
# replacing empty and near-empty cells (fewer than 2 characters) with NA
mutate(affline = ifelse(nchar(affline)<2,NA,affline)) %>%
select(-auth.affils)
sc.affils %>% write_excel_csv(paste0(dir, "/data/scopus/sc.affils.csv"))
Now we have a dataset with separate {author - mixture of author’s affiliations} pairs in rows.
## Observations: 10,182
## Variables: 7
## $ DOI <chr> "10.21517/0202-3822-2018-41-3-93-104", "10.21517/020...
## $ year <dbl> 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018...
## $ ISSN <chr> "02023822", "02023822", "02023822", "02023822", "020...
## $ src.title <chr> "PROBLEMS OF ATOMIC SCIENCE AND TECHNOLOGY, SERIES T...
## $ eid <chr> "2-s2.0-85056665925", "2-s2.0-85056665925", "2-s2.0-...
## $ name <chr> "Dokuka, V.N.", "Gostev, A.A.", "Khayrutdinov, R.R."...
## $ affline <chr> "State Research Center of Russian Federation, Troits...
We process the same JSON that we exported earlier.
# my using of purrr functions can seem amateurish, so it is, sorry
ls.data <- jsonlite::fromJSON(paste0(dir, "/data/lens/lens-export.json"), flatten=TRUE)
ls.authors <- ls.data %>% select(authors, lens_id) %>%
mutate(authors = map_if(authors, is.null, ~ tibble())) %>%
unnest(authors)
ls.affils <- ls.authors %>%
mutate(affiliations = map_if(affiliations, is_empty, ~ tibble())) %>%
unnest(affiliations) %>%
full_join(ls.authors %>% select(-affiliations)) %>%
purrr::modify(~replace(.x,lengths(.x)==0,list(NA))) %>%
modify_if(~all(lengths(.x)==1),unlist)
ls.meta <- read_csv(paste0(dir, "/data/lens/ls.meta.csv")) %>%
select(lens_id, DOI="doi", src.title="source.title_full",
"year"= year_published, "type"=publication_type, print, electronic) %>%
filter(type!="journal-issue"&type!="journal issue") %>%
mutate(DOI=trimws(tolower(DOI)),
print=toupper(print),
electronic=toupper(electronic),
src.title=trimws(toupper(src.title))) %>%
## adding Scopus ISSN as we agreed to use them as connectors
mutate(ISSN=ifelse(print %in% sc.affils$ISSN, print,
ifelse(electronic %in% sc.affils$ISSN, electronic, NA))) %>%
unique()
# unnesting operations with ls.affils lose the NULL lines, we recover them via left_join
ls.meta %>%
left_join(ls.affils, by=c("lens_id")) %>%
write_excel_csv(paste0(dir, "/data/lens/ls.affils.csv"))
This is a Lens dataset with the author and affiliation metadata that we are going to use further.
## Observations: 13,624
## Variables: 16
## $ lens_id <chr> "000-101-321-426-866", "000-101-321-426-866", ...
## $ DOI <chr> "10.17580/gzh.2018.11.15", "10.17580/gzh.2018....
## $ year <dbl> 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018...
## $ type <chr> "journal article", "journal article", "journal...
## $ print <chr> "00172278", "00172278", "00172278", "00172278"...
## $ electronic <chr> NA, NA, NA, NA, NA, "23134836", "23134836", "2...
## $ ISSN <chr> "00172278", "00172278", "00172278", "00172278"...
## $ collective_name <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ first_name <chr> "I. Yu.", "Moscow", "P. A.", "O. S.", "S. V.",...
## $ initials <chr> "IY", "M", "PA", "OS", "SV", "KP", NA, "В", "Д...
## $ last_name <chr> "Maslov", "Global Mining Explosive—Russia", "B...
## $ name <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "R...
## $ grid.addresses <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "R...
## $ grid.id <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "g...
## $ grid <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ src.title <chr> "GORNYI ZHURNAL", "GORNYI ZHURNAL", "GORNYI ZH...
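The two-step recovery used above (unnest, then join back to the metadata) can be illustrated on toy data. This sketch uses base R only so that it runs without extra packages; the `authors` and `meta` objects and the names in them are invented for the illustration and do not come from the Lens export.

```r
# toy records: publication "b" has no authors, as happens in the JSON exports
authors <- list(a = c("Ivanov", "Petrov"), b = character(0), c = "Smith")

# "unnesting": one row per {id, author}; the empty entry contributes no rows
unnested <- data.frame(
  id = rep(names(authors), lengths(authors)),
  last_name = unlist(authors, use.names = FALSE),
  stringsAsFactors = FALSE)

# publication-level metadata still lists all three ids
meta <- data.frame(id = c("a", "b", "c"), year = c(2018, 2018, 2019),
                   stringsAsFactors = FALSE)

# a left join from the metadata brings "b" back with NA in the author column
recovered <- merge(meta, unnested, by = "id", all.x = TRUE)
print(recovered)
```

The publication without authors disappears after the unnesting step and reappears after the join, which is why the metadata table is used as the left-hand side.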
CrossRef JSONs are unnested and merged into one dataset in the same way as the Lens JSON, except that we have to merge a few files.
ff<-list.files(paste0(dir,"/data/crossref/"))
ff<-ff[grepl("works",ff)]
cr.data <- data.frame()
for (i in 1:length(ff)){
cr.json <- jsonlite::fromJSON(paste0(dir, "/data/crossref/",ff[i]), flatten=TRUE)
# again, to deal with lost NULL_containing lines, we process in two steps
cr.auths <- cr.json$message$items %>%
select(DOI, author) %>%
mutate(author = map_if(author, is_empty, ~ tibble())) %>%
unnest(author)
# now the addresses
cr.addresses<- cr.auths %>%
mutate(affiliation = map_if(affiliation, is_empty, ~ tibble())) %>%
unnest(affiliation)
# joining
cr.affils <- cr.auths %>% select(-affiliation) %>% left_join(cr.addresses)
# merging in one file
cr.data <- rbind(cr.data, cr.affils)
}
# the affiliation names are in both name and name 1 columns, we coalesce them
cr.data <- cr.data %>% mutate(name=coalesce(name, name1)) %>% select(-name1)
### adding metadata
cr.meta <- read_csv(paste0(dir, "/data/crossref/cr.meta.csv")) %>%
select(DOI, "src.title"=source.title, type,
"year"=pubdate, print, electronic) %>%
filter(type!="journal-issue"&type!="journal issue") %>%
# some formatting
mutate(DOI=trimws(tolower(DOI)),
print=toupper(print),
electronic=toupper(electronic),
src.title=trimws(toupper(src.title))) %>% unique() %>%
# adding Scopus ISSNs as connectors
mutate(ISSN=ifelse(print %in% sc.affils$ISSN, print,
ifelse(electronic %in% sc.affils$ISSN, electronic, NA))) %>%
unique()
cr.meta %>%
left_join(cr.data %>% select(-affiliation)) %>%
write_excel_csv(paste0(dir, "/data/crossref/cr.affils.csv"))
Below is a summary of the dataset with the author & affiliation metadata from CrossRef records that we are going to use further.
cr.affils <- read_csv(paste0(dir, "/data/crossref/cr.affils.csv")) %>%
select(-src.title) %>% left_join(sc.labs) # making the universal src.title names
glimpse(cr.affils)
## Observations: 17,162
## Variables: 13
## $ DOI <chr> "10.32607/2075-8251-2017-9-4-84-91", "10...
## $ type <chr> "journal-article", "journal-article", "j...
## $ year <dbl> 2018, 2018, 2018, 2018, 2018, 2018, 2018...
## $ print <chr> "20758251", "20758251", "20758251", "207...
## $ electronic <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ ISSN <chr> "20758251", "20758251", "20758251", "207...
## $ given <chr> "A. A.", NA, "I. G.", "T. K.", "V. A.", ...
## $ family <chr> "Panina", NA, "Dementieva", "Aliev", "To...
## $ sequence <chr> "first", "first", "additional", "additio...
## $ name <chr> NA, "Shemyakin-Ovchinnikov Institute of ...
## $ ORCID <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ `authenticated-orcid` <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ src.title <chr> "ACTA NATURAE", "ACTA NATURAE", "ACTA NA...
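Both the Lens and CrossRef metadata are connected to Scopus through whichever ISSN (print or electronic) is found in the Scopus dataset. The nested `ifelse()` from the code above can be isolated into a small helper to see how it behaves. `pick_issn()` is a name made up for this sketch, the list of known ISSNs is assumed, and the third pair of ISSNs is a placeholder.

```r
# Return the print ISSN if Scopus lists it, otherwise the electronic one, otherwise NA
pick_issn <- function(print, electronic, known) {
  ifelse(print %in% known, print,
         ifelse(electronic %in% known, electronic, NA_character_))
}

# assumed for the sketch: both values are treated as present in the Scopus dataset
known_issn <- c("00172278", "23134836")
print_issn <- c("00172278", NA, "11111111")      # "11111111" is a placeholder
electronic_issn <- c(NA, "23134836", "22222222") # "22222222" is a placeholder

connector <- pick_issn(print_issn, electronic_issn, known_issn)
print(connector)
```

Records for which neither ISSN is known get NA as the connector, so they can later be spotted and checked by hand.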
The three datasets are compared by the share of publications containing affiliation info.
# scopus
sc_x <- sc.affils %>%
mutate(status=!is.na(affline)) %>%
group_by(src.title, ISSN, eid) %>%
summarize(status=sum(status)) %>%
mutate(status=ifelse(status!=0, "present", "missing")) %>%
add_count(src.title) %>%
group_by(src.title, ISSN, status, n) %>%
summarize(pubs=n_distinct(eid)) %>% ungroup()
# crossref
cr_x <- cr.affils %>%
# the affiliation names were coalesced into the name column earlier (name1 was dropped)
mutate(status=!is.na(name)) %>%
group_by(src.title, ISSN, DOI) %>%
summarize(status=sum(status)) %>%
mutate(status=ifelse(status!=0, "present", "missing")) %>%
add_count(src.title) %>%
group_by(src.title, ISSN, status, n) %>%
summarize(pubs=n_distinct(DOI)) %>% ungroup()
# lens
ls_x <- ls.affils %>%
mutate(status=!is.na(name)) %>%
group_by(src.title, ISSN, lens_id) %>%
summarize(status=sum(status)) %>%
mutate(status=ifelse(status!=0, "present", "missing")) %>%
add_count(src.title) %>%
group_by(src.title, ISSN, status, n) %>%
summarize(pubs=n_distinct(lens_id)) %>% ungroup()
x_x <- rbind(ls_x %>% mutate(source="Lens"),
cr_x %>% mutate(source="CrossRef"),
sc_x%>% mutate(source="Scopus")) %>%
# making the labels (in advance)
mutate(label2=percent(pubs/n, accuracy=1)) %>%
# making the shorter labels for the journal names
mutate(label=ifelse(nchar(src.title)<64,
str_wrap(src.title,25),
str_wrap(paste0(substr(src.title, 1,64),"..."),25)))
x_x %>%
ggplot(aes(x=reorder(ISSN,n), y=pubs, fill=status))+
geom_col()+
geom_text(inherit.aes = FALSE, data=x_x[x_x$status=="present",],
aes(x=reorder(ISSN,n), y=n+2, label=percent(pubs/n, accuracy=1)),
size=2.7, fontface="bold", hjust=0)+
facet_wrap(~source, ncol=3)+
scale_y_continuous(expand = expand_scale(mult=c(0,0.15)))+
coord_flip()+
scale_fill_manual(values=c("coral", "#0072B2"), name="SOURCE")+
labs(title="PRESENCE OF AFFILIATION INFO",
subtitle="PUBYEAR: 2018-2019",
caption="Accessed: June 29, 2019",
y="PUBLICATIONS", x="ISSN")+
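mytheme
# ggsave() is called separately and saves the last displayed plot (adding it with "+" fails in recent ggplot2 versions)
ggsave(paste0(dir, "/affils_long.png"), width=20, height = 12, units="cm", dpi=300)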
Another projection of the same data, stressing the ratios.
Scopus contains almost complete affiliation records for 23 out of 24 journals. CrossRef records are still deficient in affiliation info, with empty lines for 19 out of 24 journals. The Lens data is more heterogeneous, with the share of affiliation-containing records varying from zero to 90%.
The observed ratios should be read with the caveat that some of these journals are produced by university publishing houses that may lack the experience and motivation to prepare extended metadata for CrossRef. The ratios for international journals published in a more professional manner may be less dramatic.
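The shares discussed above follow one rule: a publication counts as having affiliation info if at least one of its author rows carries a non-empty affiliation. A minimal base R sketch on an invented table (the `toy` data frame and its values are assumptions) shows the calculation.

```r
# toy author-level table: one row per {publication, author}
toy <- data.frame(
  journal = c("J1", "J1", "J1", "J2", "J2"),
  pub_id  = c("p1", "p1", "p2", "p3", "p4"),
  affline = c("Org A", NA, NA, "Org B", "Org C"),
  stringsAsFactors = FALSE)

# a publication is "present" if at least one of its author rows has an affiliation
has_aff <- tapply(!is.na(toy$affline), toy$pub_id, any)
# journal of each publication, in the same order as has_aff
journal_of <- tapply(toy$journal, toy$pub_id, function(x) x[1])
# share of publications with affiliation info per journal
share <- tapply(has_aff, journal_of, mean)
print(round(share, 2))
```

Counting at the publication level rather than at the author level keeps a single author without an affiliation from marking the whole record as missing.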
As we already found out that CrossRef is highly deficient in affiliation info, we are going to continue with the Lens and Scopus only. The next thing for us would be to assess how many records/authors miss the affiliation info while it is present in the original publications. In order to do that we define a set of the publications that are present in both the Lens and Scopus datasets, and have non-empty affiliation lines.
## filtering the DOIs that have at least one non-empty affiliation line
doi_int <- intersect(
ls.affils %>% filter(!is.na(name)) %>% select(DOI),
sc.affils %>% filter(!is.na(affline)) %>% select(DOI)) %>%
na.omit()
The intersection of the Lens and Scopus publications with non-empty affiliation strings comprises 1357 DOIs (approx. 40% of all records in the Lens search results). The next step is to convert the data into a list of unique {DOI-Author} pairs and to count the number of affiliation lines for each pair.
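Pairing authors across the two sources needs a common author key. The sketch below isolates the two transformations used in the code that follows; the wrapper names `scopus_key()` and `lens_key()` are introduced here for illustration only.

```r
# Scopus stores "Family, I.I."; dropping the dots gives "FAMILY, II"
scopus_key <- function(name) toupper(gsub("\\.", "", name))

# the Lens stores the family name and the initials separately
lens_key <- function(last_name, initials) toupper(paste0(last_name, ", ", initials))

print(scopus_key("Dokuka, V.N."))   # name taken from the Scopus summary above
print(lens_key("Maslov", "IY"))     # name taken from the Lens summary above
# paste0() turns a missing value into the literal text "NA"
print(lens_key("Maslov", NA))
```

Keys built from records with missing initials end with ", NA", and mis-parsed names (such as an organization name in the last_name column) produce keys that match nothing in Scopus. Both effects may contribute to the different numbers of {DOI-Author} pairs in the two datasets, although this is not verified here.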
# filtering datasets by pre-selected DOI
sc_y<- sc.affils %>% filter(DOI %in% doi_int$DOI) %>%
mutate(status=!is.na(affline)) %>%
mutate(author=toupper(gsub("\\.","", name))) %>%
# removing duplicates (one DOI, few PIDs)
select(DOI, ISSN, src.title, author, status) %>% unique() %>%
# flagging each {DOI-author} pair: 1 if it has a non-empty affiliation line, 0 otherwise
# (max() returns one value per group, whereas pmax() on a single vector returns it unchanged)
group_by(DOI, ISSN, src.title, author) %>%
summarize(affil.count=max(status, na.rm = TRUE)) %>% ungroup()
ls_y<- ls.affils %>% filter(DOI %in% doi_int$DOI) %>%
mutate(status=!is.na(name)) %>%
mutate(author=toupper(paste0(last_name, ", ",initials))) %>%
# removing duplicates (one DOI, few PIDs)
select(DOI, ISSN, src.title, author, status) %>% unique() %>%
# flagging each {DOI-author} pair: 1 if it has a non-empty affiliation line, 0 otherwise
# (max() returns one value per group, whereas pmax() on a single vector returns it unchanged)
group_by(DOI, ISSN, src.title, author) %>%
summarize(affil.count=max(status, na.rm = TRUE)) %>% ungroup()
Scopus returned 5398 unique {DOI-Author} pairs. After we calculate a number of authors and a number of affiliations for each DOI, we can filter the records that are likely to lose an affiliation info (i.e. those having different numbers of authors and affiliations).
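The filter for suspicious records can be tried on a toy table first. This is a base R sketch under stated assumptions: the `pairs` data frame, its DOIs and author names are invented, and `has_aff` plays the role of the 0/1 affiliation flag per {DOI-Author} pair.

```r
# toy {DOI-Author} pairs with a flag for a non-empty affiliation line
pairs <- data.frame(
  DOI = c("10.1/a", "10.1/a", "10.1/b", "10.1/b", "10.1/b"),
  author = c("IVANOV, II", "PETROV, PP", "SMITH, J", "LEE, K", "KIM, S"),
  has_aff = c(TRUE, TRUE, TRUE, FALSE, TRUE),
  stringsAsFactors = FALSE)

# number of authors and number of authors with an affiliation, per DOI
n_auth <- tapply(pairs$author, pairs$DOI, length)
n_aff  <- tapply(pairs$has_aff, pairs$DOI, sum)
counts <- data.frame(DOI = names(n_auth),
                     no.authors = as.vector(n_auth),
                     no.affiliations = as.vector(n_aff),
                     stringsAsFactors = FALSE)

# records where at least one author has no affiliation line
suspect <- counts[counts$no.authors != counts$no.affiliations, ]
print(suspect)
```

Only the second DOI is returned, because one of its three authors has no affiliation line.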
sc_y %>%
group_by("Journal"= src.title, DOI) %>%
summarize(no.authors=n(), no.affiliations=sum(affil.count)) %>%
# filtering the records where No of authors =/= No of non-empty affiliation lines
filter(no.authors!=no.affiliations) %>%
datatable(escape=FALSE,rownames = FALSE, options = list(dom = 't')) %>%
formatStyle(3:4, `text-align` = "center")
There are just 6 publications in Scopus dataset that have the authors without an affiliation info. Let’s review some of the examples.
10.1016/j.mencom.2018.07.027 from MENDELEEV COMMUNICATIONS contains an error in the original publication. The name of the last author is written with a comma “Nicolai, V.Bovin”, so all the other databases Scopus, CrossRef, Lens, MAG regarded this string as representing two authors - Nicolai and V.Bovin, and the latter got everything.
10.3367/ufne.2017.12.038309 from USPEKHI-PHYSICS has another type of irregularity. The affiliation records in the original publication are correct. CrossRef does not have affiliation info in its JSON, MAG recognized just the Russian Academy of Sciences (while the original affiliations are 2 institutions of the Russian Academy of Sciences and MIPT University). Lens inherited the affiliation data from MAG, but lost MIPT, so it is also only RAS there. Scopus recognized all three organizations, but assigned all three affiliations only to the first author (this seems to be an accidental error, not a systemic bug).
10.1007/s11055-018-0655-8 from NEUROSCIENCE AND BEHAVIORAL PHYSIOLOGY has 6 authors and only 5 of them have the affiliation text in the original publication. CrossRef record has no affiliation info. All other databases replicated the error in their records.
10.15826/qr.2018.3.317 from QUAESTIO ROSSICA has 2 authors with complete affiliation info in the original publication. Both authors are affiliated with Ural Federal University. The same information is registered in CrossRef. But as the journal has a somewhat ambiguous layout (at least in the author information section), the other databases (Lens, MAG, Scopus) also extracted the name of the translator, so their records contain 3 authors. In addition to that, MAG created a new affiliation line for all three authors, and Lens assigned that additional affiliation to the correct authors and left only the translator affiliated with Ural Federal University.
The conclusion is quite obvious: missing affiliation lines in Scopus stem from errors in the original publications (an unclear layout of the publication is also the publisher’s fault, in my humble opinion), and they are quite rare (6 out of 1357, less than 0.5%).
The Lens dataset contains 5513 unique {DOI-Author} pairs. A difference in a number of {DOI-Author} lines between two databases is interesting, we will find out the reason later, but first let’s see how many authors lack an affiliation info.
ls_y %>%
group_by("Journal"= src.title, DOI) %>%
summarize(no.authors=n(), no.affiliations=sum(affil.count)) %>%
# filtering the records where No of authors =/= No of non-empty affiliation lines
filter(no.authors!=no.affiliations) %>%
datatable(escape=FALSE,rownames = FALSE, options = list(dom = 't')) %>%
formatStyle(3:4, `text-align` = "center")
There are just 9 publications (versus 6 in Scopus) with authors missing affiliation info. It is worth reviewing some of them.
10.14341/dm8339 from DIABETES MELLITUS seems to miss affiliation info for 3 authors! The journal’s web site shows the affiliation info individually for each author. The PDF, however, shows only one affiliation line, as all authors are actually from one university. The record in CrossRef lacks the affiliation info. MAG shows one affiliation for all 6 authors. Scopus has complete separate lines for each of the 6 authors. The Lens record has three authors with affiliation info and three authors that lack it.
10.17323/1813-8691-2018-22-2-169-196 from HSE ECONOMIC JOURNAL has 2 authors in Lens, but one lacks the affiliation info. Scopus and CrossRef records have complete lines for both authors.
10.15826/qr.2018.1.290 from QUAESTIO ROSSICA. Do you remember the article about the backwardness of Russia that we reviewed in part 1? MAG stores three versions of it, the Lens merged them into one record, but the record obtained two versions of the only author’s name (in Cyrillic and in Latin).
These rare cases are obviously just the accidents, not systemic bugs.
As some affiliation names in the Scopus CSV are concatenated into one string without a special delimiter, one needs to make a special effort to extract a list of individual organizations. During this exciting journey, one will learn that there are many Lebanons in the US (visit <https://lebanonusa.com/> to know more about that) and other fancy facts alike. Once the unique affiliation strings are extracted, they have to be normalized, which, due to the highly variable nature of affiliations in general (and Russian affiliations in particular), promises another set of tortures. I usually apply Geonames, GRID, and Nominatim in sequence, and there is still a lot of manual work left to identify the broken names and abbreviations.
We are not going to do all of this now. Instead, we will do a quick test that I call “10 Russian city names”:
10 Russian cities are selected with 2 criteria: (a) they are relatively small, but host some research organizations, (b) the city names are not present in the names of the related districts/regions/territories. The latter criterion is required so that we could use the city names for searching without retrieving irrelevant records (for instance, we can’t use a city named Kaluga, because there is also the Kaluga region, and other cities there can appear in the affiliation lines together with the word Kaluga).
then I will make a text search for the 10 names in the Scopus dataset and filter the publications (DOIs) having those cities in the affiliation lines.
using the obtained list of DOIs, I will obtain the full affiliation data from the Scopus and the Lens datasets.
the individual affiliation lines will be matched & checked for consistency.
This test will allow us to check if the Lens records contain only the large & medium organizations (present in GRID), or also the names of smaller and less known organizations.
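The search itself is a plain pattern match, and criterion (b) can be seen at work on a few lines. In the sketch below the first and third affiliation lines are taken from the examples in this text, while the second one is invented to show the problem with region names.

```r
cities <- c("Apatity", "Obninsk")             # two of the ten names, Latin variants only
pattern <- paste(cities, collapse = "|")

afflines <- c(
  "Kola Scientific Center, Russian Academy of Sciences, Apatity, Russian Federation",
  "Hypothetical Institute, Obninsk, Kaluga Region, Russian Federation",  # invented line
  "Faculty of Physics, Moscow State University, Moscow, 119899, Russian Federation")

# lines mentioning one of the selected cities
hits <- grepl(pattern, afflines)
print(hits)

# a name shared with a region also matches lines that belong to another city
kaluga_hits <- grepl("Kaluga", afflines)
print(kaluga_hits)
```

Searching for "Kaluga" returns the line of an organization located in Obninsk, which is exactly the kind of irrelevant hit the second criterion is meant to avoid.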
# first the name, I make all Cyrillic and Latin variants
rusaffs <- paste("Tobolsk|Tobol'sk|Тобольск", "Obninsk|Обнинск",
"Izhevsk|Ижевск", "Birobidzhan|Биробиджан",
"Vladikavkaz|Владикавказ", "Apatity|Апатиты",
"Sochi|Сочи", "Makhachkala|Махачкала",
"Saransk|Саранск", "Norilsk|Noril'sk|Норильск" , sep="|")
city.doi <- sc.affils %>%
filter(DOI %in% doi_int$DOI) %>%
filter(grepl(rusaffs, affline)) %>%
select(DOI) %>% unique()
sc.city.test <- sc.affils %>%
filter(DOI %in% city.doi$DOI) %>%
select(DOI, src.title, affline) %>% unique() %>%
# to ease the reading we split the Scopus affiliation lines here
mutate(affline=
strsplit(affline,
split=", Russian Federation,|, Ukraine,|, Israel,|, United Kingdom,|, Germany,"))%>%
unnest(affline) %>% unique() %>% arrange(affline) %>%
mutate(affline= paste0("- ", affline)) %>%
group_by(src.title, DOI) %>%
summarize(scopus.aff.names = paste0(affline, collapse="<br /> <br />"))
ls.city.test <- ls.affils %>%
filter(DOI %in% city.doi$DOI) %>%
select(DOI, src.title, name, grid.addresses, grid.id, grid) %>% unique() %>%
mutate(name= ifelse(is.na(grid.id), paste0("- ", name), paste0("- ", name,": ", grid.id))) %>%
group_by(src.title, DOI) %>%
summarize(lens.aff.names = paste0(name, collapse="<br /> <br />"))
From the dataset of 1357 DOIs (having an affiliation info in both Scopus and the Lens records) we filtered the publications containing the names of 10 Russian cities (in Cyrillic and Latin transliterations) and found just 33 unique publications. The table below shows the affiliation lines from both databases.
left_join(sc.city.test, ls.city.test) %>% ungroup() %>%
mutate(label=paste0(src.title, "<br />", DOI)) %>%
select(label, scopus.aff.names, lens.aff.names) %>%
datatable(escape=FALSE,
rownames = FALSE, filter = 'top',
options = list(pageLength = 5,
lengthMenu = c(5, 10, 20),
autoWidth = TRUE,
columnDefs = list(list(width = '450px', targets =c(1:2))))) %>%
formatStyle(1:4, `text-align` = "left")
This list can be difficult to grasp at first glance, so I will briefly review the records below and mark the typical problems.
20 out of 33 records in the Lens contain the GRID name “Russian Academy of Sciences” with ID: grid.4886.2. The Russian Academy of Sciences (RAS) is not a typical organization, it is rather a huge network of research institutes. A few years ago RAS comprised 750+ organizations that were restructured into 450+ research institutions (including 30+ federal research centers that may comprise 5-10 research institutions), arranged in 4 territorial branches (Ural, Siberian, Far Eastern, & Central). I believe this explains why using “Russian Academy of Sciences” as an organization identifier is no better than saying “an organization from Siberia”. So about 60% of the records (20 out of 33) have affiliation lines with RAS that lost their specificity in the Lens.
For instance, 10.1007/s11055-018-0596-2 in NEUROSCIENCE AND BEHAVIORAL PHYSIOLOGY has 2 affiliation lines in Scopus:
Kola Scientific Center, Russian Academy of Sciences, Apatity, Russian Federation,
Sechenov Institute of Evolutionary Physiology and Biochemistry, Russian Academy of Sciences, St. Petersburg, Russian Federation,
and only one in the Lens - RAS.
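A count like "20 out of 33" can be obtained by flagging the publications that carry the umbrella RAS identifier. The following base R sketch shows one way to do it on invented rows; the `toy` table is an assumption, and only the RAS identifier itself comes from the text.

```r
ras_id <- "grid.4886.2"   # GRID ID of the Russian Academy of Sciences quoted in the text

# toy {DOI, GRID ID} rows; "grid.0000.0" is a placeholder, not a real identifier
toy <- data.frame(
  DOI = c("d1", "d1", "d2", "d3"),
  grid.id = c("grid.4886.2", NA, "grid.4886.2", "grid.0000.0"),
  stringsAsFactors = FALSE)

# TRUE for a publication if any of its affiliation rows is the umbrella RAS record
has_ras <- tapply(toy$grid.id %in% ras_id, toy$DOI, any)
ras_share <- mean(has_ras)
print(round(ras_share, 2))
```

Two of the three toy publications carry the umbrella identifier, so the printed share is 0.67.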
Unfortunately, this issue is ubiquitous. Many reports about Russian science, regardless of the source of data (be it Web of Science or Scopus), showed RAS listed among the individual organizations. Nature Index (2019) merges all the RAS institutions into one account, but it is impossible to check whether they did it correctly, since there are many research institutions in Russia that do not belong to RAS, and some did belong in the past but are now governed by other agencies. So the key takeaway here is that the structure of the Russian research institutions needs to be upgraded in GRID.
It is not only the general names that pose a problem, in the selected 33 publications I identified worrying cases of wrong (added / missing) affiliations.
10.7868/s0044513418010075 from ZOOLOGICHESKII ZHURNAL has two affiliations in the Lens record and in the original document, while Scopus shows only one.
10.7868/s0044513418020137 from ZOOLOGICHESKII ZHURNAL. The Lens record has two names in Cyrillic, same as in the original publication. The names are identical, both referring to Caucasian State Nature Biosphere Reserve, with the only difference that the first name lacks the city, while the second name points at Maikop (city in the Republic of Adygea). Scopus record has also two names - the second name points at Sochi (where a head office of the Caucasian State Nature Biosphere Reserve is located), but the first name points at Kabardino-Balkarian Scientific Centre, Tembotov Institute of Ecology of Mountain Territories, Russian Academy of Sciences, which is located in Nalchik, a city in the Republic of Kabardino-Balkaria (400 kms from Maikop). Even if all those centers are parts (in a legal sense) of the Reserve in question, shouldn’t the affiliation lines be more similar?
same journal, an original publication for 10.7868/s0044513418040037 and the Lens record show 3 authors affiliated with Ulyanovsk State University, Ulyanovsk State Pedagogical University, and the National Reserve Park Rechkinsky, located in the Votkinsky district of the Udmurt republic. Scopus record lists also 3 organizations, but there is only one from the original publication (Ulyanovsk State University), two others are located in… Israel and in Tobolsk (city in Tyumen region, 1000+ km from the National Park and 1500+ kms from Ulyanovsk).
the funniest case is also found in ZOOLOGICHESKII ZHURNAL. The article 10.7868/s0044513418050057 in the Lens has the same affiliations as in the original publication - 2 organizations, one in Murmansk, the other in Ivanovo. Here, the most attentive readers can stop me - wait, those strange Russian names were not in the list of 10 cities. Exactly, my friends! The Scopus record assigned the authors (whose names and other details are correct!) to different organizations in other cities - Makhachkala and Vladivostok. I believe that few of you studied the geography of Russia at school, so in order to show you the distance between the cities recorded in the Lens and in Scopus I put them on a map created with the Yandex map constructor.
Let’s review the last case:
The last thing left to do in this experiment with 10 cities is to count the correctly matching affiliations. I found only 8 out of 33 (24%) publications having full lists of correct affiliations in both databases. In 6 out of 8 records, the correct lines in the Lens are in Cyrillic (i.e. they are not easy to find with GRID or in automatic mode).
left_join(sc.city.test, ls.city.test) %>% ungroup() %>%
filter(DOI %in% c("10.1134/s0010952518020089", "10.5800/gt-2018-9-4-0390",
"10.1007/s11055-019-00732-0", "10.7868/s0024114818010035",
"10.7868/s002411481802002x", "10.3103/s1063454118030068",
"10.7868/s0044513418040074", "10.7868/s0044513418050100")) %>%
mutate(label=paste0(src.title, "<br />", DOI)) %>%
select(label, scopus.aff.names, lens.aff.names) %>%
datatable(escape=FALSE,
rownames = FALSE, filter = 'top',
options = list(pageLength = 5,
lengthMenu = c(5, 10, 20),
autoWidth = TRUE,
columnDefs = list(list(width = '450px', targets =c(1:2))))) %>%
formatStyle(1:4, `text-align` = "left")
In our experiment Scopus outperformed the Lens and CrossRef as a source of affiliation information. For the 24 selected journals Scopus contained affiliation info in 95% of documents, the Lens in 45%, CrossRef in only 7%. The reported ratios are specific to the journals (see Test 1).
Both the Lens and Scopus lose a small portion of the affiliations due to errors made in the original publications or during their own processing (less than 1%, see Test 2).
Typical processing errors for the Lens records include:
duplication of the affiliation lines, caused either by multiple MAG records or by the aggregation of Latin and Cyrillic author names. Less than 3% (see Test 3).
low specificity of the GRID-based recognition of organizations, or lost affiliations (see cases in Test 4).
In the randomly selected dataset of publications only 24% of the records had complete affiliations, and only 25% of those were in Latin transcription in the Lens; the other records with specific and complete affiliations were in Cyrillic. The remaining records bore the errors mentioned above.
The selected databases contain different formats of the author family names, given names, and the initials, which complicates a use of the author names for matching or other analysis.
The analysis showed that so far the Lens and CrossRef can’t be regarded as comprehensive sources of affiliation information. These results call for additional efforts from the academic community to stimulate the publishers to prepare richer metadata (of proper quality) for the journal web sites, PDF files and XMLs supplied to CrossRef and other databases.
I am grateful to the Lens & CrossRef teams for what they do. Many thanks to all the experts who care about sharing their experience and contribute to the community with free tutorials and kind advice. Love to the dearest R community.
Lutay, A. (2019, July 6). Author and Affiliation Information in 24 Russian Journals in the Lens, Scopus, and CrossRef (Version 1). figshare. <https://doi.org/10.6084/m9.figshare.8786906>
Hadley Wickham (2017). tidyverse: Easily Install and Load the ‘Tidyverse’. R package version 1.2.1. https://CRAN.R-project.org/package=tidyverse
Hadley Wickham (2018). scales: Scale Functions for Visualization. R package version 1.0.0. https://CRAN.R-project.org/package=scales
Winston Chang (2014). extrafont: Tools for using fonts. R package version 0.17. https://CRAN.R-project.org/package=extrafont
Bob Rudis and Dave Gandy (2019). waffle: Create Waffle Chart Visualizations. R package version 1.0.0. https://github.com/hrbrmstr/waffle/tree/cran
Yihui Xie (2019). knitr: A General-Purpose Package for Dynamic Report Generation in R. R package version 1.23.
Yihui Xie, Joe Cheng and Xianying Tan (2019). DT: A Wrapper of the JavaScript Library ‘DataTables’. R package version 0.7. https://CRAN.R-project.org/package=DT