INTRO

In the previous part (rpubs.com) we selected 24 journals of Russian origin and investigated how their 2018-2019 contents are represented in CrossRef, Scopus and the Lens. We found that the coverage varies by journal, and that the share of publications present in all three databases is close to 75%.

This part is devoted to the author and affiliation information, which is widely used in scientometric analysis.

Scopus is known to extract the affiliation information as it appears in the original publications. This policy makes sense: so many organizations are merged, split, or renamed that Elsevier can hardly follow and implement all these changes, so they store the original information from the publications. At the same time, Scopus has affiliation profiles that combine the publications with the corresponding affiliation texts. Those profiles can be queried and used for further analysis in the Scopus UI and API. The inherent weakness of the profile approach is that, due to the extremely high variability of the name variants, it is almost impossible to keep the profiles accurate. Elsevier allows organizations to check and validate the accuracy of their affiliation profiles (off the subscription), but this activity is not mandatory, so the profiles are cleaned and monitored mostly by the universities that care about their positions in the international rankings (approx. 50 out of 300+ Russian universities). When Scopus fails to recognize the affiliation text, it assigns the publication to a temporary affiliation profile that may comprise as few as 1-2 publications. Such profiles, like space dust, exist in Scopus until the owner (university) asks to re-assign the publications to its official, one and only, affiliation profile.

CrossRef relies on the contents provided by the publisher. The service accepts the affiliation info, but many publishers neglect this opportunity. We will soon find out to what extent.

The Lens aggregates information from Microsoft Academic, CrossRef and Medline (all of them collect the affiliation info directly from the publications), so it is very interesting to see how accurate the Lens records are. Some affiliations in the Lens are matched to GRID records, which is cool, as GRID (the name stands for Global Research Identifier Database) is a database of research organizations with a CC0 license that “with a little help of all the world enthusiasts” can become as indispensable as Wikipedia. Yet this database has the same problem as the Scopus profiles: if the affiliation text is not standard, GRID can’t help, so such records are either lost for analysis or require extra work.

So let’s check it.

SCOPUS DATA PROCESSING

The Scopus dataset was downloaded as a CSV file, which is often simpler to process; the only problem we faced in the first part was merging the funding text fields (see the previous part). But when it comes to the affiliation info, things get more difficult: when an author has more than one affiliation, the string in the Scopus CSV lists them one after another.

There is no special delimiter between the different organizations, so the only way to obtain the unique {Author-Affiliation} pairs from the Scopus CSV is to use regular expressions for the author names and then split the strings after the country names. As often happens, the procedure is not smooth and requires some workarounds. This time we extract only the author names and leave the remaining affiliation string in concatenated form.

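# packages used throughout this part; `dir` (the project folder) is assumed to be defined as in the previous part
library(tidyverse)  # readr, dplyr, tidyr, stringr, purrr, ggplot2
library(scales)     # percent()
library(DT)         # datatable(), formatStyle()

sc.affils <- read_csv(paste0(dir, "/data/scopus/sc.data.csv"), 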
                      col_types = cols(ISSN = col_character())) %>% 
  select(DOI, "year"=Year, ISSN, "src.title"=`Source title`, eid=EID, 
         "auth.affils"=`Authors with affiliations`) %>% 
  mutate(DOI=trimws(tolower(DOI)), ISSN=toupper(ISSN), 
         src.title=trimws(toupper(src.title))) %>% 
  unique() %>%
# the author names (each followed by the affiliation info) are separated by semicolons  
  mutate(auth.affils=strsplit(auth.affils, split="; ")) %>% 
  unnest(auth.affils)

## this is a regular expression for author names in Scopus
formula="^[[:alpha:]\\-\\'\\s\\’\\(\\)]+[[:space:]\\.\\,]{1,2}[[:alpha:]\\.\\-]{1,10}(?=,)"
# this formula is the best I could come up with - it processes names like "Bruce L.-I." or "Jesus Maria Gutierez C.F.N."  
# Scopus puts extra commas before suffixes, as in "Manino A, Jr." or "Carl Jenkins, IV.", see the workaround for Jr. below

sc.affils <- sc.affils %>% mutate(auth.affils=gsub(", Jr.,","Jr.,", auth.affils)) %>% 
  mutate(auth.affils=gsub(", Jr.","Jr.", auth.affils))  %>% 
  mutate(auth.affils=gsub(" Jr., ",", Jr.", auth.affils))  %>% 
# the next 2 lines are a workaround to extract the author names that are not followed by a comma and text  
  mutate(auth.affils=paste0(auth.affils, ",")) %>% 
  mutate(auth.affils=gsub(", ,",",",auth.affils)) %>% 
# extracting name  
  mutate(name = sapply(str_extract(auth.affils,pattern=formula), 
                       function(x) unlist(x, use.names = FALSE))) %>%  
# extracting the affiliation text by removing the names
  mutate(affline = sapply(str_replace(auth.affils, pattern= name, ""), 
                       function(x) unlist(x, use.names = FALSE))) %>%
# cleaning the commas and spaces (both ends)
  mutate(affline = sapply(str_replace_all(affline,
                                      pattern="(^[[:space:][:punct:]]+|[[:space:][:punct:]]+$)", ""),
                      function(x) trimws(unlist(x, use.names = FALSE)))) %>%
# replacing the empty and 1-character cells with NA 
  mutate(affline = ifelse(nchar(affline)<2,NA,affline)) %>% 
  select(-auth.affils)

sc.affils %>% write_excel_csv(paste0(dir, "/data/scopus/sc.affils.csv"))

Now we have a dataset with separate {author - mixture of the author’s affiliations} pairs in rows.

## Observations: 10,182
## Variables: 7
## $ DOI       <chr> "10.21517/0202-3822-2018-41-3-93-104", "10.21517/020...
## $ year      <dbl> 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018...
## $ ISSN      <chr> "02023822", "02023822", "02023822", "02023822", "020...
## $ src.title <chr> "PROBLEMS OF ATOMIC SCIENCE AND TECHNOLOGY, SERIES T...
## $ eid       <chr> "2-s2.0-85056665925", "2-s2.0-85056665925", "2-s2.0-...
## $ name      <chr> "Dokuka, V.N.", "Gostev, A.A.", "Khayrutdinov, R.R."...
## $ affline   <chr> "State Research Center of Russian Federation, Troits...
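
The concatenated affline strings are left as they are here, but the "split after the country names" idea mentioned above can be sketched in base R. This is only an illustration under assumptions: the `countries` vector is a deliberately short, made-up list, and the sample string is glued together from two affiliation lines quoted further down in this post. The lookbehind keeps the country name at the end of each chunk instead of consuming it.

```r
# an affline-like string built from two affiliation lines quoted later in this post
affline <- paste(
  "Kola Scientific Center, Russian Academy of Sciences, Apatity, Russian Federation,",
  "Sechenov Institute of Evolutionary Physiology and Biochemistry,",
  "Russian Academy of Sciences, St. Petersburg, Russian Federation"
)

# illustrative, deliberately incomplete list of country names (an assumption)
countries <- c("Russian Federation", "Ukraine", "Israel", "United Kingdom", "Germany")

# one alternative per country: a comma plus spaces that directly follows the country name;
# the lookbehind keeps the country inside the preceding chunk
pattern <- paste0("(?<=", countries, "),\\s+", collapse = "|")

parts <- strsplit(affline, pattern, perl = TRUE)[[1]]
print(parts)
```

Such a split is only as good as the country list, and it would still stumble on place names that coincide with country names, so the result needs a manual check.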

LENS DATA PROCESSING

We process the same JSON that we exported earlier.

# my use of purrr functions may seem amateurish, and so it is, sorry 
ls.data <- jsonlite::fromJSON(paste0(dir, "/data/lens/lens-export.json"), flatten=TRUE)

ls.authors <- ls.data %>% select(authors, lens_id)  %>% 
  mutate(authors = map_if(authors, is.null, ~ tibble())) %>%
  unnest(authors)

ls.affils <- ls.authors %>% 
  mutate(affiliations = map_if(affiliations, is_empty, ~ tibble())) %>%
  unnest(affiliations) %>% 
  full_join(ls.authors %>% select(-affiliations)) %>% 
  purrr::modify(~replace(.x,lengths(.x)==0,list(NA))) %>% 
  modify_if(~all(lengths(.x)==1),unlist)

ls.meta <- read_csv(paste0(dir, "/data/lens/ls.meta.csv")) %>% 
  select(lens_id, DOI="doi", src.title="source.title_full",
         "year"= year_published, "type"=publication_type, print, electronic) %>%
  filter(type!="journal-issue"&type!="journal issue") %>% 
  mutate(DOI=trimws(tolower(DOI)),
         print=toupper(print), 
         electronic=toupper(electronic),
         src.title=trimws(toupper(src.title))) %>% 
## adding Scopus ISSN as we agreed to use them as connectors
  mutate(ISSN=ifelse(print %in% sc.affils$ISSN, print,
                     ifelse(electronic %in% sc.affils$ISSN, electronic, NA))) %>% 
  unique() 

# unnesting ls.affils drops the rows with NULL affiliations, so we recover them via left_join 
ls.meta %>% 
  left_join(ls.affils, by=c("lens_id")) %>% 
  write_excel_csv(paste0(dir, "/data/lens/ls.affils.csv"))

This is a Lens dataset with the author and affiliation metadata that we are going to use further.

## Observations: 13,624
## Variables: 16
## $ lens_id         <chr> "000-101-321-426-866", "000-101-321-426-866", ...
## $ DOI             <chr> "10.17580/gzh.2018.11.15", "10.17580/gzh.2018....
## $ year            <dbl> 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018...
## $ type            <chr> "journal article", "journal article", "journal...
## $ print           <chr> "00172278", "00172278", "00172278", "00172278"...
## $ electronic      <chr> NA, NA, NA, NA, NA, "23134836", "23134836", "2...
## $ ISSN            <chr> "00172278", "00172278", "00172278", "00172278"...
## $ collective_name <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ first_name      <chr> "I. Yu.", "Moscow", "P. A.", "O. S.", "S. V.",...
## $ initials        <chr> "IY", "M", "PA", "OS", "SV", "KP", NA, "В", "Д...
## $ last_name       <chr> "Maslov", "Global Mining Explosive—Russia", "B...
## $ name            <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "R...
## $ grid.addresses  <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "R...
## $ grid.id         <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "g...
## $ grid            <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ src.title       <chr> "GORNYI ZHURNAL", "GORNYI ZHURNAL", "GORNYI ZH...

CROSSREF DATA PROCESSING

The CrossRef JSONs are unnested and merged into one dataset in the same way as the Lens JSON, except that we have to merge a few files.

ff<-list.files(paste0(dir,"/data/crossref/"))
ff<-ff[grepl("works",ff)]
cr.data <- data.frame()

for (i in seq_along(ff)){
  cr.json <- jsonlite::fromJSON(paste0(dir, "/data/crossref/",ff[i]), flatten=TRUE)
# again, to deal with lost NULL_containing lines, we process in two steps
  cr.auths <- cr.json$message$items %>%
    select(DOI, author) %>%
    mutate(author = map_if(author, is_empty, ~ tibble())) %>%
    unnest(author) 
# now the addresses  
  cr.addresses<- cr.auths %>% 
    mutate(affiliation = map_if(affiliation, is_empty, ~ tibble())) %>%
    unnest(affiliation)  

# joining  
  cr.affils <- cr.auths %>% select(-affiliation) %>% left_join(cr.addresses)
# merging in one file (bind_rows tolerates files with different sets of columns)      
  cr.data <- bind_rows(cr.data, cr.affils)
}

# the affiliation names are in both the name and name1 columns, we coalesce them
cr.data <- cr.data %>% mutate(name=coalesce(name, name1)) %>% select(-name1)
### adding metadata
cr.meta <- read_csv(paste0(dir, "/data/crossref/cr.meta.csv")) %>%
  select(DOI, "src.title"=source.title, type, 
         "year"=pubdate, print, electronic) %>%
  filter(type!="journal-issue"&type!="journal issue") %>% 
# some formatting
  mutate(DOI=trimws(tolower(DOI)),
         print=toupper(print), 
         electronic=toupper(electronic),
         src.title=trimws(toupper(src.title))) %>% unique() %>% 
# adding Scopus ISSNs as connectors
  mutate(ISSN=ifelse(print %in% sc.affils$ISSN, print,
                     ifelse(electronic %in% sc.affils$ISSN, electronic, NA))) %>% 
  unique()   

cr.meta %>% 
  left_join(cr.data %>% select(-affiliation)) %>% 
  write_excel_csv(paste0(dir, "/data/crossref/cr.affils.csv"))

Below is a summary of the dataset with the author & affiliation metadata from the CrossRef records that we are going to use further.

cr.affils <- read_csv(paste0(dir, "/data/crossref/cr.affils.csv")) %>% 
     select(-src.title) %>% left_join(sc.labs) # making the universal src.title names
glimpse(cr.affils)
## Observations: 17,162
## Variables: 13
## $ DOI                   <chr> "10.32607/2075-8251-2017-9-4-84-91", "10...
## $ type                  <chr> "journal-article", "journal-article", "j...
## $ year                  <dbl> 2018, 2018, 2018, 2018, 2018, 2018, 2018...
## $ print                 <chr> "20758251", "20758251", "20758251", "207...
## $ electronic            <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ ISSN                  <chr> "20758251", "20758251", "20758251", "207...
## $ given                 <chr> "A. A.", NA, "I. G.", "T. K.", "V. A.", ...
## $ family                <chr> "Panina", NA, "Dementieva", "Aliev", "To...
## $ sequence              <chr> "first", "first", "additional", "additio...
## $ name                  <chr> NA, "Shemyakin-Ovchinnikov Institute of ...
## $ ORCID                 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ `authenticated-orcid` <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ src.title             <chr> "ACTA NATURAE", "ACTA NATURAE", "ACTA NA...

TEST 1. PRESENCE OF AFFILIATION DATA IN THE DATABASES

The three datasets are compared by the share of publications containing affiliation info.

# scopus
sc_x <- sc.affils %>%
  mutate(status=!is.na(affline)) %>%
  group_by(src.title, ISSN, eid) %>% 
  summarize(status=sum(status)) %>% 
  mutate(status=ifelse(status!=0, "present", "missing")) %>%
  add_count(src.title) %>% 
  group_by(src.title, ISSN, status, n) %>% 
  summarize(pubs=n_distinct(eid)) %>% ungroup()

# crossref
cr_x <- cr.affils %>%
  mutate(status=!is.na(name)) %>%
  group_by(src.title, ISSN, DOI) %>% 
  summarize(status=sum(status)) %>% 
  mutate(status=ifelse(status!=0, "present", "missing")) %>%
  add_count(src.title) %>% 
  group_by(src.title, ISSN, status, n) %>% 
  summarize(pubs=n_distinct(DOI)) %>% ungroup()

# lens
ls_x <- ls.affils %>%
  mutate(status=!is.na(name)) %>%
  group_by(src.title, ISSN, lens_id) %>% 
  summarize(status=sum(status)) %>% 
  mutate(status=ifelse(status!=0, "present", "missing")) %>%
  add_count(src.title) %>% 
  group_by(src.title, ISSN, status, n) %>% 
  summarize(pubs=n_distinct(lens_id)) %>% ungroup()

x_x <- rbind(ls_x %>% mutate(source="Lens"), 
             cr_x %>% mutate(source="CrossRef"),
             sc_x%>% mutate(source="Scopus")) %>% 
# making the labels (in advance) 
  mutate(label2=percent(pubs/n, accuracy=1)) %>% 
# making the shorter labels for the journal names
  mutate(label=ifelse(nchar(src.title)<64,
                      str_wrap(src.title,25),
                      str_wrap(paste0(substr(src.title, 1,64),"..."),25)))

x_x %>% 
  ggplot(aes(x=reorder(ISSN,n), y=pubs, fill=status))+
  geom_col()+
  geom_text(inherit.aes = FALSE, data=x_x[x_x$status=="present",],
            aes(x=reorder(ISSN,n), y=n+2, label=percent(pubs/n, accuracy=1)),
            size=2.7, fontface="bold", hjust=0)+
  facet_wrap(~source, ncol=3)+
  scale_y_continuous(expand = expansion(mult=c(0,0.15)))+
  coord_flip()+
  scale_fill_manual(values=c("coral", "#0072B2"), name="SOURCE")+
  labs(title="PRESENCE OF AFFILIATION INFO",
       subtitle="PUBYEAR: 2018-2019",
       caption="Accessed: June 29, 2019", 
       y="PUBLICATIONS", x="ISSN")+
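  mytheme

# ggsave() is called separately (it saves the last plot); adding it with "+" fails in recent ggplot2 versions
ggsave(paste0(dir, "/affils_long.png"), width=20, height = 12, units="cm", dpi=300)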

Another projection of the same data, stressing the ratios.

Scopus contains almost complete affiliation records for 23 out of 24 journals. CrossRef records are still deficient in affiliation info, with empty affiliation lines for 19 out of 24 journals. The Lens data are more heterogeneous, with the share of affiliation-containing records varying from zero to 90%.

The observed ratios should be read with the caveat that some of these journals are produced by university publishing houses, which may lack the experience and motivation to prepare extended metadata for CrossRef. The ratios for international journals published in a more professional manner may be less dramatic.

TEST 2. MISSING ORGANIZATION DATA

As we have already found that CrossRef is highly deficient in affiliation info, we continue with the Lens and Scopus only. The next step is to assess how many records/authors miss the affiliation info while it is present in the original publications. To do that, we define a set of publications that are present in both the Lens and Scopus datasets and have non-empty affiliation lines.

## filtering the DOIs that have at least one non-empty affiliation line
doi_int <- intersect(
  ls.affils %>% filter(!is.na(name)) %>% select(DOI),
  sc.affils %>% filter(!is.na(affline)) %>% select(DOI)) %>% 
  na.omit()

The intersection of the Lens and Scopus publications with non-empty affiliation strings comprises 1357 DOIs (approx. 40% of all records in the Lens search results). The next step is to convert the data into a list of unique {DOI-Author} pairs and to check each pair for affiliation lines.

# filtering datasets by pre-selected DOI
sc_y<- sc.affils %>% filter(DOI %in% doi_int$DOI) %>% 
  mutate(status=!is.na(affline)) %>%
  mutate(author=toupper(gsub("\\.","", name))) %>% 
# removing duplicates (one DOI, few PIDs) 
  select(DOI, ISSN, src.title, author, status) %>% unique() %>% 
# flagging each {DOI-author} pair: 1 if it has at least one affiliation line, 0 otherwise
  group_by(DOI, ISSN, src.title, author) %>% 
  summarize(affil.count=as.integer(max(status, na.rm = TRUE))) %>%  ungroup()

ls_y<- ls.affils %>% filter(DOI %in% doi_int$DOI) %>% 
  mutate(status=!is.na(name)) %>%
  mutate(author=toupper(paste0(last_name, ", ",initials))) %>%
# removing duplicates (one DOI, few PIDs) 
  select(DOI, ISSN, src.title, author, status) %>% unique() %>% 
# flagging each {DOI-author} pair: 1 if it has at least one affiliation line, 0 otherwise
  group_by(DOI, ISSN, src.title, author) %>% 
  summarize(affil.count=as.integer(max(status, na.rm = TRUE))) %>%  ungroup()
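
The logic of this test (flag every {DOI-author} pair, then compare the number of authors with the number of flagged pairs per DOI) can be tried on a toy table. The sketch below uses base R only; the DOIs, author names and organizations are made up for illustration and are not taken from the datasets.

```r
# hypothetical {DOI - author - affiliation line} rows, for illustration only
toy <- data.frame(
  DOI     = c("10.0000/a", "10.0000/a", "10.0000/b", "10.0000/b", "10.0000/b"),
  author  = c("IVANOV, AA", "PETROV, BB", "SIDOROV, CC", "SIDOROV, CC", "ORLOV, DD"),
  affline = c("Org 1", "Org 2", "Org 3", NA, NA),
  stringsAsFactors = FALSE
)
toy$status <- !is.na(toy$affline)

# one row per {DOI-author} pair: 1 if the pair has at least one affiliation line
pairs <- aggregate(status ~ DOI + author, data = toy, FUN = max)

# per DOI: number of authors vs number of authors with an affiliation
no.authors      <- tapply(pairs$author, pairs$DOI, length)
no.affiliations <- tapply(pairs$status, pairs$DOI, sum)
per_doi <- data.frame(
  DOI             = names(no.authors),
  no.authors      = as.integer(no.authors),
  no.affiliations = as.integer(no.affiliations[names(no.authors)]),
  stringsAsFactors = FALSE
)

# the DOIs where some author has no affiliation line
suspect <- per_doi[per_doi$no.authors != per_doi$no.affiliations, ]
print(suspect)
```

In the toy data only the second DOI is flagged, because one of its two authors has no affiliation line at all.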

MISSED AFFILIATION STRINGS IN SCOPUS

Scopus returned 5398 unique {DOI-Author} pairs. After we calculate the number of authors and the number of affiliations for each DOI, we can filter the records that are likely to have lost affiliation info (i.e. those with different numbers of authors and affiliations).

sc_y %>% 
  group_by("Journal"= src.title, DOI) %>% 
  summarize(no.authors=n(), no.affiliations=sum(affil.count)) %>% 
# filtering the records where No of authors =/= No of non-empty affiliation lines
  filter(no.authors!=no.affiliations) %>%
  datatable(escape=FALSE,rownames = FALSE, options = list(dom = 't')) %>% 
  formatStyle(3:4, `text-align` = "center")

There are just 6 publications in the Scopus dataset that have authors without affiliation info. Let’s review some of the examples.

  • 10.1016/j.mencom.2018.07.027 from MENDELEEV COMMUNICATIONS contains an error in the original publication. The name of the last author is written with a comma, “Nicolai, V.Bovin”, so all the databases (Scopus, CrossRef, Lens, MAG) regarded this string as representing two authors, Nicolai and V.Bovin, and the latter got everything.

  • 10.3367/ufne.2017.12.038309 from USPEKHI-PHYSICS has another type of irregularity. The affiliation records in the original publication are correct. CrossRef does not have affiliation info in its JSON, MAG recognized just the Russian Academy of Sciences (while the original affiliations are 2 institutions of the Russian Academy of Sciences and MIPT University). The Lens inherited the affiliation data from MAG but lost MIPT, so it is also only RAS there. Scopus recognized all three organizations, but assigned all three affiliations only to the first author (seems like an accidental error, not a systemic bug).

  • 10.1007/s11055-018-0655-8 from NEUROSCIENCE AND BEHAVIORAL PHYSIOLOGY has 6 authors and only 5 of them have the affiliation text in the original publication. CrossRef record has no affiliation info. All other databases replicated the error in their records.

  • 10.15826/qr.2018.3.317 from QUAESTIO ROSSICA has 2 authors with complete affiliation info in the original publication. Both authors are affiliated with Ural Federal University. The same information is registered in CrossRef. But as the journal has a somewhat ambiguous layout (at least in the section with the author information), the other databases (Lens, MAG, Scopus) also extracted the name of the translator, so their records contain 3 authors. In addition, MAG created a new affiliation line for all three authors, while the Lens assigned that additional affiliation to the correct authors and left only the translator affiliated with Ural Federal University.

The conclusion is quite obvious: missed affiliation lines in Scopus emerge from errors in the original publications (an unclear layout of the publication is also the publisher’s fault, in my humble opinion), and they are quite rare (6 out of 1357, less than 0.5%).

MISSED AFFILIATION STRINGS IN LENS

The Lens dataset contains 5513 unique {DOI-Author} pairs. The difference in the number of {DOI-Author} lines between the two databases is interesting; we will find out the reason later, but first let’s see how many authors lack affiliation info.

ls_y %>% 
  group_by("Journal"= src.title, DOI) %>% 
  summarize(no.authors=n(), no.affiliations=sum(affil.count)) %>% 
# filtering the records where No of authors =/= No of non-empty affiliation lines
  filter(no.authors!=no.affiliations) %>% 
  datatable(escape=FALSE,rownames = FALSE, options = list(dom = 't')) %>% 
  formatStyle(3:4, `text-align` = "center")

There are just 9 publications (versus 6 in Scopus) with authors missing affiliation info. Some of them are worth reviewing.

  • 10.14341/dm8339 from DIABETES MELLITUS seems to miss affiliation info for 3 authors! The journal’s web site shows affiliation info individually for each author. The PDF, however, shows only one affiliation line, as all the authors are actually from one university. The record in CrossRef lacks the affiliation info. MAG shows one affiliation for all 6 authors. Scopus has complete separate lines for each of the 6 authors. The Lens record has three authors with affiliation info and three authors that lack it.

  • 10.17323/1813-8691-2018-22-2-169-196 from HSE ECONOMIC JOURNAL has 2 authors in Lens, but one lacks the affiliation info. Scopus and CrossRef records have complete lines for both authors.

  • 10.15826/qr.2018.1.290 from QUAESTIO ROSSICA. Do you remember that article about the backwardness of Russia that we reviewed in part 1? MAG stores three versions of it, the Lens merged them into one record, but the record obtained two versions of the only author’s name (in Cyrillic and in Latin).

These rare cases are obviously just accidents, not systemic bugs.

TEST 3. MISSING AUTHOR DATA

As the next step, we can identify the records that have different numbers of authors in the Lens and in Scopus. For that we count the authors for every DOI in both the Scopus and the Lens datasets, and then filter those that do not have equal numbers.

To visualize such articles, we will use the development version of the waffle R package and its fantastic function geom_waffle, which allows building waffle charts with ggplot (very useful!).

y_y2 <- full_join(sc_y %>% count(ISSN, src.title, DOI, name = "no.authors"),
                  ls_y %>% count(ISSN, DOI, name = "no.authors"),
                by=c("DOI", "ISSN"), suffix=c(".scopus", ".lens")) %>% 
# filtering the records with different number of authors 
  mutate(status=(no.authors.scopus==no.authors.lens)) %>%
# grouping around source titles (journals)
  group_by(ISSN, src.title, status) %>% summarize(n=n()) %>% ungroup()

# install the development version of waffle package
# I did it with githubinstall::gh_install_packages("hrbrmstr/waffle")
library(waffle)

y_y2 %>% 
# shortening the labels  
  mutate(label=ifelse(nchar(src.title)<64, 
                      str_wrap(src.title,25),
                      str_wrap(paste0(substr(src.title, 1,64),"..."),25))) %>% 
  ggplot(aes(fill = status, values = n)) + 
  geom_waffle(color = "white", size = .25, n_rows = 20, flip = TRUE) +
  facet_wrap(~ label, ncol = 6, strip.position="bottom") +
  scale_x_discrete() + 
  scale_y_continuous(labels = function(x) x * 20, # make this multiplier the same as n_rows
                     expand = c(0,0)) +
# status is logical, so FALSE (different) comes first and TRUE (equal) second
  scale_fill_discrete(name=NULL, labels=c("different", "equal"))+
  labs(title = "AUTHOR INFO IN THE LENS AND SCOPUS RECORDS",
       subtitle = "QUESTION: Is there the same number of authors?",
    x = "1356 DOIs, 1 SQUARE = 1 DOI", y = "PUBLICATIONS (COUNT)", caption="Accessed: June 29, 2019") +
  guides(fill = guide_legend(reverse = TRUE))+
  coord_equal()+
  mytheme+
  theme(legend.position = "bottom",
        legend.justification = c("right", "top"),
        strip.text = element_text(size=rel(0.6), hjust=0),
        strip.background = element_rect(fill="lightyellow", color=NA),
    panel.grid = element_blank(), 
    axis.line.y = element_line(),
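    axis.ticks.y = element_line())

# ggsave() is called separately (it saves the last plot); adding it with "+" fails in recent ggplot2 versions
ggsave(paste0(dir, "/affils_lost_authors.png"), 
       width=20, height = 13, units="cm", dpi=300)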

Due to the piecemeal presence of the affiliation info in the Lens, the selected DOIs represent the journals unequally. Some journals are not present at all, as their publications lack affiliation info in the Lens. A few journals are present with just a few DOIs. Let’s review the publications that have different numbers of authors and try to figure out the source of the discrepancy.

full_join(sc_y %>% count(src.title, DOI, name = "no.authors"),
          ls_y %>% count(src.title, DOI, name = "no.authors"),
          by=c("DOI","src.title"), suffix=c(".scopus",".lens")) %>% 
  filter(no.authors.scopus!=no.authors.lens) %>% 
  # check if the numbers of authors are just doubled
  mutate(doubled=(no.authors.lens==2*no.authors.scopus)) %>% arrange(doubled) %>% 
  datatable(escape=FALSE,rownames = FALSE, options = list(dom = 't')) %>% 
  formatStyle(3:5, `text-align` = "center")

There are only 33 articles with unequal numbers of authors in Scopus and the Lens, and except for 3 articles in NEUROSCIENCE AND BEHAVIORAL PHYSIOLOGY, the numbers differ two-fold. Such duplication can be caused by merging the records in Cyrillic and Latin scripts. The majority of such publications come from 2 journals, DIABETES MELLITUS and PEDIATRIC TRAUMATOLOGY, ORTHOPAEDICS AND RECONSTRUCTIVE SURGERY. Based on the number of such publications, it is hard to conclude on the real reasons: whether it was just one journal issue incorrectly prepared, or whether the editorial board tried something special like a bilingual issue. The other publications from those journals are not present on the chart, as they lack affiliation info in the Lens.

For instance, the Lens record for the publication 10.14341/dm9429 from DIABETES MELLITUS has duplicated both the authors and the affiliations.

The total difference in the numbers of authors across the 33 selected publications is 115, which is exactly the difference that we observed between the numbers of unique {DOI-author} pairs in the Scopus and Lens datasets (5398 and 5513, see Test 2 above).
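
The comparison in this test boils down to joining two per-DOI author counts and looking at the gap. A minimal base R sketch on made-up data (the DOIs and counts below are hypothetical, not taken from the datasets) shows both the two-fold check and the total gap:

```r
# hypothetical per-DOI author counts, for illustration only
scopus_counts <- data.frame(DOI = c("10.0000/x1", "10.0000/x2", "10.0000/x3"),
                            no.authors = c(3L, 4L, 6L),
                            stringsAsFactors = FALSE)
lens_counts   <- data.frame(DOI = c("10.0000/x1", "10.0000/x2", "10.0000/x3"),
                            no.authors = c(3L, 8L, 5L),
                            stringsAsFactors = FALSE)

# merge() adds the suffixes in the order of its arguments (x first, y second)
cmp <- merge(scopus_counts, lens_counts, by = "DOI",
             suffixes = c(".scopus", ".lens"), all = TRUE)

# keep only the DOIs with unequal counts and flag the exact two-fold cases
diff_rows <- cmp[cmp$no.authors.scopus != cmp$no.authors.lens, ]
diff_rows$doubled <- diff_rows$no.authors.lens == 2 * diff_rows$no.authors.scopus

# total surplus of Lens authors over Scopus authors in the unequal records
total_gap <- sum(diff_rows$no.authors.lens - diff_rows$no.authors.scopus)
print(diff_rows)
print(total_gap)
```

The order of the suffixes matters only for the column labels; the equality check itself is symmetric.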

TEST 4. MATCHING ORGANIZATIONS

As some affiliation names in the Scopus CSV are concatenated into one string without a special delimiter, one needs to make a special effort to extract a list of individual organizations. During this exciting journey, one will learn that there are many Lebanons in the US (visit https://lebanonusa.com/ to know more about that) and other fancy facts alike. Once the unique affiliation strings are extracted, they have to be normalized, which, due to the highly variable nature of affiliations in general (and Russian affiliations in particular), promises another set of tortures. Usually I apply Geonames, GRID, and Nominatim in sequence, and then there is still a lot of manual work to identify the broken names and abbreviations.

We are not going to do all of this now. Instead, we will do a quick test that I call “10 Russian city names”.

This test will allow us to check whether the Lens records contain only the large & medium organizations (present in GRID) or also the names of smaller and less known organizations.

# first the names: I list the Cyrillic and Latin variants 
rusaffs <- paste("Tobolsk|Tobol'sk|Тобольск", "Obninsk|Обнинск", 
            "Izhevsk|Ижевск", "Birobidzhan|Биробиджан",
            "Vladikavkaz|Владикавказ", "Apatity|Апатиты", 
            "Sochi|Сочи", "Makhachkala|Махачкала",
            "Saransk|Саранск", "Norilsk|Noril'sk|Норильск" , sep="|")

city.doi <- sc.affils %>% 
  filter(DOI %in% doi_int$DOI) %>% 
  filter(grepl(rusaffs, affline)) %>% 
  select(DOI) %>% unique()
  
sc.city.test <- sc.affils %>% 
  filter(DOI %in% city.doi$DOI) %>%
  select(DOI, src.title, affline) %>% unique() %>%
# to ease the reading we split the Scopus affiliation lines here  
  mutate(affline=
           strsplit(affline, 
                    split=", Russian Federation,|, Ukraine,|, Israel,|, United Kingdom,|, Germany,"))%>%
  unnest(affline) %>% unique() %>% arrange(affline) %>% 
  mutate(affline= paste0("- ", affline)) %>% 
  group_by(src.title, DOI) %>% 
  summarize(scopus.aff.names =  paste0(affline, collapse="<br /> <br />"))

ls.city.test <- ls.affils %>% 
  filter(DOI %in% city.doi$DOI) %>% 
  select(DOI, src.title, name, grid.addresses, grid.id, grid) %>% unique() %>%
  mutate(name= ifelse(is.na(grid.id), paste0("- ", name), paste0("- ", name,": ", grid.id))) %>% 
  group_by(src.title, DOI) %>% 
  summarize(lens.aff.names =  paste0(name, collapse="<br /> <br />"))
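
How such a city filter behaves can be checked on a few strings before running it on the whole dataset. The sketch below is a simplified illustration with a shortened, Latin-only pattern; the second and third affiliation strings are made up.

```r
# a shortened, Latin-only version of the city pattern (illustration only)
cities  <- c("Tobolsk|Tobol'sk", "Obninsk", "Apatity")
pattern <- paste(cities, collapse = "|")

# the first string is quoted later in this post, the other two are hypothetical
afflines <- c(
  "Kola Scientific Center, Russian Academy of Sciences, Apatity, Russian Federation",
  "Some University, Moscow, Russian Federation",
  "Some Institute, Tobol'sk, Russian Federation"
)

# TRUE where at least one of the city name variants occurs in the string
hits <- grepl(pattern, afflines)
print(data.frame(afflines, hits))
```

A plain alternation like this also matches a city name inside a longer word; wrapping each variant in `\\b` word boundaries is one possible way to tighten it.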

From the dataset of 1357 DOIs (having affiliation info in both the Scopus and the Lens records) we filtered the publications containing the names of 10 Russian cities (in Cyrillic and in Latin transliteration) and found just 33 unique publications. The table below shows the affiliation lines from both databases.

left_join(sc.city.test, ls.city.test) %>% ungroup() %>% 
  mutate(label=paste0(src.title, "<br />", DOI)) %>% 
  select(label, scopus.aff.names, lens.aff.names) %>%   
  datatable(escape=FALSE,
          rownames = FALSE, filter = 'top', 
          options = list(pageLength = 5,
                         lengthMenu = c(5, 10, 20),
                         autoWidth = TRUE,
                         columnDefs = list(list(width = '450px',  targets =c(1:2))))) %>% 
  formatStyle(1:3, `text-align` = "left")

This list can be difficult to grasp at first glance, so I will briefly review the records below and mark the typical problems.

TOO GENERAL NAMES

20 out of 33 records in the Lens contain the GRID name “Russian Academy of Sciences” with ID grid.4886.2. The Russian Academy of Sciences (RAS) is not a typical organization; it is rather a huge network of research institutes. A few years ago RAS comprised 750+ organizations that were restructured into 450+ research institutions (including 30+ federal research centers that may comprise 5-10 research institutions each), arranged in 4 territorial branches (Ural, Siberian, Far Eastern, & Central). I believe this explains why using “Russian Academy of Sciences” as an organization identifier is no better than saying “an organization from Siberia”. So about 60% of the selected records (20 out of 33) lost their specificity in the Lens.

For instance, 10.1007/s11055-018-0596-2 in NEUROSCIENCE AND BEHAVIORAL PHYSIOLOGY has 2 affiliation lines in Scopus:

  • Kola Scientific Center, Russian Academy of Sciences, Apatity, Russian Federation,

  • Sechenov Institute of Evolutionary Physiology and Biochemistry, Russian Academy of Sciences, St. Petersburg, Russian Federation,

and only one in the Lens - RAS.

Unfortunately, this issue is ubiquitous. Many reports about Russian science, regardless of the data source (be it Web of Science or Scopus), show RAS listed among the individual organizations. Nature Index (2019) merges all the RAS institutions into one account, but it is impossible to check whether they did it correctly, since there are many research institutions in Russia that do not belong to RAS, and some that did belong in the past are now governed by other agencies. So the key takeaway here is that the structure of the Russian research institutions needs to be upgraded in GRID.

WRONG NAMES

It is not only the general names that pose a problem: in the selected 33 publications I identified worrying cases of wrong (added / missing) affiliations.

  • 10.7868/s0044513418010075 from ZOOLOGICHESKII ZHURNAL has two affiliations in the Lens record and in the original document, while Scopus shows only one.

  • 10.7868/s0044513418020137 from ZOOLOGICHESKII ZHURNAL. The Lens record has two names in Cyrillic, same as in the original publication. The names are identical, both referring to the Caucasian State Nature Biosphere Reserve, with the only difference that the first name lacks the city, while the second name points at Maikop (a city in the Republic of Adygea). The Scopus record also has two names: the second name points at Sochi (where the head office of the Caucasian State Nature Biosphere Reserve is located), but the first name points at the Kabardino-Balkarian Scientific Centre, Tembotov Institute of Ecology of Mountain Territories, Russian Academy of Sciences, which is located in Nalchik, a city in the Republic of Kabardino-Balkaria (400 km from Maikop). Even if all those centers are parts (in a legal sense) of the Reserve in question, shouldn’t the affiliation lines be more similar?

  • same journal: the original publication for 10.7868/s0044513418040037 and the Lens record show 3 authors affiliated with Ulyanovsk State University, Ulyanovsk State Pedagogical University, and the National Reserve Park Rechkinsky, located in the Votkinsky district of the Udmurt Republic. The Scopus record also lists 3 organizations, but only one of them comes from the original publication (Ulyanovsk State University); the two others are located in… Israel and in Tobolsk (a city in the Tyumen region, 1000+ km from the National Park and 1500+ km from Ulyanovsk).

  • the funniest case is also found in ZOOLOGICHESKII ZHURNAL. The article 10.7868/s0044513418050057 has the same affiliations in the Lens as in the original publication: 2 organizations, one in Murmansk, the other in Ivanovo. Here the most attentive readers can stop me: wait, those strange Russian names were not in the list of 10 cities. Exactly, my friends! The Scopus record assigned the authors (whose names and other details are correct!) to different organizations in other cities, Makhachkala and Vladivostok. I believe that few of you studied the geography of Russia at school, so in order to show the distance between the cities recorded in the Lens and in Scopus I put them on a map created with the Yandex map constructor.

Let’s review the last case:

  • 10.1016/j.mencom.2018.07.016 from MENDELEEV COMMUNICATIONS. The original article, published by Elsevier, has 5 affiliations: (1) M. V. Lomonosov Moscow State University; (2) D. Rogachev National Research Center of Pediatric Hematology, Oncology and Immunology; (3) Institute of Physiologically Active Compounds, Russian Academy of Sciences; (4) Medical Radiology Research Center, Russian Academy of Medical Sciences; (5) A. N. Nesmeyanov Institute of Organoelement Compounds, Russian Academy of Sciences. The Scopus record reproduces this list accurately. The Lens record (as well as that in MAG) has a less accurate list: (A) Moscow State University; (B) Russian Academy of Sciences; (C) Academy of Medical Sciences, United Kingdom; (D) A. N. Nesmeyanov Institute of Organoelement Compounds. Two are correct (1=A, 5=D), one is generic (B could stand for both 3 and 5), and one is wrong, pointing at a UK institution (C). It is very disappointing that the names of two leading Russian institutions conducting research in oncology (2 and 4) are misrepresented due to technical errors.

The last thing left to do in this experiment with 10 cities is to count the correctly matching affiliations. I found only 8 out of 33 (24%) publications having full lists of correct affiliations in both databases. In 6 of those 8 records, the correct lines in the Lens are in Cyrillic (i.e. they are not easy to find with GRID or to process automatically).

left_join(sc.city.test, ls.city.test) %>% ungroup() %>% 
  filter(DOI %in% c("10.1134/s0010952518020089", "10.5800/gt-2018-9-4-0390", 
                              "10.1007/s11055-019-00732-0", "10.7868/s0024114818010035", 
                              "10.7868/s002411481802002x", "10.3103/s1063454118030068", 
                              "10.7868/s0044513418040074", "10.7868/s0044513418050100")) %>% 
  mutate(label=paste0(src.title, "<br />", DOI)) %>% 
  select(label, scopus.aff.names, lens.aff.names) %>%   
  datatable(escape=FALSE,
          rownames = FALSE, filter = 'top', 
          options = list(pageLength = 5,
                         lengthMenu = c(5, 10, 20),
                         autoWidth = TRUE,
                         columnDefs = list(list(width = '450px',  targets =c(1:2))))) %>% 
  formatStyle(1:3, `text-align` = "left") # the table has 3 columns
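
Since the Cyrillic lines are the ones that resist automatic matching, it may help to flag them before any further processing. Below is a minimal sketch (not the code used for this study), assuming the affiliation strings are stored in a character vector. The vector aff and its values are made up for illustration, and the Cyrillic string is written with \u escapes so that the snippet does not depend on the file encoding.

```r
library(stringr)

# Hypothetical affiliation strings (not taken from the dataset);
# the second one is "Institut RAN" written in Cyrillic letters
aff <- c("Ulyanovsk State University",
         "\u0418\u043d\u0441\u0442\u0438\u0442\u0443\u0442 \u0420\u0410\u041d")

# TRUE if the string contains at least one character of the Cyrillic script
is_cyrillic <- str_detect(aff, "\\p{Cyrillic}")

# Share of affiliation lines that would need transliteration or manual matching
mean(is_cyrillic)
```

Such a flag does not fix anything by itself, but it separates the records that can be sent to an automatic matcher from those that require transliteration or manual work.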

TEST 5. MATCHING AUTHOR NAMES

One may think that the author names should hide no problems: a name is either in Latin or in Cyrillic, and that’s it. But the same name can be written in different ways. In this section we take the list of DOIs present in all three databases and check how many {DOI-author name} pairs are present in all 3 DBs and how many author names are unique to one source or another.

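We will be using 3 models for matching the {DOI-author} pairs (they correspond to probe1, probe2, and probe3 in the code below):

  • model 1: DOI + family name + initials;

  • model 2: DOI + family name;

  • model 3: the family name with all non-word characters removed.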

PREPARING THE NAMES

# CrossRef
cr_z <- cr.affils %>% 
  select(DOI, ISSN, given, family, src.title) %>% 
  filter(!is.na(family)) %>% unique() %>% 
# formatting  
  mutate(family=toupper(gsub(" ","", family)), 
         given=toupper(gsub("\\. ","\\.", given))) %>% 
  mutate(name=ifelse(is.na(family)==TRUE|is.na(given)==TRUE,
                     coalesce(family, given),
                     paste0(family," ",given))) %>% 
# function to extract the initials
  mutate(initials=sapply(str_extract_all(given, pattern = "(?<!\\w)\\w"), 
                         function(x) paste0(unlist(x),collapse=""))) %>%  
  select(-given) %>% 
# the probes 1,2,3 are the models described above
  mutate(probe1 = paste0(DOI, "_",family, "_", initials),
         probe2 = paste0(DOI, "_",family),
         probe3 = sapply(str_replace_all(family, pattern="\\W", ""), function(x) unlist(x)))

# And we will do the same for the Scopus and Lens data

## scopus
sc_z <- sc.affils %>% 
  select(DOI, ISSN, src.title, name) %>%
  filter(!is.na(name)) %>% unique() %>% 
  mutate(name=toupper(name)) %>% 
  mutate(family=sapply(str_extract(name, pattern = ".+(?=,)"), 
                       function(x) unlist(x))) %>% 
  mutate(initials=sapply(str_extract(name, pattern = "(?<=, ).+"), 
                          function(x) unlist(x))) %>%
  mutate(initials=sapply(str_extract_all(initials, pattern = "(?<!\\w)\\w"), 
                          function(x) paste0(unlist(x),collapse=""))) %>%
  mutate(probe1 = paste0(DOI, "_",family, "_", initials),
         probe2 = paste0(DOI, "_",family),
         probe3 = sapply(str_replace_all(family, pattern="\\W", ""), function(x) unlist(x)))
## lens
ls_z <- ls.affils %>% 
  select(DOI, ISSN, src.title, "family"=last_name, "initials"=first_name) %>%
  filter(!is.na(family)) %>% unique() %>% 
  mutate(family=toupper(gsub(" ","", family)), 
         initials=toupper(initials)) %>% 
  mutate(initials=sapply(str_extract_all(initials, pattern = "(?<!\\w)\\w"), 
                         function(x) paste0(unlist(x),collapse=""))) %>%
  mutate(probe1 = paste0(DOI, "_",family, "_", initials),
         probe2 = paste0(DOI, "_",family),
         probe3 = sapply(str_replace_all(family, pattern="\\W", ""), function(x) unlist(x)))
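
To make the preparation step easier to follow, here is a minimal sketch of what it does to a single author. The author and the helper first_letters() are made up for illustration; the regular expressions are the same as in the code above.

```r
library(stringr)

# Hypothetical helper: first letter of every word, e.g. "A. V." -> "AV"
first_letters <- function(x) {
  sapply(str_extract_all(toupper(x), "(?<!\\w)\\w"), paste0, collapse = "")
}

# One made-up author as the three sources might record him
cr_given <- "A. V."         # CrossRef: separate given / family fields
ls_first <- "Alexey V"      # Lens: first_name / last_name fields
sc_name  <- "Ivanov, A.V."  # Scopus: "Family, Initials" in one string

# Scopus needs an extra split on the comma
sc_family   <- str_extract(toupper(sc_name), ".+(?=,)")
sc_initials <- first_letters(str_extract(toupper(sc_name), "(?<=, ).+"))

initials <- c(crossref = first_letters(cr_given),
              lens     = first_letters(ls_first),
              scopus   = sc_initials)
initials
```

All three variants collapse to the same initials, so the {DOI + family name + initials} key would match here. It stops matching as soon as one source drops the second initial or records the name in another script.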

In the next step we will create a list of DOIs present in all three datasets.

# Reduce() applies intersect() pairwise: first to the Lens and CrossRef DOIs, then to that result and the Scopus DOIs
dois <- Reduce(intersect, list(ls_z$DOI, cr_z$DOI, sc_z$DOI))
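
For those who wonder what Reduce() does here, a small sketch on made-up DOI vectors (the values are placeholders, not real DOIs from the dataset):

```r
# Hypothetical DOI lists from the three sources
ls_doi <- c("10.1/a", "10.1/b", "10.1/c")
cr_doi <- c("10.1/b", "10.1/c", "10.1/d")
sc_doi <- c("10.1/c", "10.1/b", "10.1/e")

# Reduce() folds the list from left to right with intersect()
common <- Reduce(intersect, list(ls_doi, cr_doi, sc_doi))

# The same result written explicitly
common_explicit <- intersect(intersect(ls_doi, cr_doi), sc_doi)
```

The advantage of Reduce() is that the call stays the same if a fourth source is added to the list.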

MODEL 1. MATCHING {DOI + FAMILY NAME + INITIALS} COMBINATIONS

Now let’s try to match the {DOI-author} pairs using the model with the highest specificity (when an author family name is followed by the initials).

doi_int2 <- 
# merging the extracts  
  ls_z %>% select(DOI, ISSN, src.title, family, probe1) %>% 
  full_join(cr_z %>% select(DOI, ISSN, src.title, family, probe1)) %>% 
  full_join(sc_z %>% select(DOI, ISSN, src.title, family, probe1)) %>% 
# keeping only the DOIs that are present in all three databases 
  filter(DOI %in% dois) %>% 
  unique() %>% na.omit() %>% 
# marking their cross-presence  
  mutate(sc = probe1 %in% sc_z$probe1,
         cr = probe1 %in% cr_z$probe1,
         ls = probe1 %in% ls_z$probe1) %>% 
# defining the presence of the {DOI-author} marker in databases
    mutate(status = case_when(
      sc==T & cr==T & ls==T ~ "3.all",
      sc==T & cr==T & ls==F ~ "2.sc.cr",
      sc==T & cr==F & ls==T ~ "2.sc.ls",
      sc==F & cr==T & ls==T ~ "2.cr.ls",
      sc==F & cr==F & ls==T ~ "1.ls",
      sc==T & cr==F & ls==F ~ "1.sc",
      sc==F & cr==T & ls==F ~ "1.cr")
      ) %>% 
  select(ISSN, src.title, DOI, status) %>% unique() %>% 
  # grouping around DOI
  group_by(src.title, DOI) %>% 
  arrange(desc(status)) %>% 
  summarize(formula = paste0(status, collapse=" : ")) %>% ungroup() %>% 
  # counting DOIs with different formulas across the journals 
  count(src.title, formula) %>% 
  # there are 27 combinations, so we simplify it more
  # (fixed=TRUE makes "." a literal dot, conditions are checked top to bottom)
  mutate(type=case_when(
    formula=="3.all" ~ "A",
    grepl("3.all", formula, fixed=TRUE) ~ "B",
    grepl("2.", formula, fixed=TRUE) ~ "C",
    TRUE ~ "D")) %>% 
  # labels
  mutate(label=ifelse(nchar(src.title)<64,
                      str_wrap(src.title,25),
                      str_wrap(paste0(substr(src.title, 1,64),"..."),25))) %>% 
  # factors
  mutate(type=as.factor(type), type=factor(type, levels=c("D", "C", "B", "A"))) %>% 
  # counting DOI with new models  for journals
  group_by(src.title, label, type) %>% 
  summarize(n=sum(n)) %>% ungroup()

## drawing the plot
ggplot(doi_int2, aes(fill = type, values = n)) +
  geom_waffle(color = "white", size = .25, n_rows = 25, flip = TRUE) +
  facet_wrap(~ label, ncol = 4, strip.position="bottom") +
  scale_x_discrete() + 
  scale_y_continuous(labels = function(x) x * 25, # make this multiplier the same as n_rows
                     expand = c(0,0)) +
  scale_fill_manual(name=NULL,
                    limits=c("D", "C", "B", "A"), 
                    values=c("grey50", '#ff7f00', '#984ea3', '#4daf4a'),
                    labels=c("no names\nmatch", "names match, but none is\npresent in 3 DBs", 
                             "some author names are\npresent in 3 DBs", "all author names are\npresent in 3 DBs"))+
  labs(title = "PRESENCE OF UNIQUE {DOI-AUTHOR} PAIRS ACROSS LENS, SCOPUS, CROSSREF",
       subtitle = "AUTHOR NAME = {Family name + Initials}",
       x = "2600 DOIs, 1 SQUARE = 1 DOI", y = "PUBLICATIONS (COUNT)", caption="Accessed: June 29, 2019") +
  guides(fill = guide_legend())+
  coord_equal()+
  mytheme+
  theme(legend.position = "bottom",
        legend.justification = c("center", "top"),
        legend.text = element_text(size=rel(1), hjust=0),
        strip.text = element_text(size=rel(0.9), hjust=0),
        strip.background = element_rect(fill="#fafae1", color=NA),
        panel.grid = element_blank(), 
        axis.line.y = element_line(),
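        axis.ticks.y = element_line())

# ggsave() is called separately (adding it with `+` fails in newer ggplot2 versions);
# by default it saves the last plot
ggsave(paste0(dir, "/mtching_authors_model1.png"), 
       width=20, height = 27, units="cm", dpi=300)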

One square accounts for one publication, which we separated into 4 groups based on the presence of the article’s {DOI-Family Name-Initials} combinations across all three databases.

  • one group (green) comprises the publications that have identical author names in the Lens, Scopus, and CrossRef records.

  • the group in violet corresponds to the articles that have both authors whose names are identical across the three databases and authors whose names do not match perfectly. For instance, an initial may be missing for one of the authors in one or several databases. This group comprises most of the error cases we described above.

  • the “orange” group mainly comprises the publications whose author names are in Cyrillic in one database and in Latin script in the others. For example, the article may have names in Cyrillic in the Lens and CrossRef, which are likely to match each other, but they will never match the names in Scopus, which are written in Latin script. So the orange color is not worse than violet (regardless of the order of color perception); these differences are easy to explain.

  • the grey squares, which are just a few, refer to the publications whose author names are recorded so differently that there is no match between any two databases.
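
The way a publication falls into one of the four groups can be illustrated on hand-written status strings. This is a minimal sketch: the function classify() is a hypothetical wrapper around the same rules as in the code above, and the four strings are made up.

```r
library(dplyr)

# Hypothetical wrapper around the grouping rules
classify <- function(formula) {
  case_when(
    formula == "3.all"                    ~ "A",  # green: every author matches in 3 DBs
    grepl("3.all", formula, fixed = TRUE) ~ "B",  # violet: some authors match in 3 DBs
    grepl("2.", formula, fixed = TRUE)    ~ "C",  # orange: matches in 2 DBs only
    TRUE                                  ~ "D"   # grey: no matches at all
  )
}

# Made-up per-publication formulas (statuses of all {DOI-author} keys of one DOI)
formulas <- c("3.all",
              "3.all : 2.sc.cr : 1.ls",
              "2.cr.ls : 1.sc",
              "1.sc : 1.ls : 1.cr")

types <- classify(formulas)
```

The second string is a typical “violet” case: most authors match everywhere, but one name has a variant shared by Scopus and CrossRef and another variant found only in the Lens.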

COMPARING MODELS 1,2,3

Instead of drawing such diagrams for model 2 and model 3, we will make another chart that shows how the ratio between green, violet, orange, and grey publications varies with a change of model.

Some journals, like COSMIC RESEARCH, PALEONTOLOGICAL JOURNAL, and VESTNIK ST. PETERSBURG UNIVERSITY: MATHEMATICS, have only “green” publications with all three models. For other journals, removing the initials and non-letter symbols increased the share of the matched {DOI-AUTHOR} strings. The journals with large shares of “violet” publications seem to have a high level of irregularities (“violet” means that the records of a publication contain author names that are present in all three databases, but also some other, irregular name variants that are missing in one or two databases).
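
Why a looser model increases the share of matches can be shown on one made-up author recorded slightly differently in each database. This is a sketch under assumptions: the names and the DOI are invented, and model 3 is keyed here by the DOI plus the cleaned family name to keep the three models comparable.

```r
# Hypothetical records of one author in the three databases
authors <- data.frame(
  db       = c("lens", "crossref", "scopus"),
  family   = c("IVANOV-PETROV", "IVANOV-PETROV", "IVANOVPETROV"),
  initials = c("A", "AV", "AV"),
  stringsAsFactors = FALSE
)
doi <- "10.0000/example"  # made-up DOI

keys <- list(
  model1 = paste0(doi, "_", authors$family, "_", authors$initials),
  model2 = paste0(doi, "_", authors$family),
  # assumption: DOI + family name without non-word characters
  model3 = paste0(doi, "_", gsub("\\W", "", authors$family))
)

# Number of distinct keys per model; 1 means the author matches in all 3 DBs
n_variants <- sapply(keys, function(k) length(unique(k)))
n_variants
```

With model 1 the three records produce three different keys, with model 2 two keys, and only model 3 collapses them into one. The price of the looser models is a lower specificity: different authors with the same family name in one publication become indistinguishable.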

CONCLUSIONS

  1. In our experiment Scopus outperformed the Lens and CrossRef as a source of affiliation information. For the 24 selected journals Scopus contained affiliation info in 95% of the documents, the Lens in 45%, and CrossRef in only 7%. The reported ratios vary by journal (see Test 1).

  2. Both the Lens and Scopus lose a small portion of the affiliations (less than 1%, see Test 2) due to errors made in the original publications or introduced during their own processing.

  3. Typical processing errors for the Lens records include:

  4. Typical processing errors in Scopus include:

  5. In the randomly selected set of 33 publications, only 24% of the records had complete and correct affiliations in both databases. Of those, only 25% were in Latin script in the Lens; the rest were in Cyrillic. The other records contained the errors mentioned above.

  6. The selected databases contain different formats of the author family names, given names, and initials, which complicates the use of the author names for matching or other analysis.

The analysis showed that so far the Lens and CrossRef can’t be regarded as comprehensive sources of affiliation information. These results call for additional efforts from the academic community to encourage the publishers to prepare richer metadata (and of proper quality) for the journal web sites, PDF files, and XMLs supplied to CrossRef and other databases.

ACKNOWLEDGEMENTS

I am grateful to the Lens & CrossRef teams for what they do. Many thanks to all the experts who care about sharing their experience and contribute to the community with free tutorials and kind advice. Love to the dearest R community.

CITATION

Lutay, A. (2019, July 6). Author and Affiliation Information in 24 Russian Journals in the Lens, Scopus, and CrossRef (Version 1). figshare. <https://doi.org/10.6084/m9.figshare.8786906>

CONTACTS

Twitter

Figshare

REFERENCES

Hadley Wickham (2017). tidyverse: Easily Install and Load the ‘Tidyverse’. R package version 1.2.1. https://CRAN.R-project.org/package=tidyverse

Hadley Wickham (2018). scales: Scale Functions for Visualization. R package version 1.0.0. https://CRAN.R-project.org/package=scales

Winston Chang (2014). extrafont: Tools for using fonts. R package version 0.17. https://CRAN.R-project.org/package=extrafont

Bob Rudis and Dave Gandy (2019). waffle: Create Waffle Chart Visualizations. R package version 1.0.0. https://github.com/hrbrmstr/waffle/tree/cran

Yihui Xie (2019). knitr: A General-Purpose Package for Dynamic Report Generation in R. R package version 1.23.

Yihui Xie, Joe Cheng and Xianying Tan (2019). DT: A Wrapper of the JavaScript Library ‘DataTables’. R package version 0.7. https://CRAN.R-project.org/package=DT