Coverage of 24 Russian Journals in Scopus, CrossRef, and the Lens

INTRO

Every time I read a press release from the Lens, a search engine for patents and scholarly publications, I find myself captured and fascinated by the speed of their development. Among the well-known commercial solutions, the Lens is Hermes, the Greek god usually drawn with small wings on his boots and helmet. After the latest release (June 27), which offered a free API and another pack of visually catchy tools for analysis, I found myself sending my friends messages like “Watch out. A game changer”. With its comprehensive contents, analytical tools, lack of a paywall, API, and generous export options, the Lens has everything required for doing what many of us expected the senior Olympus inhabitants (GS, MAG, Beta) to have already done, i.e. coming and smoking out all those exclusive providers of evidence with their moth-eaten “gold standards”. Better still if in the manner demonstrated by the incredible Olexandr Usyk, who in 2018 became the absolute cruiserweight champion in boxing.

While it may seem that the academic community is on the right track, signing the declarations and choiring DORA, dona nobis pacem, our ability to change and progress depends on the availability of “good” data. Good stands for “comprehensive and reliable”. Without access to such data, the yoke of the primitive (one-click, one-decision) assessment techniques is only going to become heavier: “All candidates should clearly state the (THE, QS, or FT business school) ranking of the university of their highest degree” (source).

With these thoughts in mind, I decided to spend part of my weekend on an exercise designed to answer the question “Has the day really come?”, i.e. whether the quality of the Lens data makes it a substitute for the commercial data sources and, if so, for which specific tasks.

As I am currently fortunate to be free of academic necessities, the results will not be submitted anywhere (this is not science, just a technical report). I will publish them here (at this moment, writing “here”, I have no idea where that will be; a really great word for the www). Most likely this study will carry me away, so I will have to publish it in parts. The other parts will then be on RPubs or on Figshare.

MODEL

Three sources to compare (CrossRef, Lens, Scopus).
24 journals of Russian origin, selected by the following criteria: indexed in Scopus from 2017 or earlier and having DOIs. The particular titles were picked more or less at random, to ensure the presence of different publishers and subject areas. No other factors influenced my choice. I declare no conflict of interest.
Contents published in 2018-2019.

Where are WoS, Microsoft Academic, Google Scholar and Dimensions? Well, the case of WoS is plain: I do not have access to it; the national deals are not for everyone. MAG’s data export is too sophisticated for a quick play. GS can hardly be viewed as a source of data in its current form. And I failed to find a search option for ISSNs in Dimensions (I have access only to the public version).

The table below is an extract from the Excel spreadsheet listing the titles of Russian origin indexed by Scopus, which is available on the Elsevier Russian web site. I am using ISSN.1 and ISSN.2 instead of the common “print / electronic” classifiers, since I did not check with ISSN.org whether the numbers are matched correctly. Scopus.sourceid is a Scopus identifier for the sources (journals, books, etc.) that is present in most Scopus-related spreadsheets and can also be used for searching with SRCID().

DATA COLLECTION

Scopus data: although there is the {rscopus} package by John Muschelli (2018). rscopus: Scopus Database ‘API’ Interface. R package version 0.6.3 CRAN, I decided not to spend my monthly quota on this study, and exported (CSV) the results of the following query: ISSN (20758251 OR 23084057 OR 09599436 OR 00970549 OR 2311911x OR 00172278 OR 10634541 OR 00167029 OR 1026051x OR 00445134 OR 19967756 OR 00241148 OR 23093994 OR 00310301 OR 01316095 OR 20720351 OR 01316397 OR 00360279 OR 10637869 OR 18138691 OR 2078502x OR 00109525 OR 10637745 OR 02023822 OR 00150541 OR 23109599 OR 1364551x OR 23136871 OR 15561968 OR 20718721 OR 24108731 OR 15556174 OR 20720378 OR 23134836 OR 14684829 OR 14684780 OR 18138705 OR 16083075 OR 1562689x OR 15738493) AND PUBYEAR > 2017 AND PUBYEAR < 2020

Lens data: same thing for the Lens, whose web UI allows exporting JSON or CSV for up to 20k records. The query was source.issn:(20758251 OR 23084057 OR 09599436 OR 00970549 OR 2311911x OR 00172278 OR 10634541 OR 00167029 OR 1026051x OR 00445134 OR 19967756 OR 00241148 OR 23093994 OR 00310301 OR 01316095 OR 20720351 OR 01316397 OR 00360279 OR 10637869 OR 18138691 OR 2078502x OR 00109525 OR 10637745 OR 02023822 OR 00150541 OR 23109599 OR 1364551x OR 23136871 OR 15561968 OR 20718721 OR 24108731 OR 15556174 OR 20720378 OR 23134836 OR 14684829 OR 14684780 OR 18138705 OR 16083075 OR 1562689x OR 15738493) with the filter Year published = (2018-2019). There is no R package for the Lens, but they kindly provide an R code example for API requests in the corresponding section (code sample).

CrossRef: data was exported as JSON in batches of 1000 records, using queries like: api.crossref.org/works?filter=issn:2075-8251, issn:2308-4057, issn:0959-9436, issn:0097-0549, issn:2311-911X, issn:0017-2278, issn:1063-4541, issn:0016-7029, issn:1026-051X, issn:0044-5134, issn:1996-7756, issn:0024-1148, issn:2309-3994, issn:0031-0301, issn:0131-6095, issn:2072-0351, issn:0131-6397, issn:0036-0279, issn:1063-7869, issn:1813-8691, issn:2078-502X, issn:0010-9525, issn:1063-7745, issn:0202-3822, issn:0015-0541, issn:2310-9599, issn:1364-551X, issn:2313-6871, issn:1556-1968, issn:2071-8721, issn:2410-8731, issn:1555-6174, issn:2072-0378, issn:2313-4836, issn:1468-4829, issn:1468-4780, issn:1813-8705, issn:1608-3075, issn:1562-689X, issn:1573-8493, from-pub-date:2018, until-pub-date:2019&rows=1000&offset=0 (I added the white spaces after the commas to make it look better in rmarkdown). Again, there is also an extremely well-developed and useful package, {rcrossref} by Scott Chamberlain, Hao Zhu, Najko Jahn, Carl Boettiger and Karthik Ram (2019). rcrossref: Client for Various ‘CrossRef’ ‘APIs’. R package version 0.9.2. CRAN, but I was feeling a bit lazy.
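
The batching can be sketched in a few lines of base R. This is only an illustration, not the way the files for this report were actually fetched: the helper `crossref_urls()` is hypothetical, only three of the ISSNs are used, and the total of 2500 records is an assumed number (in practice it would be read from the first response).

```r
# Hypothetical helper: builds one CrossRef /works URL per batch of `rows` records.
crossref_urls <- function(issns, from, until, total, rows = 1000) {
  # one "issn:" filter per journal, plus the publication date limits
  flt <- paste(c(paste0("issn:", issns),
                 paste0("from-pub-date:", from),
                 paste0("until-pub-date:", until)),
               collapse = ",")
  # offsets 0, 1000, 2000, ... until the assumed total is covered
  offsets <- as.integer(seq(0, max(total - 1, 0), by = rows))
  paste0("https://api.crossref.org/works?filter=", flt,
         "&rows=", as.integer(rows), "&offset=", offsets)
}

urls <- crossref_urls(c("2075-8251", "2308-4057", "0959-9436"),
                      from = 2018, until = 2019, total = 2500)
print(urls)
```

Each URL can then be saved as a separate JSON file; no white spaces are needed in the real filter string.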

Just to make it clear, this report is not about the APIs, but about the underlying data and its quality. And I know no better quote for this topic than this one: “Let’s talk of graves, of worms, and epitaphs…”.

SCOPUS DATA PROCESSING

If you have ever downloaded a Scopus CSV with funding information, you know this bug: the data from the “Funding Text” column is read in an unpredictable manner and is split into 2, 3, 4 or more columns, which can be very annoying if you bind many CSVs. I merged all those fields into one with “unite”.

library(readr)   # read_csv
library(dplyr)   # %>%
library(tidyr)   # unite

ff <- list.files(paste0(dir, "/data/scopus/"))
ff <- ff[grepl("scopus", ff)]
# creating an empty dataframe
sc.data <- data.frame()
# seq_along() is safe even if no files were found
for (i in seq_along(ff)) {
  data <- read_csv(paste0(dir, "/data/scopus/", ff[i]))
  nm <- names(data)
  # uniting the data in columns with the names containing Funding Text
  datax <- data %>% unite(col = "Fund_text", nm[grepl("Funding Text", nm)], sep = "_")
  sc.data <- rbind(sc.data, datax)
}
# the merged data is in sc.data
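
For those who want to see the effect without the Scopus files, here is a minimal base R sketch of the same merge on a toy table. The column names and values are invented; the point is only that the split pieces are pasted row by row with "_" and that missing pieces come out as the text "NA", which explains values like "NA_NA_NA_NA" in the Fund_text column.

```r
# Toy stand-in for a Scopus CSV whose "Funding Text" field was split on import
# (the column names and values below are made up for illustration).
toy <- data.frame(EID = c("a1", "a2"),
                  `Funding Text 1` = c("Grant 1", NA),
                  `Funding Text 2` = c("Grant 2", NA),
                  check.names = FALSE, stringsAsFactors = FALSE)

fund_cols <- grep("Funding Text", names(toy), value = TRUE)
# paste the split pieces row by row, separated by "_"
toy$Fund_text <- do.call(paste, c(toy[fund_cols], sep = "_"))
# drop the original pieces so every file ends up with the same columns
toy <- toy[, setdiff(names(toy), fund_cols)]
print(toy)
```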

Now we have a dataframe with Scopus records to be used further.

## Observations: 2,798
## Variables: 45
## $ Authors                         <chr> "Dokuka V.N., Gostev A.A., Kha...
## $ `Author(s) ID`                  <chr> "9637818200;6701778008;6603089...
## $ Title                           <chr> "Calculaton of voltages induce...
## $ Year                            <dbl> 2018, 2018, 2018, 2018, 2018, ...
## $ `Source title`                  <chr> "Problems of Atomic Science an...
## $ Volume                          <dbl> 41, 41, 41, 41, 41, 41, 41, 41...
## $ Issue                           <dbl> 3, 3, 3, 3, 3, 3, 3, 2, 2, 2, ...
## $ `Art. No.`                      <lgl> NA, NA, NA, NA, NA, NA, NA, NA...
## $ `Page start`                    <dbl> 93, 80, 57, 36, 48, 21, 5, 95,...
## $ `Page end`                      <dbl> 104, 92, 79, 47, 56, 35, 20, 1...
## $ `Page count`                    <lgl> NA, NA, NA, NA, NA, NA, NA, NA...
## $ `Cited by`                      <dbl> NA, NA, NA, NA, NA, NA, NA, NA...
## $ DOI                             <chr> "10.21517/0202-3822-2018-41-3-...
## $ Link                            <chr> "https://www.scopus.com/inward...
## $ Affiliations                    <chr> "State Research Center of Russ...
## $ `Authors with affiliations`     <chr> "Dokuka, V.N., State Research ...
## $ Abstract                        <chr> "The goal of the calculations ...
## $ `Author Keywords`               <chr> "Electromagnetic poloidal syst...
## $ `Index Keywords`                <chr> NA, NA, NA, NA, NA, NA, NA, NA...
## $ `Molecular Sequence Numbers`    <lgl> NA, NA, NA, NA, NA, NA, NA, NA...
## $ `Chemicals/CAS`                 <lgl> NA, NA, NA, NA, NA, NA, NA, NA...
## $ Tradenames                      <lgl> NA, NA, NA, NA, NA, NA, NA, NA...
## $ Manufacturers                   <lgl> NA, NA, NA, NA, NA, NA, NA, NA...
## $ `Funding Details`               <chr> NA, NA, NA, NA, NA, NA, NA, "R...
## $ Fund_text                       <chr> "NA_NA_NA_NA", "NA_NA_NA_NA", ...
## $ References                      <chr> "Coppi, B., Airoldi, A., Bomba...
## $ `Correspondence Address`        <chr> NA, NA, NA, NA, NA, "Markovski...
## $ Editors                         <lgl> NA, NA, NA, NA, NA, NA, NA, NA...
## $ Sponsors                        <lgl> NA, NA, NA, NA, NA, NA, NA, NA...
## $ Publisher                       <chr> "National Research Center Kurc...
## $ `Conference name`               <lgl> NA, NA, NA, NA, NA, NA, NA, NA...
## $ `Conference date`               <lgl> NA, NA, NA, NA, NA, NA, NA, NA...
## $ `Conference location`           <lgl> NA, NA, NA, NA, NA, NA, NA, NA...
## $ `Conference code`               <lgl> NA, NA, NA, NA, NA, NA, NA, NA...
## $ ISSN                            <chr> "02023822", "02023822", "02023...
## $ ISBN                            <lgl> NA, NA, NA, NA, NA, NA, NA, NA...
## $ CODEN                           <lgl> NA, NA, NA, NA, NA, NA, NA, NA...
## $ `PubMed ID`                     <lgl> NA, NA, NA, NA, NA, NA, NA, NA...
## $ `Language of Original Document` <chr> "Russian", "Russian", "Russian...
## $ `Abbreviated Source Title`      <chr> "Probl. At. Sci. Technol. Ser....
## $ `Document Type`                 <chr> "Article", "Article", "Article...
## $ `Publication Stage`             <chr> "Final", "Final", "Final", "Fi...
## $ `Access Type`                   <chr> "Open Access", "Open Access", ...
## $ Source                          <chr> "Scopus", "Scopus", "Scopus", ...
## $ EID                             <chr> "2-s2.0-85056665925", "2-s2.0-...

LENS DATA PROCESSING

The Lens JSON export files provide a bit more data than the CSV, but they are also deeply nested, so it took me reading a few dozen SO topics to figure out how to extract this or that piece. Below is only the part with the basic metadata; the code for the other parts (authors, funding, etc.) may appear further on in the corresponding paragraphs.

ls.data <- jsonlite::fromJSON(paste0(dir, "/data/lens/lens-export.json"), flatten=TRUE)

meta <- ls.data %>% 
  # selecting the columns with common metadata values 
  select(lens_id, publication_type, title, year_published, 
         source.issn, source.publisher, source.country, source.title_full, 
         volume, issue, start_page, end_page, external_ids) %>% 
  unnest(external_ids, .preserve = source.issn)

At this stage I found out that 26 out of 3241 records have several Microsoft Academic IDs (magid). For example, this article, named “On ‘Age-old Russian Backwardness’”, seems to have 3 copies in MAG (copy1, copy2, copy3), made from the PDF, the author’s university repository, and the journal’s web page. Apparently, this bit is about someone else’s backwardness. The Lens merged such duplicates, so we just need to be aware of this bug in case we decide to play with the magids.

# to deal with the records that have several magids, we group & merge them via semicolon
meta <- meta %>% 
  group_by(lens_id, type) %>% 
  mutate(value=paste0(unique(value), collapse="; ")) %>% 
  ungroup() %>% 
# getting rid of duplicates   
  unique() %>% 
# making the separate columns for DOI, pmid, magid 
  spread(type, value, fill=NA) %>%  
# unnesting the ISSNs and making separate columns for print/electronic 
  unnest(source.issn) %>% 
  spread(type, value, fill=NA)
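
The two steps (merge repeated IDs with a semicolon, then make one column per ID type) can be tried on a toy table in base R. This is a sketch under my own assumptions: the lens_id, DOI and magid values are invented, and `tapply()` stands in for the group/mutate/spread chain.

```r
# Toy long table of external IDs, one row per (record, id type, value);
# all values are invented placeholders.
ids <- data.frame(
  lens_id = c("rec-1", "rec-1", "rec-1", "rec-2", "rec-2"),
  type    = c("doi", "magid", "magid", "doi", "magid"),
  value   = c("10.1/a", "111", "222", "10.1/b", "333"),
  stringsAsFactors = FALSE)

# one cell per (record, id type); several values are merged via "; "
m <- tapply(ids$value, list(ids$lens_id, ids$type),
            function(v) paste(unique(v), collapse = "; "))

# back to a data frame with one column per id type
wide <- data.frame(lens_id = rownames(m), m,
                   stringsAsFactors = FALSE, row.names = NULL)
print(wide)
```

The record with two magids keeps both of them in one cell, so nothing is lost and the row count stays at one per record.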

Below is a dataframe with the basic metadata from the Lens records that we are going to use further.

## Observations: 3,430
## Variables: 17
## $ lens_id                   <chr> "000-101-321-426-866", "000-117-017-...
## $ publication_type          <chr> "journal article", "journal article"...
## $ title                     <chr> "Express-method determination of emu...
## $ year_published            <dbl> 2018, 2018, 2018, 2018, 2018, 2018, ...
## $ source.publisher          <chr> "Ore and Metals Publishing House", "...
## $ source.country            <chr> "Russian Federation", "Russian Feder...
## $ source.title_full         <chr> "Gornyi Zhurnal", "Sel'skokhozyaistv...
## $ volume                    <dbl> NA, 53, 6, 52, 3, NA, 21, 57, NA, 28...
## $ issue                     <chr> NA, "5", "2", "2", NA, "5", "6", "1"...
## $ start_page                <chr> "82", "927", "385", "168", "101", "3...
## $ end_page                  <dbl> 87, 937, 402, 174, 112, 388, 479, 55...
## $ scholarly_citations_count <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ doi                       <chr> "10.17580/gzh.2018.11.15", "10.15389...
## $ magid                     <chr> "2907898250", "2900599548", "2810106...
## $ pmid                      <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ electronic                <chr> NA, "23134836", "23136871", "1555617...
## $ print                     <chr> "00172278", "01316397", "2311911x", ...

CROSSREF DATA PROCESSING

CrossRef JSONs were also unnested and merged into one dataframe.

ff<-list.files(paste0(dir,"/data/crossref/"))
ff<-ff[grepl("works",ff)]
cr.data <- data.frame()
for (i in 1:length(ff)){
  cr.json <- jsonlite::fromJSON(paste0(dir, "/data/crossref/",ff[i]), flatten=TRUE)
    # the data is in the items section
  cr.meta <- cr.json$message$items %>%
    select(DOI, type, title, "title.orig"=`original-title`, 
           "pubdate1"=`published-print.date-parts`, "pubdate2"=`published-online.date-parts`,
           "issns"=`issn-type`, publisher, "source.title"=`container-title`, 
           volume, issue, page, "alt.ids"=`alternative-id`, "cites"=`is-referenced-by-count`) %>%
    as_tibble %>% 
    # dealing with NULL values in the lists 
    purrr::modify(~replace(.x,lengths(.x)==0,list(NA))) %>% 
    modify_if(~all(lengths(.x)==1),unlist) %>%
    # extracting the years
    mutate(pubdate1 = sapply(pubdate1, function(x) ifelse(is.null(x), NA, unlist(x[[1]][1]))),
           pubdate2 = sapply(pubdate2, function(x) ifelse(is.null(x), NA, unlist(x[[1]][1])))) %>%
    # pub.year is the earliest date from print.pub.date and online.pub.date 
    mutate(pubdate= pmin(pubdate1, pubdate2, na.rm = T)) %>% 
    # unnesting the ISSNs and making separate columns for print/electronic 
    unnest(issns) %>% 
    # removing "-"
    mutate(value=gsub("-","", value)) %>% 
    spread(type1, value, fill=NA) 
  
  cr.data <- rbind(cr.data, cr.meta)
  # the merged data is in cr.data 
}

This is a dataframe with the basic metadata from CrossRef records that we are going to use further.

read_csv(paste0(dir, "/data/crossref/cr.meta.csv")) %>% glimpse()

## Observations: 3,550
## Variables: 16
## $ DOI          <chr> "10.32607/2075-8251-2017-9-4-84-91", "10.32607/20...
## $ type         <chr> "journal-article", "journal-article", "journal-ar...
## $ title        <chr> "Recombinant Antibodies to the Ebola Virus Glycop...
## $ title.orig   <chr> "Рекомбинантные антитела к гликопротеину вируса Э...
## $ pubdate1     <dbl> 2018, 2018, NA, NA, 2018, 2018, 2019, 2019, 2019,...
## $ pubdate2     <dbl> NA, NA, 2019, 2019, 2018, 2018, 2018, 2018, 2019,...
## $ publisher    <chr> "Acta Naturae Ltd", "Acta Naturae Ltd", "Uspekhi ...
## $ source.title <chr> "Acta Naturae", "Acta Naturae", "Physics-Uspekhi"...
## $ volume       <dbl> 9, 9, NA, NA, 49, 49, 62, 62, 62, 62, 62, 62, 62,...
## $ issue        <chr> "4", "4", NA, NA, "6", "6", "05", "05", "05", "05...
## $ page         <chr> "84-91", "84-91", NA, NA, "372-381", "425-427", N...
## $ alt.ids      <chr> NA, NA, "UFNe.2019.04.038549", "UFNe.2019.04.0385...
## $ cites        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 6, 0, 0...
## $ pubdate      <dbl> 2018, 2018, 2019, 2019, 2018, 2018, 2018, 2018, 2...
## $ electronic   <chr> NA, NA, "14684780", "14684780", "15738493", "1573...
## $ print        <chr> "20758251", "20758251", "10637869", "10637869", "...

TEST 1. ISSNs & UNIQUE JOURNAL TITLES

First of all, we need to clean up the identifiers for journals & articles.

DOIs are to be converted to lower case and trimmed of white space.
ISSNs will be uppercased and also trimmed.
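
As a quick illustration of these two rules, here are two hypothetical helpers (`norm_doi()` and `norm_issn()` are my names, not part of any package). The ISSN helper also drops the hyphen, as was done for the CrossRef data above.

```r
# Hypothetical helpers mirroring the two rules above.
norm_doi  <- function(x) trimws(tolower(x))
# ISSNs: upper case, trimmed, and without the hyphen
norm_issn <- function(x) gsub("-", "", trimws(toupper(x)), fixed = TRUE)

doi_clean  <- norm_doi(" 10.3367/UFNe.2017.04.038147 ")
issn_clean <- norm_issn(c("2311-911x", " 1364-551X"))
print(doi_clean)
print(issn_clean)
```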

cr.meta <- read_csv(paste0(dir, "/data/crossref/cr.meta.csv")) %>%
  select(DOI, title, "src.title"=source.title, type, publisher, 
         volume, issue, page, "year"=pubdate, cites, print, electronic) %>%
  filter(type!="journal-issue"&type!="journal issue") %>% 
  mutate(DOI=trimws(tolower(DOI)),
         print=toupper(print), 
         electronic=toupper(electronic),
         src.title=trimws(toupper(src.title))) %>% 
  separate(page, sep="-", into=c("page1", "page2")) %>% unique()

ls.meta <- read_csv(paste0(dir, "/data/lens/ls.meta.csv")) %>%
  select(lens_id, DOI="doi", title, src.title="source.title_full",
         "year"=year_published, "type"=publication_type, "publisher"=source.publisher,
         "cites"=scholarly_citations_count, volume, issue, 
         "page1"=start_page, "page2"=end_page, print, electronic) %>%
  filter(type!="journal-issue"&type!="journal issue") %>% 
  mutate(DOI=trimws(tolower(DOI)),
         print=toupper(print), 
         electronic=toupper(electronic),
         src.title=trimws(toupper(src.title))) %>% unique()

sc.meta <- read_csv(paste0(dir, "/data/scopus/sc.data.csv"), 
                    col_types = cols(ISSN = col_character())) %>% 
  select(DOI, "title"=Title, "year"=Year, "volume"=Volume, "issue"=Issue, 
         "page1"=`Page start`, "page2"=`Page end`,"cites"=`Cited by`, 
         "publisher"=Publisher, ISSN, "pmid"=`PubMed ID`, 
         "type"=`Document Type`, "src.title"=`Source title`) %>% 
  mutate(DOI=trimws(tolower(DOI)), 
         ISSN=toupper(ISSN), 
         src.title=trimws(toupper(src.title)))  %>% unique() 

cat(paste0("CrossRef data contains ", n_distinct(cr.meta$src.title), " unique titles\n",
           "Lens data contains ", n_distinct(ls.meta$src.title), " unique titles\n",
           "Scopus data contains ", n_distinct(sc.meta$src.title), " unique titles\n"))

## CrossRef data contains 24 unique titles
## Lens data contains 26 unique titles
## Scopus data contains 24 unique titles

The Lens data contains 26 unique titles because two journals have 2 versions of their titles:

ls.meta %>% select(print, electronic, src.title) %>% unique() %>% 
  add_count(print, electronic) %>% filter(n>1) %>% arrange(print) %>% select(-n)

Not a big deal: as the databases may use different titles, we are going to normalize them anyway and link them to unique ISSNs. We can either link print/electronic to ISSN-L (but this requires some additional work) or use the ISSNs provided by Scopus (since the Scopus CSV contains only one number). For this study let’s again choose the easiest way and link to the ISSNs provided by Scopus.

cr.meta<-cr.meta %>% 
  mutate(ISSN=ifelse(print %in% sc.meta$ISSN, print,
                   ifelse(electronic %in% sc.meta$ISSN, electronic, NA)))

ls.meta<-ls.meta %>% 
  mutate(ISSN=ifelse(print %in% sc.meta$ISSN, print,
                     ifelse(electronic %in% sc.meta$ISSN, electronic, NA)))
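
To make the linking rule tangible, here is the same nested `ifelse()` on a toy table. The first two ISSN pairs appear in the data above; the third row ("99999999") is invented to show what happens when neither number is known to Scopus.

```r
# Toy data: Scopus lists one ISSN per journal; the other source has print/electronic pairs.
scopus_issn <- c("00172278", "23134836")
other <- data.frame(print      = c("00172278", "01316397", "99999999"),
                    electronic = c(NA,         "23134836", NA),
                    stringsAsFactors = FALSE)

# prefer the print ISSN when Scopus knows it, else the electronic one, else NA
other$ISSN <- ifelse(other$print %in% scopus_issn, other$print,
              ifelse(other$electronic %in% scopus_issn, other$electronic, NA))
print(other)
```

Rows that end up with NA in ISSN would silently drop out of any later join by ISSN, so it is worth counting them.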

sc.meta %>% 
  select(ISSN, "Scopus.Source.Title"=src.title) %>% unique() %>% 
  left_join(.,cr.meta %>% 
              select(ISSN, "CrossRef.Source.Title"=src.title) %>% unique()) %>% 
  left_join(.,ls.meta %>% 
              select(ISSN, "Lens.Source.Title"=src.title) %>% unique()) %>% 
  datatable(escape=FALSE,rownames = FALSE, 
          options = list(pageLength = 10,
                         lengthMenu = c(10, 25),
                         autoWidth = TRUE,
                         columnDefs = list(list(width = '300px',  targets =c(1:3)))))

The Lens names are almost identical to those from CrossRef, except for a few journals. VESTNIK ST. PETERSBURG UNIVERSITY: MATHEMATICS has both name variants in the Lens (as in CrossRef and as in Scopus). This is what happens when the data are aggregated from multiple sources.

Another unimportant observation is that the publication records in the Lens JSON contained more ISSNs than those in CrossRef. For an unknown reason the CrossRef JSON lacked 1364551X, an electronic ISSN for “Mendeleev Communications”, while the CrossRef record for the journal itself contains this ISSN.

Since both CrossRef and the Lens have some source titles in Cyrillic, let’s agree for convenience to use the journal titles in Latin script, as provided by Scopus.

TEST 2 - UNIQUE DOIs OR UNIQUE TITLES

Now we are ready to match the records from the different sources for each specific journal and check the coverage. The presence of publications can differ, as each database has its own way of collecting the data and its own rules for updates. CrossRef requires the publishers to register the DOIs, which become available instantly. Scopus grabs the data from the journal web site or gets XML files from the publisher. Although Scopus proudly claims to update the records on a daily basis, there must be a delay caused by processing. The Lens gets the data from CrossRef & Microsoft Academic Graph and processes the records, so there must also be some gap. The CrossRef and Lens JSONs provide the dates when the records were added to the database, so anyone can check their velocity.

# preparing a list of unique DOIs
dois <- c(cr.meta$DOI, sc.meta$DOI, ls.meta$DOI) %>% na.omit() %>% unique()

# using an UpSet diagram to show the intersections
# (fromList and intersects below come from UpSetR, so the package is attached)
library(UpSetR)
dd <- list(Scopus=which(dois %in% sc.meta$DOI),
           CrossRef=which(dois %in% cr.meta$DOI),
           Lens=which(dois %in% ls.meta$DOI))

UpSetR::upset(fromList(dd), main.bar.color = "grey30", sets.bar.color = "gray30", 
      matrix.color = "grey30", empty.intersections = "on",
      order.by = "freq", number.angles = 0, point.size = 4.5, line.size = 1.2,
      mainbar.y.label = "PUBLICATIONS", sets.x.label = "", 
      mb.ratio = c(0.65, 0.35),
      text.scale = c(1.5, 1.3, 1.5, 1.3, 1.5, 2),
      queries = list(
        list(query = intersects, params = list("Scopus","CrossRef", "Lens"), color = "grey70", active = T), 
        list(query = intersects, params = list("Scopus", "CrossRef"), color = "#56B4E9", active = T), 
        list(query = intersects, params = list("Scopus", "Lens"), color = "#009E73", active = T), 
        list(query = intersects, params = list("Lens", "CrossRef"), color = "#F0E442", active = T),
        list(query = intersects, params = list("Scopus"), color = "#D55E00", active = T),
        list(query = intersects, params = list("CrossRef"), color = "#0072B2", active = T),
        list(query = intersects,params = list("Lens"), color="#CC79A7", active = T)))

2610 DOIs are present in all three databases, 727 are absent in Scopus, 141 are in CrossRef only, 39 are absent in the Lens, and… 74 are present in Scopus only. What? Smells like a bug.
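
The same kind of counts can be obtained without a plot. Below is a minimal base R sketch on toy DOI sets (all values invented): every DOI is labelled with the exact combination of databases that contain it, and the labels are tabulated.

```r
# Toy DOI sets (invented values) standing in for the three datasets.
sets <- list(Scopus   = c("d1", "d2", "d3", "d7"),
             CrossRef = c("d1", "d2", "d4", "d5"),
             Lens     = c("d1", "d2", "d4", "d6"))

all_dois <- unique(unlist(sets))
# logical membership matrix: one row per DOI, one column per database
member <- sapply(sets, function(s) all_dois %in% s)
rownames(member) <- all_dois

# label each DOI with the exact combination of databases that contain it
combo <- apply(member, 1, function(r) paste(names(sets)[r], collapse = " & "))
tab <- table(combo)
print(tab)
```

A label such as "Scopus" here means "in Scopus only", which is the exclusive intersection that the diagram shows as a single dot.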

DOIs UNIQUE FOR SCOPUS

Speakers at “How to publish in prominent journals” seminars can find plenty of inspiration and material in those 74 unique DOIs.

Some publications are actually present in both CrossRef and Scopus, but one of the databases has the wrong DOI (guess which?). For instance, the publication “Modern diagnostics for investigation of lithium element behavior in tokamaks” from the journal “PROBLEMS OF ATOMIC SCIENCE AND TECHNOLOGY, SERIES THERMONUCLEAR FUSION” has the DOI 10.21517/0202-3822-2018-41-1-35-40 in CrossRef and 10.21517/0202-3822-2017-41-1-35-40 in Scopus. 2017 vs. 2018. My initial guess was that the PDF contains the DOI with 2017, so Scopus took this error from the file. But the PDF file is OK. The web site and the TOC page do not contain any metadata for the articles. The source of the error remains unknown, but all publications of volume 41 have wrong numbers. Apparently, either the editorial office made an error in the XML (the web site suggests that there is a lot of manual work in their editorial workflow), or the error was made on the Scopus side. Whatever the cause, this case raises a question: how come Scopus does not check the validity of the DOIs it obtains before putting them into the database?

Another example of this kind is the publication “Structure of the mantle and tectonic zoning of the central Alpine-Himalayan Belt” from the journal “GEODYNAMICS AND TECTONOPHYSICS”. The DOI 10.5800/gt-2018-9-4-0386 is from Scopus, and this one is from CrossRef: 10.5800/gt-2018-9-4-0387. Guess which one is valid? This time, even though the journal is published with Open Journal Systems (specifically designed to automate the editorial workflow), the PDF bears a wrong DOI number. Oh, that backwardness.

And the last example, the publication “Current state of high-accuracy laser ranging” from the journal “PHYSICS-USPEKHI”. CrossRef DOI: 10.3367/ufne.2017.04.038147, Scopus DOI: 10.3367/ufnr.2017.04.038147. Both are registered in CrossRef, but the DOI with “ufnr” refers to the Russian version of the journal (“Uspekhi Fizicheskikh Nauk”), published by the Russian editorial board, while the DOI with “ufne” is linked to the translated version (English edition), which is published by Turpion Ltd. and distributed by IOP. One may say “So what’s the problem? The CrossRef record for the ufnr-DOI should contain both the English title and the Russian original title”, and I did check, it contains both. The problem is that querying two databases with one set of ISSNs brings the researcher to different sets of DOIs, requiring additional effort for de-duplication and matching. I hope they have better accuracy with the lasers.

Although we observed that the Lens combines magids, for this publication the Lens has only one record, with one DOI link and no magid. I found the publication in MAG (2811484448), also linked to the ufne-DOI, which helped me to find another record in the Lens (link), lacking a DOI and missing from the search results (which suggests that the record does not have an ISSN either). Wow!

I used {stringdist} package to check if those 74 publication titles with Scopus-unique DOIs have their clones (with identical or close title) in the CrossRef dataset.

require(stringdist)

uniquescopus <- sc.meta %>%  filter(!is.na(DOI)) %>%  filter(!DOI %in% cr.meta$DOI) %>% mutate(sim=0)

for (i in 1:NROW(uniquescopus)){
  # finding a position of the closest title in the CrossRef dataset
  # (stringdist defaults to the OSA distance, a close relative of Levenshtein)
  k<-which.min(stringdist(tolower(uniquescopus$title[i]), tolower(cr.meta$title)))
  # adding a similarity score (1 means the records are "identical")
  uniquescopus$sim[i]<-stringsim(tolower(uniquescopus$title[i]), tolower(cr.meta$title[k]))
  print(i)
}

# the titles that have a score >0.95 can be counted as the same articles
uniquescopus %>% 
  mutate(similar.title=ifelse(sim>0.95, "EXISTS ", "NOT FOUND")) %>% 
  write_excel_csv(paste0(dir, "uniquescopus.csv"))
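
If {stringdist} is not at hand, the idea can be sketched in base R with `utils::adist()`, which computes a Levenshtein-type distance. This is not the code used for the report: `title_sim()` is a hypothetical helper, and the first candidate title is a deliberately altered copy (hyphen replaced by a space) of a title mentioned above.

```r
# Normalised edit-distance similarity in base R: 1 means identical strings.
title_sim <- function(a, b) {
  a <- tolower(a); b <- tolower(b)
  1 - drop(utils::adist(a, b)) / pmax(nchar(a), nchar(b))
}

# best match for one Scopus title among candidate CrossRef titles (toy examples)
scopus_title <- "Current state of high-accuracy laser ranging"
crossref_titles <- c("Current state of high accuracy laser ranging",
                     "Structure of the mantle and tectonic zoning")
sims <- title_sim(scopus_title, crossref_titles)
best <- which.max(sims)
print(round(sims, 3))
```

One changed character in a 44-character title still scores above the 0.95 threshold, while an unrelated title falls far below it.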

Now let’s see what we have:

read_csv(paste0(dir, "uniquescopus.csv")) %>% 
  count(src.title, year, type, similar.title) %>% 
  spread(year, n, fill="") %>% 
  setNames(c("JOURNAL","DOCTYPE","VERSION IN CROSSREF","PUBYEAR=2018", "PUBYEAR=2019")) %>% 
  datatable(escape=FALSE,rownames = FALSE, 
          options = list(pageLength = 15,
                         lengthMenu = c(5,15, 25),
                         autoWidth = TRUE,
                         columnDefs = list(list(width = '70px',  targets =c(3:4))))) %>% 
  formatStyle(3:5, `text-align` = "center") %>%
  formatStyle(1:2, `text-align` = "left")

For some journals like PHYSICS-USPEKHI or PROBLEMS OF ATOMIC SCIENCE AND TECHNOLOGY, SERIES THERMONUCLEAR FUSION, the articles are present in Scopus, but their DOIs are not found in CrossRef, and as those DOIs refer to 2018 publications, we can suggest that Scopus holds incorrect DOIs.

In the case of VESTNIK ST. PETERSBURG UNIVERSITY: MATHEMATICS, the DOIs for the publications in volume 52, issue 2 (2019) are still not registered in CrossRef, but are already present in Scopus. This needs an additional explanation. Some Russian journals are published by universities, which, due to red tape, have difficulty paying CrossRef in EUR for membership and DOIs. So the universities buy DOIs from third-party service providers together with content registration services. Instead of advancing their own skills and understanding, they have to rely on other people who were lucky to win the tender for services this year. This leads to situations when one group of people assigns DOIs and publishes the documents online, and other people need to remember to register them.

Or to worse accidents, as with NEUROSCIENCE AND BEHAVIORAL PHYSIOLOGY. The DOIs for the publications in volume 48, issue 1 are absent from CrossRef. I am witnessing it today, on Saturday, June 29, 2019. Whether it was caused by a change of provider in 2018 or by some budget troubles, the fact remains: the content was published, the DOIs were not registered, Scopus grabbed the content from the web site, and the unregistered DOIs are now in the database. Speed matters, right?

There are 7 publications from the journal SEL’SKOKHOZYAISTVENNAYA BIOLOGIYA that are present only in the Scopus search results. One day the publisher decided to assign 2 DOIs to each publication: one for the English version, 10.15389/agrobiology.2018.4.687eng, and the other for the Russian version, 10.15389/agrobiology.2018.4.687rus. For these publications CrossRef and the Lens have the DOIs of the Russian versions, Scopus only those of the English versions.

One “unique” DOI for DIABETES MELLITUS, 10.14341/dm9392, is a wrong DOI that was copied from another publication. The correct DOI, the one that is “missing”, is 10.14341/dm9687.

The “unique” DOI for ZOOLOGICHESKII ZHURNAL is also an error. DOI in Scopus is 10.1134/s00.445134181200.48, while the correct DOI (present in CrossRef and Lens) is 10.1134/s0044513418120048.

Apart from that, Scopus search results contained 13 duplicated records for the FIBRE JOURNAL (issue 2, volume 50, published July 2018), present both as Article and as Article in Press.

sc.meta %>% add_count(DOI) %>% select(DOI, src.title, type, n) %>% na.omit() %>%  filter(n>1) %>% 
  mutate(n=!is.na(n)) %>% unique() %>% arrange(DOI) %>% spread(type,n) %>% na.omit()

Summing up, what may look like a Scopus-unique DOI is the result of one of 3 causes:

technical errors (wrong DOIs)
unregistered DOIs (missed or delayed)
duplicated DOI records for Russian & English editions
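
The first kind of error can be hunted semi-automatically: for every Scopus-only DOI, look for the closest DOI in CrossRef by edit distance. Below is a minimal base R sketch using four of the DOI pairs quoted above; the `report` table and its column names are my own, and a small distance is only a hint for manual checking, not a proof.

```r
# Scopus-only DOIs and their CrossRef counterparts, as quoted in the text above
scopus_only <- c("10.21517/0202-3822-2017-41-1-35-40",
                 "10.5800/gt-2018-9-4-0386",
                 "10.3367/ufnr.2017.04.038147",
                 "10.1134/s00.445134181200.48")
crossref <- c("10.21517/0202-3822-2018-41-1-35-40",
              "10.5800/gt-2018-9-4-0387",
              "10.3367/ufne.2017.04.038147",
              "10.1134/s0044513418120048")

d <- utils::adist(scopus_only, crossref)   # edit distance matrix
nearest <- apply(d, 1, which.min)          # closest CrossRef DOI for each row
report <- data.frame(scopus = scopus_only,
                     closest_crossref = crossref[nearest],
                     edits = d[cbind(seq_along(nearest), nearest)],
                     stringsAsFactors = FALSE)
print(report)
```

All four pairs differ by one or two characters, which is exactly the pattern of a typo or of a parallel Russian/English DOI.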

DOIs UNIQUE FOR LENS

As the Lens obtains data from CrossRef, it does not have many Lens-unique publications. There is only one, 10.1134/s1063774517070100, and it’s a bug. The article was published in December 2017, but the Lens dates it to 2018. That’s it, the shortest paragraph, I think.

DOIs THAT SCOPUS IS MISSING

As we found out, Scopus can index DOIs before they are registered. Now let’s see which publications are missing from Scopus while present in the Lens and CrossRef. This should be a tough test. One may suggest that there must be many fresh publications that were registered in CrossRef but, for whatever reason, not yet indexed by Scopus. OK, then we will check whether they have already appeared in the Lens.

scmis<-cr.meta %>% filter(DOI %in% ls.meta$DOI) %>% filter(!DOI %in% sc.meta$DOI) %>% unique() %>% 
  select(-src.title) %>% left_join(sc.meta %>% select(src.title, ISSN) %>% unique() %>% na.omit())

scmis %>% count(src.title, ISSN, year) %>% spread(year, n, fill=0) %>% arrange(desc(`2018`)) %>% 
   datatable(escape=FALSE,rownames = FALSE, 
          options = list(pageLength = 10,
                         lengthMenu = c(5,10, 25),
                         autoWidth = TRUE,
                         columnDefs = list(list(width = '70px',  targets =c(2:3))))) %>% 
  formatStyle(2:4, `text-align` = "center") %>%
  formatStyle(1, `text-align` = "left")

Our suggestion was incorrect. Quite the contrary: Scopus is missing 463 publications of 2018 and only 264 of 2019.

Let’s review a few cases:

SEL’SKOKHOZYAISTVENNAYA BIOLOGIYA

As we already found out, this journal assigned 2 DOIs to some of its articles - one ending with “eng” and the other ending with “rus”. There are 316 DOIs in the CrossRef and Lens datasets, and only 142 in Scopus. To find out the real ratio, let’s strip the ending and take the duplicates into account.

First, CrossRef:

cr.x<- cr.meta %>% filter(ISSN=='01316397') %>% select(year, DOI) %>% 
  mutate(end = sapply(DOI, function(x) unlist(str_extract(x, pattern="[a-z]{3}$")))) %>% 
  mutate(DOI = sapply(DOI, function(x) unlist(str_replace(x, pattern="[a-z]{3}$", replacement = "")))) 

cr.x %>% add_count(DOI) %>% count(DOI, year, end, n) %>% 
  mutate(type=case_when(n==2 ~"2 versions",n==1 ~ end)) %>% 
  select(-end) %>% unique() %>% count(year, type) %>% 
  datatable(escape=FALSE,rownames = FALSE, options = list(dom = 't')) %>% 
  formatStyle(1:3, `text-align` = "center")

Aha, except for 6 publications in 2018, all the publications are present in CrossRef twice - as Russian and English editions. So the actual number of articles is not 316, but about half of that - 161. The data from the Lens are identical.
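
The counting logic can be reproduced on a toy example. This is only a sketch: the three DOIs below are used purely for illustration (the first two are the pair quoted earlier), and the pattern is restricted to the two known suffixes, which is a bit stricter than the `[a-z]{3}$` used in the chunk above.

```r
# Illustrative DOIs that follow the journal's "eng"/"rus" pattern
dois <- c("10.15389/agrobiology.2018.4.687eng",
          "10.15389/agrobiology.2018.4.687rus",
          "10.15389/agrobiology.2018.6.1142eng")

# Drop the language suffix: both versions of an article share the same stem
stem <- sub("(eng|rus)$", "", dois)
n_articles <- length(unique(stem))   # distinct articles behind the DOIs

# The same logic applied to the totals reported in the text:
# 6 single-version DOIs plus (316 - 6) DOIs that come in pairs
n_total <- 6 + (316 - 6) / 2
```

Here `n_articles` counts the distinct stems in the toy vector, and `n_total` reproduces the 161 articles mentioned above.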

Now, Scopus:

sc.x<- sc.meta %>% filter(ISSN=='01316397') %>% select(year, DOI) %>% 
  mutate(end = sapply(DOI, function(x) unlist(str_extract(x, pattern="[a-z]{3}$")))) %>% 
  mutate(DOI = sapply(DOI, function(x) unlist(str_replace(x, pattern="[a-z]{3}$", replacement = "")))) 

sc.x %>% add_count(DOI) %>% count(DOI, year, end, n) %>% 
  mutate(type=case_when(n==2 ~"2 versions",n==1 ~ end)) %>% 
  select(-end) %>% unique() %>% count(year, type) %>% 
  datatable(escape=FALSE,rownames = FALSE, options = list(dom = 't')) %>% 
  formatStyle(1:3, `text-align` = "center")

Well, the total number is a bit lower - 142, but Scopus data does not have the duplicated versions.

ZOOLOGICHESKII ZHURNAL

full_join(sc.meta %>% filter(ISSN=="00445134") %>% 
            select(DOI, year, volume, issue) %>% 
            group_by(year, issue) %>% summarize(Scopus=n_distinct(DOI, na.rm = T)) %>% 
            ungroup() %>% mutate(issue=as.numeric(issue)),
          cr.meta %>% filter(ISSN=="00445134") %>% select(DOI, year, volume, issue) %>% 
            group_by(year, issue) %>% summarize(CrossRef=n_distinct(DOI, na.rm = T)) %>% 
            ungroup() %>% mutate(issue=as.numeric(issue))) %>% 
  gather(3:4, key="Source", value="Publications") %>% 
  arrange(year, issue) -> xxx 

xxx %>% 
  mutate(labels=paste0(year,".",sprintf("%02d", issue))) %>% 
  ggplot(aes(x=labels, y=Publications))+
  geom_bar(aes(fill=Source), stat="identity", position="dodge2", width=0.6, color="grey20", size=0.3)+
  scale_y_continuous(expand = expand_scale(add=0))+
  scale_fill_manual(values=c("#0072B2","#D55E00"), name="SOURCE")+
  labs(title="NUMBER OF UNIQUE PUBLICATIONS IN SCOPUS vs. CROSSREF",
       subtitle="ISSN: 0044-5134, PUBYEAR: 2018-2019",
       caption="Accessed: June 29, 2019", y="PUBLICATIONS", x="YEAR.ISSUE")+
  mytheme+
  theme(axis.text.x=element_text(hjust=1, angle=20),
        panel.grid.major.y = element_line(size=0.3, color="grey80", linetype=3))

# ggsave() saves the last plot by default; chaining it with `+` is not a documented use
ggsave(paste0(dir, "/issn00445134_scopus_vs_crossref.png"), 
       width=20, height = 8, units="cm", dpi=300, bg="transparent")

This time it looks more like Scopus has ceased indexing this journal, but the Scopus web site shows the journal as actively indexed. In similar cases that I have heard about from publishers, the reason for a large delay was a change of the journal web site - the indexing robots were confused and it took time for Scopus to start indexing the journal again. So far Scopus lists just 1/3 of this journal’s 2018-2019 publications.

ACTA NATURAE

Once referred to as a Russian Nature, this journal offers another interesting case (a basket of errors) for our study.

DOI-based counting of the publications shows that Scopus has significantly fewer publications (above). The reason is trivial - the publisher started to assign DOIs in 2019, so the 2018 articles did not have DOIs when Scopus indexed them. CrossRef confirms this.

According to the title-based counting (bottom), the total number of unique titles is almost equal - 51 in Scopus, 52 in CrossRef - but the underlying picture is more complex. Scopus seems to lack the publications from volume 9; they may be indexed in Scopus as 2017 records.

There are also 9 CrossRef records with no volume and issue information (blue column named 2018.NA.NA), all of them are present among the Scopus search results.

title2 <- cr.meta %>% filter(ISSN=='20758251') %>% 
  select(DOI, title, year, volume, issue) %>% 
  add_count(title) %>% filter(is.na(volume)) %>% select(title)

rbind(
  cr.meta %>% filter(title %in% title2$title) %>% 
    mutate(source="Crossref") %>% select(source, DOI, title, volume, issue), 
  sc.meta %>% filter(tolower(title) %in% tolower(title2$title)) %>% 
    mutate(source="Scopus") %>% select(source, DOI, title, volume, issue)) %>% 
  arrange(title, source) %>% 
   datatable(escape=FALSE,rownames = FALSE, 
          options = list(pageLength = 5,
                         lengthMenu = c(5,10, 25),
                         autoWidth = TRUE,
                         columnDefs = list(list(width = '70px',  targets =c(3:4))))) %>% 
  formatStyle(4:5, `text-align` = "center") %>%
  formatStyle(1:3, `text-align` = "left")

Four records are duplicated in CrossRef with incorrect DOIs, which are also present in the Scopus dataset. One record (or maybe more) also has a duplicate with a Russian title, which turns the whole matching procedure into a nightmare. I wonder how on Earth this can be resolved by anyone who does not understand Russian.

Wanna try? Ok, this is how one publication is present in all three databases:

rbind(
  cr.meta %>% filter(grepl("Fusariu", title)) %>% 
    select(title, DOI, year, volume, issue, page1, page2) %>% mutate(source="Crossref"),
  sc.meta %>% filter(grepl("Fusariu", title)) %>% 
    select(title, DOI, year, volume, issue, page1, page2) %>% mutate(source="Scopus"),
  ls.meta %>% filter(grepl("Fusariu", title)) %>% 
    select(title, DOI, year, volume, issue, page1, page2) %>% mutate(source="Lens")) %>% 
  arrange(source) %>% 
   datatable(escape=FALSE,rownames = FALSE, 
          options = list(pageLength = 5,
                         lengthMenu = c(5,10, 25),
                         autoWidth = TRUE,
                         columnDefs = list(list(width = '50px',  targets =c(2:6)),
                                           list(width = '150px',  targets =c(1))))) %>% 
  formatStyle(3:7, `text-align` = "center") %>%
  formatStyle(1:2, `text-align` = "left")

It’s hard to say which potatoes are more accurate in this soup. Scopus records contain no duplicates, but contain incorrect (or no) DOIs. Enough of these “acta horrores”.

RUSSKAYA LITERATURA

## picture 1
sc.meta %>% filter(ISSN=="01316095")  %>% mutate(volume=coalesce(volume, issue)) %>% 
            select(DOI, year, volume, issue) %>% 
            group_by(year, volume) %>% summarize(Scopus=n_distinct(DOI, na.rm = T)) %>% 
            ungroup() %>% mutate(volume=as.numeric(volume)) %>% 
  full_join(ls.meta %>% filter(ISSN=="01316095")  %>% 
            select(DOI, year, volume, issue) %>% 
            group_by(year, volume) %>% summarize(Lens=n_distinct(DOI, na.rm = T)) %>% 
            ungroup() %>% mutate(volume=as.numeric(volume))) %>% 
  full_join(cr.meta %>% filter(ISSN=="01316095") %>% select(DOI, year, volume, issue) %>% 
            group_by(year, volume) %>% summarize(CrossRef=n_distinct(DOI, na.rm = T)) %>% 
            ungroup() %>% mutate(volume=as.numeric(volume))) %>% 
  gather(3:5, key="Source", value="Publications") %>% 
  arrange(year, volume) %>%  
  mutate(labels=paste0(year,".",sprintf("%02d", volume))) %>% 
  ggplot(aes(x=labels, y=Publications))+
  geom_bar(aes(fill=Source), stat="identity", position="dodge2", width=0.6, color="grey20", size=0.3)+
  scale_y_continuous(expand = expand_scale(add=0))+
  scale_fill_manual(values=c("#0072B2", "#CC79A7", "#D55E00"), name="SOURCE")+
  labs(title="NUMBER OF UNIQUE PUBLICATIONS IN CROSSREF, LENS & SCOPUS",
       subtitle="ISSN: 0131-6095, PUBYEAR: 2018-2019",
       caption="Accessed: June 29, 2019", y="PUBLICATIONS", x="YEAR.ISSUE")+
  mytheme+
  theme(panel.grid.major.y = element_line(size=0.3, color="grey80", linetype=3))

# ggsave() saves the last plot by default; chaining it with `+` is not a documented use
ggsave(paste0(dir, "/issn01316095_scopus_vs_crossref.png"), 
       width=20, height = 8, units="cm", dpi=300, bg="transparent")

## picture 2
cr.meta %>% filter(ISSN=='01316095') %>% 
  mutate(page1=as.numeric(page1), page2=as.numeric(page2)) %>% 
  mutate(labels=paste0(year,".",sprintf("%02d", volume))) %>%
  group_by(labels) %>% arrange(year, volume, page1, page2) %>% 
  mutate(yind=row_number()) %>% ungroup() %>%
  mutate(scp=case_when(DOI %in% sc.meta$DOI ~ "PRESENT IN BOTH",
                       !DOI %in% sc.meta$DOI ~ "MISSED IN SCOPUS")) %>%  
  ggplot()+
    geom_tile(aes(y=labels, x=yind, fill=scp),color="white", size=1)+
    coord_fixed()+
  scale_x_continuous(expand = expand_scale(add=0))+
  scale_fill_manual(values=c("grey80", "#0072B2"), name="SOURCE")+
  labs(title="JOURNAL PUBLICATIONS INDEXED BY CROSSREF & SCOPUS",
       subtitle="ISSN: 0131-6095, PUBYEAR: 2018-2019",
       caption="Accessed: June 29, 2019", y="YEAR.VOLUME", x="1 SQUARE = 1 PUBLICATION")+
  mytheme+
    theme(axis.text.x = element_blank(), axis.ticks.x = element_blank())

# ggsave() saves the last plot by default; chaining it with `+` is not a documented use
ggsave(paste0(dir, "/issn01316095_scopus_vs_crossref_grid.png"), 
       width=20, height = 8, units="cm", dpi=300, bg="transparent")

Could there be any explanation of why some articles are not indexed?

RUSSIAN JOURNAL OF FOREST SCIENCE

A small journal from Siberia, same problems with delayed indexation.

GORNYI ZHURNAL

Some issues are just absent. Not a few.

In addition, among the Scopus search results there are some articles without DOIs that are absent in CrossRef. Many of them do not look like research articles, so my personal guess is that the publisher did not assign DOIs on purpose. These publications are present in Scopus and some of them have the type “research article”. “Phenomenon of Belarusian Potash Company, or how to become a market leader” is hardly a research article (whoever was the sponsor).

sc.meta %>% filter(ISSN=='00172278') %>% filter(!DOI %in% cr.meta$DOI) %>% select(year, title, type, DOI) %>% 
  arrange(desc(title)) %>% 
   datatable(escape=FALSE,rownames = FALSE, 
          options = list(pageLength = 5,
                         lengthMenu = c(5,10, 25),
                         autoWidth = TRUE,
                         columnDefs = list(list(width = '500px',  targets =c(1)),
                                           list(width = '100px',  targets =c(2,3))))) %>% 
  formatStyle(2:4, `text-align` = "left") %>%
  formatStyle(1, `text-align` = "center")

UNIQUE PUBLICATIONS BY TITLE

As the reviewed cases demonstrated, matching the contents of the selected databases by means of DOIs is not perfect. The titles can be an even worse identifier though. We already observed that CrossRef & the Lens tend to store the original title (in our study this means “in Cyrillic”), while Scopus takes the English version of the title. So the discrepancy will be more striking.
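
To estimate how much of the mismatch comes from the language alone, one could flag the titles that contain Cyrillic letters before comparing anything. This is a sketch only: `has_cyrillic` is a hypothetical helper, and it assumes that the regex engine behind `perl = TRUE` was built with Unicode property support.

```r
# Hypothetical helper: TRUE if the title contains at least one Cyrillic letter.
# \p{Cyrillic} is a Unicode script class handled by the PCRE engine (perl = TRUE).
has_cyrillic <- function(title) grepl("\\p{Cyrillic}", title, perl = TRUE)

titles <- c(
  # "Первое описание" written with \u escapes, so that the example
  # does not depend on the encoding of the script file
  "\u041f\u0435\u0440\u0432\u043e\u0435 \u043e\u043f\u0438\u0441\u0430\u043d\u0438\u0435",
  "Novel Polyoxovanadate K2ZnV5O14"
)

# One logical value per title
has_cyrillic(titles)
```

The flagged records could then be counted per database or set aside, so that the title comparison is not dominated by Russian vs. English versions of the same title.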

The illustration below shows the variety of article titles in the selected databases (the titles were normalized by keeping only the letters, digits and white space, in lower case).

sc.meta$titlex <- sapply(
  str_extract_all(sc.meta$title, pattern="[[:alnum:]\\s]+", simplify = F),
  function(x) tolower(paste0(unlist(x), collapse="")))

ls.meta$titlex <- sapply(
  str_extract_all(ls.meta$title, pattern="[[:alnum:]\\s]+", simplify = F),
  function(x) tolower(paste0(unlist(x), collapse="")))

cr.meta$titlex <- sapply(
  str_extract_all(cr.meta$title, pattern="[[:alnum:]\\s]+", simplify = F),
  function(x) tolower(paste0(unlist(x), collapse= "")))

titles <- c(cr.meta$titlex, sc.meta$titlex, ls.meta$titlex) %>% unlist() %>% 
  na.omit() %>% unique()

# using Upset diagram to show the intersections
dd <- list(Scopus=which(titles %in% sc.meta$titlex),
           CrossRef=which(titles %in% cr.meta$titlex),
           Lens=which(titles %in% ls.meta$titlex))

UpSetR::upset(fromList(dd), main.bar.color = "grey30", sets.bar.color = "gray30", 
              matrix.color = "grey30", empty.intersections = "on",
              order.by = "freq", number.angles = 0, point.size = 4.5, line.size = 1.2,
              mainbar.y.label = "PUBLICATIONS", sets.x.label = "", 
              mb.ratio = c(0.65, 0.35),
              text.scale = c(1.5, 1.3, 1.5, 1.3, 1.5, 2),
              queries = 
    list(list(query = intersects, params = list("Scopus","CrossRef", "Lens"), color = "grey70", active = T), 
    list(query = intersects, params = list("Scopus", "CrossRef"), color = "#56B4E9", active = T), 
    list(query = intersects, params = list("Scopus", "Lens"), color = "#009E73", active = T), 
    list(query = intersects, params = list("Lens", "CrossRef"), color = "#F0E442", active = T),
    list(query = intersects, params = list("Scopus"), color = "#D55E00", active = T),
    list(query = intersects, params = list("CrossRef"), color = "#0072B2", active = T),
    list(query = intersects,params = list("Lens"), color="#CC79A7", active = T)))

336 titles that are unique to the Lens can be surprising, if we recall the high resemblance between CrossRef and Lens records that we observed earlier. What are those differences? Let’s review a few examples:

10.1016/j.mencom.2018.03.039 has the title “Diffusivity of crude oils contained in macroporous medium: 1 H NMR study” both in Scopus and CrossRef, while in the Lens the white space between 1 and H is missing, so the title is “Diffusivity of crude oils contained in macroporous medium: 1H NMR study”.
10.1134/s1063774518050322 has the opposite problem - Scopus and CrossRef have the title “Novel Polyoxovanadate K2ZnV5O14: Crystal Structure and Peculiarities of Crystal Chemistry”, but the Lens added extra spaces around the chemical formula: “Novel Polyoxovanadate K 2 ZnV 5 O 14 : Crystal Structure and Peculiarities of Crystal Chemistry”.
10.1134/s0044513418070152 in the Lens has a neat title in Cyrillic: “Первое описание личинки рода Clinterocera Motschulsky (Coleoptera, Scarabaeidae, Cetoniinae)”. The CrossRef title reminds us that the editors are humans, usually helpless with line breaks & tabs - “Первое описание личинки рода\n \n Clinterocera\n \n Motschulsky (Coleoptera, Scarabaeidae, Cetoniinae)”. The record is absent in Scopus.
10.15826/qr.2018.2.302 has similar spelling of the title in CrossRef & Scopus: “The Plans for the Abolition of the Zaporozhian Host and their Implementation (1740s–1770s): Cossack Ambitions vs Imperial Interests”, but the Lens somehow got a title in Cyrillic: “Процесс упразднения Войска Запорожского Низового: амбиции казачества vs интересы империи (1740–1770-е гг.)”.
10.1134/s1063774518070118 contains a motif “Bi 1 – x Pr x FeO 3” in Scopus, “Bi1 – xPrxFeO3” in CrossRef & the Lens.
10.15389/agrobiology.2018.6.1142eng is present only in CrossRef and the Lens. It is difficult even to guess where the latter got an extra white space for the line “..MUTATION HCD DOES NOT IMPACT MILK PRODUCTIVITY…” (between Impact and Milk). Sounds like a great title for a publication in “Scientometrics”.
And the last one, 10.1134/s1063774518010182. Find 101 differences.

Scopus variant: “Nanostructured Crystals of Fluorite Phases Sr1 – x RxF2 + x and Their Ordering: 12. Influence of Structural Ordering on the Fluorine-Ion Conductivity of Sr0.667 R 0.333F2.333 Alloys (R = Tb or Tm) at Their Annealing”.

CrossRef variant: “Nanostructured ... Phases Sr1 – xR\n x\n F2 + x and...Conductivity of Sr0.667R0.333F2.333 Alloys (R = Tb or Tm) at...”.

Lens version: “Nanostructured … Phases Sr 1 – x R x F 2 + x and … Conductivity of Sr 0.667 R 0.333 F 2.333 Alloys ( R = Tb or Tm) at…”.
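
Most of the examples above differ only in white space, which the normalization used for the diagram keeps. A stricter key that drops white space as well would treat such titles as equal. A minimal sketch in base R, where `title_key` is a hypothetical helper and not the normalization used for the diagram:

```r
# Hypothetical stricter key: lower case, letters and digits only (no white space)
title_key <- function(title) gsub("[^[:alnum:]]", "", tolower(title))

# Two spellings of the same title quoted above (DOI 10.1134/s1063774518050322)
crossref_title <- "Novel Polyoxovanadate K2ZnV5O14: Crystal Structure and Peculiarities of Crystal Chemistry"
lens_title     <- "Novel Polyoxovanadate K 2 ZnV 5 O 14 : Crystal Structure and Peculiarities of Crystal Chemistry"

raw_equal <- crossref_title == lens_title                        # compares the raw strings
key_equal <- title_key(crossref_title) == title_key(lens_title)  # compares the keys
```

Such a key will not help when one database stores the Cyrillic title and another stores the English one, and it may glue together different titles that differ only in spacing or punctuation, so it is better suited for finding candidates than for final matching.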

CONCLUSIONS

Scopus tends to capture the English version of the titles, while CrossRef and the Lens stick to the original version (which in our study was often Cyrillic). In recent years Scopus seems to capture the non-English titles as well, but this does not seem to be a strict policy.
When the titles are present in one language, they can be modified by extra white spaces or other invisible symbols (e.g. line breaks, tabs, carriage returns). Such distortions can appear in any database, seemingly more frequently in titles with chemical formulae, brackets, or special symbols.
Scopus indexes the journals directly and, apparently, without checking the DOIs for validity. Or, to put it another way, Elsevier allows the editors of >20k journals to spoil the quality of Scopus data. What?
The coverage of 2019 journal contents in CrossRef and the Lens is almost identical, confirming the regular updates claimed by the latter. The coverage of some titles in Scopus can be very low.
The Lens seems to do a great job of cleaning the errors that the publishers upload to CrossRef (like invisible symbols) and merging the duplicated versions of the publications created by the MAG, yet its remedies are also not 100% effective.

This part is done. I am still committed to testing these datasets further and finding out which database suits better for specific & practical tasks. It may take a week, I guess.

ACKNOWLEDGEMENTS

I am grateful to the Lens & CrossRef teams for what they do. Many thanks to all the experts who care about sharing their experience and contribute to the community with free tutorials and kind advice. Love to the dearest R community.

CITATION

DOI

CONTACTS

Twitter

Figshare

REFERENCES

Hadley Wickham (2017). tidyverse: Easily Install and Load the ‘Tidyverse’. R package version 1.2.1. https://CRAN.R-project.org/package=tidyverse

Scott Chamberlain, Hao Zhu, Najko Jahn, Carl Boettiger and Karthik Ram (2019). rcrossref: Client for Various ‘CrossRef’ ‘APIs’. R package version 0.9.2. https://CRAN.R-project.org/package=rcrossref

John Muschelli (2018). rscopus: Scopus Database ‘API’ Interface. R package version 0.6.3. https://CRAN.R-project.org/package=rscopus

Mark van der Loo (2014). The stringdist package for approximate string matching. The R Journal, 6, 111-122. https://CRAN.R-project.org/package=stringdist

Winston Chang (2014). extrafont: Tools for using fonts. R package version 0.17. https://CRAN.R-project.org/package=extrafont

Yihui Xie (2019). knitr: A General-Purpose Package for Dynamic Report Generation in R. R package version 1.23.

Yihui Xie, Joe Cheng and Xianying Tan (2019). DT: A Wrapper of the JavaScript Library ‘DataTables’. R package version 0.7. https://CRAN.R-project.org/package=DT

Nils Gehlenborg (2019). UpSetR: A More Scalable Alternative to Venn and Euler Diagrams for Visualizing Intersecting Sets. R package version 1.4.0. https://CRAN.R-project.org/package=UpSetR