Digital Object Identifiers (DOIs) are essential for academic research because they provide a persistent link to digital content. Two key platforms related to DOIs are DOI.org and CrossRef, which serve different purposes and differ in scope. When it comes to resolving DOIs, however, they should agree: every DOI registered with CrossRef should also resolve at DOI.org, so the counts in CrossRef should match the counts in DOI.org. This blog post analyzes how many DOIs can be found on each platform and why the differences exist.
DOI.org is the official site of the International DOI Foundation (IDF). It serves as a platform for resolving DOIs from any registration agency. Users can search and resolve DOIs regardless of the agency that registered them.
DOI.org acts as a global directory, connecting users to resources associated with any DOI.
CrossRef is the largest DOI registration agency, primarily focused on academic and research content. It not only registers DOIs but also manages detailed metadata, making it easier to search, cite, and link academic publications.
CrossRef specializes in registering DOIs for scholarly content and provides tools for managing metadata.
We identified all the DOIs used by journals using OJS (JUOJS) from 2020 to 2023. We then looked up each DOI in DOI.org and recorded whether the DOI existed and which registration agency it belonged to. We also retrieved all the DOIs available in CrossRef that were issued under the DOI prefixes of the journals using OJS. Finally, we compared the matches between the two sources: the DOIs found in CrossRef against the DOIs that DOI.org reports as registered under the CrossRef agency.
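The per-DOI check against DOI.org can be sketched as follows. This is a minimal sketch under stated assumptions, not the script used for this post: it assumes the DOI.org "Which RA?" lookup at `https://doi.org/doiRA/<doi>`, which returns a small JSON document naming the registration agency (or a status message when the DOI is unknown), and it assumes the `jsonlite` package is installed. The helper names `doi_ra_url()` and `lookup_ra()` are hypothetical.

```r
# Build the "Which RA?" lookup URL for one DOI.
# Assumption: DOI.org exposes the lookup at https://doi.org/doiRA/<doi>.
doi_ra_url <- function(doi) {
  # DOIs are case-insensitive, so normalise case and whitespace first
  doi <- tolower(trimws(doi))
  # Encode unsafe characters but keep the prefix/suffix slash intact
  paste0("https://doi.org/doiRA/", URLencode(doi, reserved = FALSE))
}

# Query DOI.org for one DOI and return a one-row data frame with the
# columns DOI, RA and status. Needs network access and jsonlite.
lookup_ra <- function(doi) {
  res <- tryCatch(
    jsonlite::fromJSON(doi_ra_url(doi)),
    error = function(e) NULL  # network or parse failure: return NAs below
  )
  # Pull one field out of the response if it is present, otherwise NA
  field <- function(name) {
    if (is.data.frame(res) && name %in% names(res)) {
      as.character(res[[name]][1])
    } else {
      NA_character_
    }
  }
  data.frame(DOI = doi, RA = field("RA"), status = field("status"),
             stringsAsFactors = FALSE)
}

# Example call (commented out because it needs network access):
# lookup_ra("10.1000/182")
```

For millions of DOIs, a loop over `lookup_ra()` would need rate limiting and retries, which this sketch leaves out.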
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.0 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# First, load three datasets: the DOIs from the OJS dump, the DOIs checked in DOI.org,
# and the DOIs found in CrossRef (downloaded for all the DOI prefixes of the journals using OJS)
# This is the list of journals using OJS (JUOJS) and their DOIs from 2020 onwards
journals_Dois_gt_2020 <- read_delim("C:/Users/dgenk/Documentos Locales/ScholCommLab/OpenAlex Coverage/data/beacon_all_journals_Dois_gt_2020.txt",
delim = "\t", escape_double = FALSE,
trim_ws = TRUE)
## Rows: 4656114 Columns: 4
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (2): issn, doi
## dbl (2): context_id, year
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# This is the deduplicated list of DOIs from 2020 onwards, without journal information
Dois_gt_2020 <- Beacon_all_DOIs_dedup_gt_2020 <- read_delim("C:/Users/dgenk/Documentos Locales/ScholCommLab/OpenAlex Coverage/data/Beacon_all_DOIs_dedup_gt_2020.txt",
delim = "\t", escape_double = FALSE,
col_names = FALSE, trim_ws = TRUE)
## Rows: 2487458 Columns: 1
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (1): X1
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Dois_gt_2020 <- Dois_gt_2020 %>% rename( DOI = X1)
#This is the list of all deduplicated DOIs and the result provided by DOI.org
DOIOrg <- read_delim("C:\\Users\\dgenk\\Documentos Locales\\ScholCommLab\\OpenAlex Coverage\\data\\Beacon_all_DOIs_dedup_gt_2020_Checked.txt",
delim = "\t", escape_double = FALSE,
trim_ws = TRUE)
## Rows: 2482260 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (3): DOI, RA, status
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Keep only the DOIs whose registration agency (RA), according to DOI.org, is Crossref
DOIOrg <- DOIOrg %>% filter(RA=="Crossref")
#This is the list of all DOIs that CrossRef has for every DOI prefix in the dump
crossref <- read_delim("C:\\Users\\dgenk\\Documentos Locales\\ScholCommLab\\OpenAlex Coverage\\data\\crossRefOJSDOIs\\crossref_prefixes_dois.txt",
delim = "\t", escape_double = FALSE,
col_names = FALSE,
trim_ws = TRUE)
## Rows: 5053259 Columns: 1
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (1): X1
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
crossref <- crossref %>% rename( DOI = X1)
# Keep only the CrossRef DOIs that are used by JUOJS from 2020 onwards
crossref <- crossref %>% inner_join(Dois_gt_2020, by = "DOI")
#calculate total number of JUOJS DOIs in CrossRef - according to DOI org
numDOIsDOIOrg <- DOIOrg %>% unique() %>% count()
#calculate total number of JUOJS DOIs in CrossRef - according to CrossRef
numDOIsCrossRef <- crossref %>% unique() %>% count()
#calculate difference of DOIs found in DOI org and CrossRef
overallDiff <- numDOIsDOIOrg$n - numDOIsCrossRef$n
# Venn regions: intersection, DOIs in CrossRef but not in DOI.org, DOIs in DOI.org but not in CrossRef
intersection <- crossref %>% inner_join(DOIOrg, by= "DOI")
crossref_NOT_DOIOrg <- crossref %>% anti_join(DOIOrg, by= "DOI")
DOIOrg_NOT_crossref <- DOIOrg %>% anti_join(crossref, by= "DOI")
#calculate numbers
numIntersection <- intersection %>% unique () %>% count()
numCrossref_NOT_DOIOrg <- crossref_NOT_DOIOrg %>% unique () %>% count()
numDOIOrg_NOT_crossref <- DOIOrg_NOT_crossref %>% unique () %>% count()
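One thing worth checking in a comparison like this is letter case. DOIs are case-insensitive, so an exact string join treats `10.1234/ABC` and `10.1234/abc` as different identifiers. The sketch below runs the same three-way comparison on made-up DOIs, using base R set operations in place of the dplyr joins, after normalising case and whitespace. It is only an illustration: `normalise_doi()` is a hypothetical helper, the DOIs are invented, and whether case mismatches affect the real data has not been verified here.

```r
# Hypothetical helper: lower-case, trim and deduplicate DOI strings
normalise_doi <- function(x) unique(tolower(trimws(x)))

# Made-up DOIs standing in for the two sources
doiorg_toy   <- c("10.1234/abc", "10.1234/DEF", "10.1234/ghi")
crossref_toy <- c("10.1234/ABC", "10.1234/def ", "10.1234/xyz")

# Exact string matching finds nothing in common
exact_matches <- intersect(doiorg_toy, crossref_toy)

# After normalisation the shared DOIs line up
a <- normalise_doi(doiorg_toy)
b <- normalise_doi(crossref_toy)

both          <- intersect(a, b)  # in both sources
doiorg_only   <- setdiff(a, b)    # in DOI.org but not in CrossRef
crossref_only <- setdiff(b, a)    # in CrossRef but not in DOI.org

cat(
  "Exact matches:", length(exact_matches), "\n",
  "Normalised matches:", length(both), "\n",
  "Only in DOI.org:", length(doiorg_only), "\n",
  "Only in CrossRef:", length(crossref_only), "\n"
)
```

In the toy data, the exact comparison finds no common DOIs, while the normalised one finds two, with one DOI left over on each side.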
The following are our findings:
- Number of DOIs found in DOI.org vs CrossRef
cat(
"Intersection:", numIntersection$n, "\n",
"In CrossRef but not in DOIOrg:", numCrossref_NOT_DOIOrg$n, "\n",
"In DOIOrg but not in CrossRef:", numDOIOrg_NOT_crossref$n, "\n"
)
## Intersection: 1210784
## In CrossRef but not in DOIOrg: 2350
## In DOIOrg but not in CrossRef: 698053
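As a quick consistency check, the three regions printed above can be added back up into a total for each source. The sketch below hard-codes the printed figures and assumes each DOI appears only once in each list; the variable names are illustrative.

```r
# Counts copied from the output above
n_both          <- 1210784  # found in both sources
n_crossref_only <- 2350     # in CrossRef but not in DOI.org
n_doiorg_only   <- 698053   # in DOI.org but not in CrossRef

# Totals per source, rebuilt from the three regions
total_doiorg   <- n_both + n_doiorg_only    # Crossref-registered DOIs per DOI.org
total_crossref <- n_both + n_crossref_only  # DOIs returned by CrossRef

# Gap between the two sources, and the share of DOI.org's
# Crossref-registered DOIs that were not found in CrossRef
gap           <- total_doiorg - total_crossref
share_missing <- n_doiorg_only / total_doiorg

cat(
  "DOI.org total:", format(total_doiorg, big.mark = ","), "\n",
  "CrossRef total:", format(total_crossref, big.mark = ","), "\n",
  "Gap:", format(gap, big.mark = ","), "\n",
  "Share not found in CrossRef:", sprintf("%.1f%%", 100 * share_missing), "\n"
)
```

The totals come to 1,908,837 DOIs for DOI.org and 1,213,134 for CrossRef, a gap of 695,703. About 36.6% of the DOIs that DOI.org attributes to Crossref were not found in the CrossRef download.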
Some examples of DOIs that are in CrossRef but not in DOI.org
Some examples of DOIs that are in DOI.org but not in CrossRef