Coverage of COVID research - Microsoft Academic vs. Semantic Scholar vs. Dimensions

Model
Results
Presence of MAG docs in Semantic Scholar and Dimensions
Disclaimer

Yesterday I published a quick comparison of the datasets of academic publications prepared by Lens, Dimensions, and Semantic Scholar (AI2): https://rpubs.com/alexeilutay/covid_coverage. The main observation was that these datasets listed significantly more publications than Scopus and Web of Science. Preprints were not the only reason: some of the missing documents were recently published articles from journals indexed by WoS/Scopus. Those will presumably appear there one day.

My approach was lightweight, with no proper checks: I just downloaded a few datasets that the providers claimed were constantly updated, counted DOIs, and then announced that Dimensions was the quickest indexer. I also left out Microsoft Academic, since I was not very experienced with its REST API. Shame on me.

Thanks to Darrin Eide https://twitter.com/DarrinEide/status/1242973784467857408 I realised how to run a term search in Microsoft Academic, so now I am going to run a few more tests, this time with MAG.

Model

I decided to query Microsoft Academic for all publications that were published in March 2020 (from March 1 up to the query date, March 26) and contained the words covid or coronavirus in the title or in the abstract.
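The search criteria map onto a Microsoft Academic query expression in which W matches words in the title, AW matches words in the abstract, and D constrains the publication date. As a minimal sketch, the expression can be assembled from a list of terms and a date range; the helper name build_mag_query is hypothetical and not part of any package.

```r
# Hypothetical helper (not part of microdemic): build a Microsoft Academic
# query expression for a set of terms and a publication-date range.
build_mag_query <- function(terms, from, to) {
  title_part    <- sprintf("W='%s'", terms)   # W  = words in the title
  abstract_part <- sprintf("AW='%s'", terms)  # AW = words in the abstract
  term_expr <- paste0("Or(", paste(c(title_part, abstract_part), collapse = ","), ")")
  date_expr <- sprintf("D=['%s','%s']", from, to)  # D = publication date range
  paste0("And(", term_expr, ", ", date_expr, ")")
}

query <- build_mag_query(c("covid", "coronavirus"), "2020-03-01", "2020-03-26")
print(query)
```

Building the string this way makes it easier to add another term or shift the date window without editing nested brackets by hand.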

To run such a query, one needs to:

register at https://www.microsoft.com/en-us/research/project/microsoft-academic-graph/
obtain an API key
decide how to make API requests. I used the R package microdemic (Scott Chamberlain and Christopher Baker (2020). microdemic: ‘Microsoft Academic’ API Client. https://github.com/ropensci/microdemic (devel), https://docs.ropensci.org/microdemic (website))

The code looked like this:

library(microdemic)
library(dplyr)  # for bind_rows()

query <- "And(Or(W='covid',W='coronavirus',AW='covid',AW='coronavirus'), D=['2020-03-01','2020-03-26'])"

# find out how many results the query returns
k <- ma_calchist(query = query)
n_res <- k$num_entities

# number of requests needed at 100 records per request
iterations <- ceiling(n_res / 100)

# empty data frame to collect the results
data <- data.frame(stringsAsFactors = FALSE)

for (i in seq_len(iterations)) {
  # the last request fetches only the remaining records
  count <- ifelse(i != iterations, 100, n_res - 100 * (iterations - 1))
  offset <- (i - 1) * 100
  print(i)
  data.s <- ma_evaluate(query = query, count = count, offset = offset,
                        atts = c("Id", "Ti", "Y", "D", "CC",
                                 "DN", "DOI", "VFN",
                                 "FP", "LP", "V", "I", "BV", "BT"))
  data <- bind_rows(data, data.s)
}
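The paging arithmetic in the loop above (100 records per request, with a shorter last page) can be checked in isolation before any request is sent. The sketch below is an illustration only; paging_plan is a hypothetical helper, and 947 is the result count reported later in this post.

```r
# Hypothetical helper: split n_res results into pages of at most page_size
# records and return the offset and count for each request.
paging_plan <- function(n_res, page_size = 100) {
  if (n_res <= 0) {
    # nothing to fetch: return an empty plan instead of looping
    return(data.frame(offset = integer(0), count = integer(0)))
  }
  offsets <- seq(0, n_res - 1, by = page_size)
  counts  <- pmin(page_size, n_res - offsets)  # last page may be shorter
  data.frame(offset = offsets, count = counts)
}

plan <- paging_plan(947)
print(tail(plan, 2))
```

Each row of plan corresponds to one ma_evaluate() call, and the counts add up to the total number of results.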

Then, within just one hour, I checked whether those DOIs were present in Dimensions or Semantic Scholar. I did not have API access to Dimensions, so do not blame me for my barbarian approach.

library(rvest)
library(httr)
library(dplyr)    # for %>% and mutate()
library(stringr)  # for str_extract()

data <- data %>% mutate(dim = "")

for (i in seq_len(NROW(data))) {
  print(i)
  if (!is.na(data$DOI[i])) {
    doiq <- data$DOI[i]
    # the request opens the search page as a browser would
    query <- paste0(
      'https://app.dimensions.ai/discover/publication?search_text=',
      doiq,
      '&search_type=kws&search_field=doi')
    # use a regex to find the "magic" line confirming that a result was found
    na.test <- GET(query) %>% content() %>% html_text(trim = TRUE) %>%
      str_extract("document_count[^\\,]+1\\,") %>% is.na()

    data$dim[i] <- ifelse(na.test, "absent", "present")
  }
}
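The "magic line" pattern used above can be tried on a few strings without touching the website. In the sketch below the sample fragments are hypothetical (the real page markup may differ), and strict_pattern is an assumption of mine rather than something from the original analysis; it only illustrates that the loose pattern would also accept a count such as 11.

```r
# Pattern equivalent to the one used in the loop above
loose_pattern <- "document_count[^,]+1,"
# Stricter variant (an assumption, not the original code): only non-digit
# characters may sit between the key and a single digit 1
strict_pattern <- "document_count[^0-9,]*1,"

# Hypothetical page fragments; the real markup may differ
samples <- c(one    = '"document_count": 1, "x": 0',
             none   = '"document_count": 0, "x": 0',
             eleven = '"document_count": 11, "x": 0')

loose  <- unname(grepl(loose_pattern, samples))
strict <- unname(grepl(strict_pattern, samples))
print(data.frame(sample = names(samples), loose, strict))
```

For a DOI search a count above one should be rare, so the difference probably does not matter much here, but it is worth knowing what the pattern accepts.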

Semantic Scholar provides a free API for DOI lookups https://api.semanticscholar.org/, so this part was simpler.

library(rvest)
library(httr)
library(dplyr)  # for %>% and mutate()

data <- data %>% mutate(ss = "")

for (i in seq_len(NROW(data))) {
  print(i)
  if (!is.na(data$DOI[i])) {
    doiq <- data$DOI[i]
    query <- paste0('https://api.semanticscholar.org/v1/paper/', doiq)
    # parsed response; an "error" element is taken to mean the DOI was not found
    na.test <- GET(query) %>% content()
    data$ss[i] <- ifelse("error" %in% names(na.test), "absent", "present")
  }
}
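The loop above treats any response with an "error" element as "absent". A slightly more cautious variant would look at the HTTP status code instead, so that a failed or throttled request is not counted as a missing document. The sketch below assumes that 200 means the paper was found and 404 means it was not; that mapping and the helper name classify_status are my assumptions, not something documented in this post.

```r
# Hypothetical helper: map HTTP status codes to a lookup result.
# Assumption: 200 = found, 404 = not found; anything else (for example a
# rate-limit or server error) stays NA so it can be retried later.
classify_status <- function(code) {
  out <- rep(NA_character_, length(code))
  out[code == 200] <- "present"
  out[code == 404] <- "absent"
  out
}

print(classify_status(c(200, 404, 429)))
```

Inside the loop the code could be obtained with httr::status_code(GET(query)), and the rows left as NA could be re-queried after a pause.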

Results

The search in Microsoft Academic returned 947 documents. The metadata contains the attribute D (publication date); its latest value was March 11, more than 2 weeks before the query date.

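library(ggplot2)
library(scales)  # for pretty_breaks() and percent()

# scale_x_date() needs a Date column; as.Date() leaves existing Dates unchanged
data %>% mutate(D = as.Date(D)) %>%
  ggplot + geom_bar(aes(x = D), fill = "coral")+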
  scale_x_date(date_breaks = "2 days", date_labels = "%d/%m")+
  labs(title = "Publication Dates (\"D\") of COVID articles in Microsoft Academic", 
       x = NULL, y = "Documents",
       caption = "Date of query: March 26, 2020",
       subtitle = NULL) +
  scale_y_continuous(breaks=pretty_breaks(),
                     expand = expansion(mult=c(0,0.2)))+
  theme_classic()+
  theme(text = element_text(size=11),
        strip.background = element_rect(size = 0, fill = "grey95"),
        panel.grid.major = element_line(size=0.2, linetype = 2, color="grey70"),
        panel.grid.minor = element_blank(),
        plot.title = element_text(size = rel(1.5), face = "bold"),
        plot.subtitle = element_text(size = rel(1)),
        legend.position = "bottom", 
        legend.title=element_text(size=rel(0.8)),
        legend.text=element_text(size=rel(0.8)),
        plot.margin=margin(5,5,5,5))

The publications are listed in the table below.

data %>%
  select(Id, doi = DOI, date = D, 
         source_title = BV, type = BT,
         in_Dim = dim, in_S2 = ss) %>% 
  arrange(desc(date)) %>% 
  DT::datatable(rownames = FALSE, escape = FALSE,
      filter = "top", 
      options = list(dom = "tp",
                     columnDefs = list(
                       list(width = '500px', targets = c(4)),
                       list(width = '40px', targets = c(1:2)))))

There are so many media sources that I could not believe my eyes. A quarter of the documents have FOX in the source title, which is almost twice the number of all Rxiv preprints.

data %>% count(BV) %>% arrange(desc(n)) %>% 
  mutate(share = percent(n/sum(.$n))) %>% 
   DT::datatable(rownames = FALSE, escape = FALSE,
      filter = "top", options = list(dom = "tp"))

Presence of MAG docs in Semantic Scholar and Dimensions

data %>% 
  filter(!is.na(DOI)) %>% 
  count(in_Dimensions = dim, in_Semantic_Scholar = ss) %>% 
  mutate(share = percent(n/sum(.$n), accuracy = 0.1))

75% of the DOIs found in MAG are also present in both Dimensions and Semantic Scholar, and just 3 documents seem to be unique to MAG. They can easily be found by filtering the table above.
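As an alternative to filtering the table by hand, the documents found in neither source can be pulled out directly. The sketch below uses a toy data frame with made-up values and the same column names as in this post (DOI, dim, ss); it is an illustration in base R, not the code behind the numbers above.

```r
# Toy data with the same columns as in the post (all values are made up)
toy <- data.frame(
  DOI = c("10.1/a", "10.1/b", "10.1/c", "10.1/d", NA),
  dim = c("present", "present", "absent", "absent", ""),
  ss  = c("present", "absent", "present", "absent", ""),
  stringsAsFactors = FALSE
)

# Keep only records that have a DOI, as in the count above
with_doi <- toy[!is.na(toy$DOI), ]

# Cross-tabulate presence in the two other sources
overlap <- table(in_Dimensions = with_doi$dim,
                 in_Semantic_Scholar = with_doi$ss)
print(overlap)

# DOIs found in neither source, i.e. unique to MAG in this toy set
mag_only <- with_doi$DOI[with_doi$dim == "absent" & with_doi$ss == "absent"]
print(mag_only)
```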

Even though MAG contained some documents that were absent from Dimensions, a quick glance at the Dimensions dataset is enough to feel the difference.

The Dimensions dataset (I downloaded the current version, as of 2020-03-26) contains 1451 publications fitting our search criteria (i.e. the publication date is in March 2020 and the terms “covid” or “coronavirus” are present in the title or the abstract).

library(dplyr)
library(stringr)  # for str_squish()

dim <- readxl::read_xlsx("D:/Data/covid/Dimensions_20200326.xlsx",
                         sheet = "Publications") %>%
  select(doi = DOI, title = Title, abs = Abstract,
         date = `Publication Date`,
         year = PubYear, srctitle = `Source title`) %>%
  # note: this filter matches the title only; the abs column is not searched
  filter(grepl("covid|coronavirus", title, ignore.case = TRUE)) %>%
  mutate(doi = str_squish(tolower(doi))) %>%
  filter(!is.na(doi)) %>% unique() %>%
  filter(grepl("^2020-03", date)) %>%
  mutate(date = as.Date(date, "%Y-%m-%d"))
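The filter above looks for the terms in the title column only, while the stated criteria mention the title or the abstract. A minimal sketch of matching either field is shown below on a toy data frame with made-up rows; whether it would change the count of 1451 has not been checked here.

```r
# Toy rows (made up) using the column names from the code above
toy <- data.frame(
  title = c("COVID-19 in Wuhan", "A viral outbreak", "Unrelated study"),
  abs   = c(NA, "We study the novel coronavirus.", "Nothing relevant."),
  stringsAsFactors = FALSE
)

pattern <- "covid|coronavirus"
in_title    <- grepl(pattern, toy$title, ignore.case = TRUE)
in_abstract <- grepl(pattern, toy$abs, ignore.case = TRUE)  # NA abstracts give FALSE

# Keep a row when the term appears in either field
matched <- toy[in_title | in_abstract, ]
print(matched$title)
```

In the dplyr pipeline the same condition could be written as filter(grepl(pattern, title, ignore.case = TRUE) | grepl(pattern, abs, ignore.case = TRUE)).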

dim %>% ggplot + geom_bar(aes(x = date), 
                          fill = "#74a9cf")+
  scale_x_date(date_breaks = "3 days", 
               date_labels = "%d/%m")+
  labs(title = "Publication Dates of COVID articles in Dimensions (March 2020)", 
       x = NULL, y = "Documents",
       caption = "Date of query: March 26, 2020",
       subtitle = NULL) +
  scale_y_continuous(breaks=pretty_breaks(),
                     expand = expansion(mult=c(0,0.2)))+
  theme_classic()+
  theme(text = element_text(size=11),
        strip.background = element_rect(size = 0, fill = "grey95"),
        panel.grid.major = element_line(size=0.2, linetype = 2, color="grey70"),
        panel.grid.minor = element_blank(),
        plot.title = element_text(size = rel(1.5), face = "bold"),
        plot.subtitle = element_text(size = rel(1)),
        legend.position = "bottom", 
        legend.title=element_text(size=rel(0.8)),
        legend.text=element_text(size=rel(0.8)),
        plot.margin=margin(5,5,5,5))

The chart below shows the overlap between MAG and Dimensions for COVID-related articles that have a DOI and were published in March.

colorset <- c("Dimensions" = "#74a9cf",
              "Microsoft Academic" = "coral",
              "Microsoft Academic+Dimensions" = "violet")

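# MAG records without a DOI are dropped so that NA is not counted as a document
rbind(data %>% filter(!is.na(DOI)) %>% select(doi = DOI) %>%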
    mutate(doi = tolower(doi), db = "Microsoft Academic") %>% unique(),
  dim %>% select(doi) %>% 
    mutate(doi = tolower(doi), db = "Dimensions") %>% unique()) %>% 
  group_by(doi) %>% 
  summarize(s = paste0(unique(db), collapse = "+")) %>% 
  ungroup() %>% count(s) %>%  
  ggplot + geom_bar(aes(x = s, y = n, fill = s), 
                    stat = "identity", position = "dodge")+
  coord_flip()+
  scale_fill_manual(name = NULL, 
                    values = colorset)+
  labs(title = "March COVID Publications in Dimensions and MAG", 
       x = NULL, y = "Documents",
       caption = "Date of query: March 26, 2020",
       subtitle = "covid | coronavirus in title or abstract ") +
  scale_y_continuous(breaks=pretty_breaks(),
                     expand = expansion(mult=c(0,0.2)))+
  theme_classic()+
  theme(text = element_text(size=11),
        axis.text.y = element_blank(),
        strip.background = element_rect(size = 0, fill = "grey95"),
        panel.grid.major = element_line(size=0.2, linetype = 2, color="grey70"),
        panel.grid.minor = element_blank(),
        plot.title = element_text(size = rel(1.5), face = "bold"),
        plot.subtitle = element_text(size = rel(1)),
        legend.position = "bottom", 
        legend.title=element_text(size=rel(0.8)),
        legend.text=element_text(size=rel(0.8)),
        plot.margin=margin(5,5,5,5))

Good work, Dimensions!
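The three bars in the chart above correspond to simple set operations on the two DOI lists. The sketch below reproduces that logic in base R on made-up DOIs, as a way to cross-check the counts; the vectors mag_doi and dim_doi are hypothetical stand-ins for the DOI columns of the two datasets.

```r
# Made-up DOIs standing in for the two datasets
mag_doi <- tolower(c("10.1/A", "10.1/b", "10.1/c", NA))
dim_doi <- tolower(c("10.1/a", "10.1/b", "10.1/d", "10.1/e"))

# Drop missing values and duplicates before comparing
mag_doi <- unique(mag_doi[!is.na(mag_doi)])
dim_doi <- unique(dim_doi[!is.na(dim_doi)])

overlap_counts <- c(
  "Microsoft Academic"            = length(setdiff(mag_doi, dim_doi)),
  "Microsoft Academic+Dimensions" = length(intersect(mag_doi, dim_doi)),
  "Dimensions"                    = length(setdiff(dim_doi, mag_doi))
)
print(overlap_counts)
```

Lower-casing both lists first matters because DOIs are case-insensitive and the two sources may store them differently.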

Disclaimer

This post was just an exercise. I was curious about the topic, and decided to share the results. Anyone can download the datasets and continue this study.

Aleksei Lutay

March 26, 2020
