INTRO

This post is an addendum to the study of 24 academic journals of Russian origin and their presence in 3 academic databases - the Lens, Crossref, and Scopus. Part 1 dealt with bibliographic metadata, Part 2 described the accuracy of the author and affiliation metadata, and Part 3 was devoted to funding information.

I did not include Web of Science in that study for a few reasons, the crucial one being that I did not have access to WoS. After the release of the last part I asked my FB friends for assistance, and quite soon got the txt exports of WoS data. I fully satisfied my curiosity and felt a need to produce this report.

PLAN OF STUDY

Both Scopus and Web of Science were searched using a list of ISSNs (20758251 OR 23084057 OR 09599436 OR 00970549 OR 2311911x OR 00172278 OR 10634541 OR 00167029 OR 1026051x OR 00445134 OR 19967756 OR 00241148 OR 23093994 OR 00310301 OR 01316095 OR 20720351 OR 01316397 OR 00360279 OR 10637869 OR 18138691 OR 2078502x OR 00109525 OR 10637745 OR 02023822 OR 00150541 OR 23109599 OR 1364551x OR 23136871 OR 15561968 OR 20718721 OR 24108731 OR 15556174 OR 20720378 OR 23134836 OR 14684829 OR 14684780 OR 18138705 OR 16083075 OR 1562689x OR 15738493), limited to the publication years 2018-2019.

Both databases provide two fields: one with the original acknowledgement text (named Funding text in Scopus and FX in WoS), and another that contains the list of recognized funder names & award numbers (named Funding details in Scopus and FU in WoS). I also use the term “funding items” to refer to any item of the Funding details - either a funder name, or an award number, or both.

As we observed in Part 3, not all the funding items present in the “Funding text” are recognized correctly, so the “Funding details” field can be empty even if the “Funding text” field contains some info.
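To make this concrete, here is a made-up two-row illustration (these are not real records): the first row has both fields populated, while the second contains an acknowledgement whose funder was not recognized, so the Funding details field is empty.

library(tibble)

example <- tribble(
  ~fun.text,                                             ~fun.details,
  "This work was supported by RFBR, grant 18-05-00001.", "Russian Foundation for Basic Research, RFBR: 18-05-00001",
  "The study was funded by the Russian Science Fund.",   NA_character_
)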

My plan here is to compare the funding metadata contained in the Scopus and Web of Science extracts in order to find out which database indexes these journals faster, which one captures more funding texts, and which one recognizes the funder names and award numbers (funding items) more accurately.

SCOPUS DATA PROCESSING

A known problem with the Scopus CSV export is that the “Funding Details” column uses a special delimiter between the individual funders (it shows up as two consecutive line breaks when the file is read as raw text), which breaks CSV parsing. Therefore, we first read the CSV files as raw text, substitute the delimiter with ‘|’, save the result, and read it again.

ff <- list.files(paste0(dir,"/data/scopus2/"))
ff <- ff[grepl("scopus",ff) & grepl("csv",ff)]
sc.data <- data.frame()

for (i in 1:length(ff)){
  # read the raw file and replace the double line breaks with |
  sc.fund <- read_file(paste0(dir, "/data/scopus2/",ff[i]))     
  sc.fund <- gsub(pattern="\\\n\\\n", replacement = "\\|",sc.fund)
  # write it to a temporary file to read it again with read_csv
  write_lines(sc.fund, paste0(dir, "/data/scopus2/test.csv"))
  data <-  read_csv(paste0(dir, "/data/scopus2/test.csv"))
  nm <- names(data) 
  # merge the "Funding Text ..." columns into a single Fund_text column
  datax <- data %>% 
    unite(col="Fund_text", nm[grepl("Funding Text",nm)], sep="_")
  sc.data <- rbind(sc.data, datax)
}
sc.data %>% write_excel_csv(paste0(dir, "/data/scopus2/sc.data.csv"))

sc.fund <- sc.data %>% 
  select(DOI, "year"=Year, ISSN, "src.title"=`Source title`, eid=EID,
         auth.affils=`Authors with affiliations`,
         fun.details=`Funding Details`, fun.text = Fund_text) %>% 
  mutate(DOI=trimws(tolower(DOI)), ISSN=toupper(ISSN), 
         src.title=trimws(toupper(src.title))) %>% 
  # now we split the Funding Details back into the individual funding items 
  mutate(fun.details=strsplit(fun.details, split="\\|")) %>%
  unnest(fun.details) %>%
  # cleaning after
  mutate(fun.text=gsub("_NA|NA_|NA_NA","", fun.text)) %>% 
  mutate(fun.text= ifelse(nchar(fun.text)<2,NA,fun.text))

sc.fund %>% write_excel_csv(paste0(dir, "/data/scopus2/sc.fund.csv"))

WOS DATA PROCESSING

This is my first take on WoS text data; I decided to bite this bullet myself, so the code below may be quite amateurish. The WoS data came in a RIS-like plain-text format, i.e. as text strings preceded either by a 2-letter field code and 1 space (for new fields) or by 3 consecutive spaces (for the parts of a list structure within one field). The records are separated by the code ER and an empty line. Part of the code below is adapted from this post.
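For illustration, a hypothetical record in this format might look as follows (made-up values; only a few of the field codes are shown):

PT J
AU Ivanov, A. A.
   Petrov, B. B.
TI An example article title
SO GEOCHEMISTRY INTERNATIONAL
FU Russian Science Foundation [18-17-00001]
FX This study was supported by the Russian Science Foundation (project 18-17-00001).
ER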

fris<-list.files(paste0(dir,"/data/wos/"))
fris<-fris[grepl(".txt",fris)]
wos.data <- data.frame()

for (i in 1:length(fris)){
  x <- paste0(dir, "/data/wos/",fris[i])
# joining the 3-space continuation lines back to their field with a | delimiter  
  wos.i <- read_file(x)     
  wos.i <- gsub(pattern="\\\n\\s\\s\\s", replacement = "\\|",wos.i) 
  write_lines(wos.i, paste0(dir, "/data/wos/test.csv"))
  data <- readLines(paste0(dir, "/data/wos/test.csv"), encoding = "UTF-8")
  df <- data.frame(article = NA, ris = data, stringsAsFactors = FALSE)
  end_of_records <- rev(grep(pattern = '^ER', data))
  article = length(end_of_records)

# numbering the records: assign an article id to every line, counting back from the last ER  
for(j in end_of_records){
  df$article[1:j] <- article
  article = article - 1}

## keeping only the lines with a field code and removing the heavy abstract/keyword/note fields - we do not need them here
df <- df %>%
  filter(grepl('^[A-Z0-9]{2}\\s+', ris)) %>% 
  filter(!grepl('^KW', ris)) %>%
  filter(!grepl('^AB', ris)) %>%
  filter(!grepl('^N1', ris))

# extracting the field codes
df <- df %>% 
  mutate(code=substr(ris,1,2)) %>% 
  mutate(value=sapply(str_replace_all(ris, 
                                      pattern='^[A-Z0-9]{2}\\s+',
                                      replacement=""),
                      function(x) unlist(x, use.names = F)))

# converting into a wide data frame, gluing the multi-line values with | (to store in csv format)  
df2 <- df %>% select(-ris) %>% 
  group_by(article, code) %>% 
  summarize(value=paste0(value, collapse="|")) %>% 
  ungroup() %>% 
  spread(code, value) %>% 
  select(AF, AU, C1, DA, DE, DI, DT, EI, FU, FX, ID, IS, 
         LA, NR, PY, RI, RP, SC, SN, SO, TC, UT, VL, WC)

wos.data <- rbind(wos.data, df2)
print(i)
}
wos.data %>% write_excel_csv(paste0(dir, "/data/wos/wos.data.csv"))

wos.data <- read_csv(paste0(dir, "/data/wos/wos.data.csv"))
wos.fund <- wos.data %>% 
  select(DOI=DI, "year"=PY, print=SN, electronic=EI,  
         "src.title"=SO, ut=UT,
         auth.affils=C1,
         fun.details=FU, fun.text = FX) %>%
# the WoS data contains 2 ISSNs (print and electronic), so I match them against the single ISSN reported by Scopus
  mutate(print=gsub("-","",print),
         electronic=gsub("-","",electronic)) %>% 
  mutate(ISSN=ifelse(print %in% sc.fund$ISSN, print,
                     ifelse(electronic %in% sc.fund$ISSN, electronic, NA))) %>% 
  mutate(DOI=trimws(tolower(DOI)), ISSN=toupper(ISSN), 
         src.title=trimws(toupper(src.title))) %>% 
# splitting and unnesting the records structured like 
# funder1 [award1, award2]; funder2 [award3, award4]  
  mutate(fun.details=gsub("\\|"," ",fun.details),
         fun.text=gsub("\\|"," ",fun.text)) %>%
  mutate(fun.details=strsplit(fun.details, split="; ")) %>%
  unnest(fun.details) 

wos.fund %>% write_excel_csv(paste0(dir, "/data/wos/wos.fund.csv"))

DATA FOR ANALYSIS

As the list of 24 journals was inherited from another study, we had to narrow it down to the 16 titles indexed in both Scopus and WoS.
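One way to do this narrowing programmatically (a minimal sketch, assuming the sc.fund and wos.fund objects created above and the tidyverse loaded) is to keep only the ISSNs found in both extracts:

issn_common <- intersect(unique(sc.fund$ISSN), na.omit(unique(wos.fund$ISSN)))
sc.fund  <- sc.fund  %>% filter(ISSN %in% issn_common)
wos.fund <- wos.fund %>% filter(ISSN %in% issn_common)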

Scopus

## Observations: 2,457
## Variables: 8
## $ DOI         <chr> "10.5800/gt-2019-10-1-0406", "10.5800/gt-2019-10-1...
## $ year        <dbl> 2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019, 20...
## $ ISSN        <chr> "2078502X", "2078502X", "2078502X", "2078502X", "2...
## $ src.title   <chr> "GEODYNAMICS AND TECTONOPHYSICS", "GEODYNAMICS AND...
## $ eid         <chr> "2-s2.0-85065102231", "2-s2.0-85065093463", "2-s2....
## $ auth.affils <chr> "Sharkov, E.V., Institute of Geology of Ore Deposi...
## $ fun.text    <chr> NA, NA, NA, "We sincerely thank Professor S. I. Sh...
## $ fun.details <chr> NA, NA, NA, "12-05-91161-GFEN-a", "National Natura...

Web of Science

## Observations: 2,807
## Variables: 8
## $ DOI         <chr> "10.1134/s0016702919070085", "10.1134/s00167029190...
## $ year        <dbl> 2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019, 20...
## $ ut          <chr> "WOS:000473495800001", "WOS:000473495800002", "WOS...
## $ auth.affils <chr> "[Moiseenko, T. I.] Russian Acad Sci, Vernadsky In...
## $ fun.text    <chr> "This study was financially supported by the Russi...
## $ ISSN        <chr> "00167029", "00167029", "00167029", "00167029", "0...
## $ fun.details <chr> "Russian Science Foundation [18-17-00184]", "Russi...
## $ src.title   <chr> "GEOCHEMISTRY INTERNATIONAL", "GEOCHEMISTRY INTERN...

TEST 1. WHO IS FASTER?

We compare the numbers of publications in each database.

aa <- left_join(
  wos.fund %>% group_by(src.title,ISSN) %>% summarize(WoS=n_distinct(ut)),
  sc.fund %>% group_by(src.title,ISSN) %>% summarize(Scopus=n_distinct(eid))
  ) %>% 
  mutate(diff=case_when(
    WoS>Scopus ~ "WOS", 
    WoS==Scopus ~ "EQUAL",
    WoS<Scopus ~ "SCOPUS"),
    max=pmax(WoS, Scopus)) 

aa %>% 
  group_by("a") %>% 
  summarize(WoS = sum(WoS), 
            Scopus = sum(Scopus)) %>% 
  ungroup() %>% 
  select(2:3) %>% 
  datatable(escape=FALSE, rownames = FALSE, options = list(dom = 't'), 
            width="200px") %>% 
  formatStyle(c(1:2), `text-align` = "center")

The color of a segment indicates which citation index contains more publications for a particular title.

ggplot(aa) +
  geom_segment(aes(x=reorder(ISSN,WoS), xend=reorder(ISSN,WoS),
                   y=0, yend=max), 
               color="grey20", size=0.25, linetype=2)+
  geom_segment(aes(x=reorder(ISSN,WoS), xend=reorder(ISSN,WoS),
                   y=WoS, yend=Scopus, color=diff), size=4)+
  geom_point(aes(x=reorder(ISSN,WoS), y=WoS),
             stroke=0.4, shape=21, size=0.5, fill="grey10")+
  geom_point(aes(x=reorder(ISSN,WoS), y=Scopus),
             stroke=0.4, shape=21, size=0.5, fill="grey10")+
  scale_y_continuous(expand = expand_scale(mult=c(0,0.15)))+
  coord_flip()+
  scale_color_manual(values=c("grey90","orange","violet"), 
                     name="INDEX WITH MORE\nPUBLICATIONS")+
  labs(title="NUMBER OF INDEXED PUBLICATIONS",
       subtitle="PUBYEAR: 2018-2019",
       caption="Accessed: July 17, 2019", y="NUMBER OF PUBLICATIONS",
       x="ISSN")+
  mytheme+
  theme(axis.ticks.y = element_blank())+
  ggsave(paste0(dir, "/wos_scopus_diff.png"),
         width=20, height = 10, units="cm", dpi=300)

For 11 journals WoS had more publications than Scopus, Scopus outperformed WoS for only 1 title (FIBRE CHEMISTRY), and the coverage was equal for 4 journals - overall, for the 16 selected journal titles, Web of Science contains 14% more publications than Scopus. Is an 11:4:1 ratio enough to say that WoS outperforms Scopus in speed of indexing? Well, at least for this set of journals it is. But what really matters is not speed (I can also press the buttons at 200 symbols a minute), but accuracy, right?

TEST 2. PRESENCE OF FUNDING INFORMATION

Let’s compare how accurately Scopus and WoS collect the funding texts (i.e. the original acknowledgements sections).

sc_x <- sc.fund %>%
  select(src.title, ISSN, eid, fun.text) %>% unique() %>% 
  mutate(status=ifelse(is.na(fun.text)==FALSE, "present", "missing")) %>%
  add_count(src.title) %>% 
  group_by(src.title, ISSN, status, n) %>% 
  summarize(pubs=n_distinct(eid)) %>% ungroup()

wos_x <- wos.fund %>%
  select(src.title, ISSN, ut, fun.text) %>% unique() %>% 
  mutate(status=ifelse(is.na(fun.text)==FALSE, "present", "missing")) %>%
  add_count(src.title) %>% 
  group_by(src.title, ISSN, status, n) %>% 
  summarize(pubs=n_distinct(ut)) %>% ungroup()

x_x <- rbind(wos_x %>% mutate(source="WoS"),
             sc_x%>% mutate(source="Scopus")) %>% 
  mutate(label2=scales::percent(pubs/n, accuracy=1)) %>% 
  mutate(label=ifelse(nchar(src.title)<64,
                      str_wrap(src.title,25),
                      str_wrap(paste0(substr(src.title, 1,64),"..."),25)))

x_x %>% 
  ggplot(aes(x=reorder(ISSN,n), y=pubs, fill=status))+
  geom_col()+
  geom_text(inherit.aes = FALSE, data=x_x[x_x$status=="present",],
            aes(x=reorder(ISSN,n), y=n+2, 
                label=scales::percent(pubs/n, accuracy=1)),
            size=2.7, fontface="bold", hjust=0)+
  facet_wrap(~source, ncol=3)+
  scale_y_continuous(expand = expand_scale(mult=c(0,0.15)))+
  coord_flip()+
  scale_fill_manual(values=c("#fc8d62", "#66c2a5"), name="SOURCE")+
  labs(title="PRESENCE OF FUNDING INFO",
       subtitle="PUBYEAR: 2018-2019",
       caption="Accessed: July 17, 2019", y="NUMBER OF PUBLICATIONS", 
       x="ISSN")+
  mytheme+
  ggsave(paste0(dir, "/wos_scopus_funding_long.png"), 
         width=20, height = 10, units="cm", dpi=300)

The WoS dataset contained a higher proportion of publications with funding information. Let’s also draw the chart with the ratios.

x_x %>%
  ggplot(aes(x=source, y=pubs, fill=status))+
  geom_bar(position="fill", stat="identity")+
  geom_text(inherit.aes = TRUE, 
            data=x_x %>% filter(status=="present"),
            aes(x=source, y=1.1, label=label2),
            family="PT Sans Narrow",size=4, fontface="bold", vjust=0, color="#0072B2")+
  facet_wrap(~label, ncol=4, strip.position = "bottom")+
  scale_y_continuous(breaks=c(0,0.25,0.5,0.75,1),
                     labels=percent_format(),
                     expand = expand_scale(mult=c(0,0.2)))+
  scale_fill_manual(values=c("#fc8d62", "#66c2a5"), name="")+
  labs(title="PRESENCE OF FUNDING INFO",
       subtitle="PUBYEAR: 2018-2019",
       caption="Accessed: July 17, 2019",
       y="SHARE OF PUBLICATIONS WITH FUNDING INFO", x=NULL)+
  guides(fill=guide_legend(title.position = "left", 
                           title.hjust = 1, title.vjust = 0.5, 
                           label.position = "right", 
                           label.hjust = 1, label.vjust = 0.5))+
  mytheme+
  theme(legend.position = "bottom",
        legend.justification = c("right", "top"),
        axis.text.x = element_text(angle=0, hjust=0.5, size=rel(1)),
        strip.text = element_text(size=rel(0.85)),
        strip.background = element_rect(fill="lightyellow", color=NA))+
  ggsave(paste0(dir, "/wos_scopus_funds_long_2.png"),
         width=20, height = 15, units="cm", dpi=300)

It seems that WoS not only indexes faster, but also captures more funding texts. For only 5 titles did WoS and Scopus have comparable (+/- 1%) proportions of publications with funding information, and for only 1 journal (Russian Mathematical Survey) did Scopus outperform WoS (90% against 67%). The final ratio of 10:5:1 again favoured WoS.

Given such an advantage, the total ratio for the 16 journals may look surprisingly close to even - 51.3% (WoS) vs. 50.0% (Scopus) - but the numbers have their own truth.

The ratios above may look interesting, of course, but we can’t draw any conclusions from them, as the databases contained different numbers of publications. For example, the Scopus dataset comprised 42 publications of the “Russian Mathematical Survey” with the funding text present in 90% of them, while Web of Science had 60 publications with a 67% share. To benchmark the databases we have to select only the publications present in both datasets.

TEST 3. SENSITIVITY FOR FUNDING TEXTS

First, we need to create a list of the DOIs present in both datasets (it’s not a list, of course, but a vector). Second, we will use this list of DOIs to filter the WoS and Scopus datasets. Third, we will count (a) the number of publications having a non-empty “Funding Text” field, and (b) the share of publications containing award numbers that match the pattern dd-dd-ddddd (where d stands for a digit). This pattern is used by the 2 largest Russian funding agencies - the Russian Foundation for Basic Research (RFBR) and the Russian Science Foundation (RSF) - so there should be plenty of matching publications.
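Before filtering, here is a quick check of the award-number pattern on a few made-up strings (the look-arounds ensure that the match is not part of a longer digit sequence):

str_detect(c("grant 18-05-00001", "project 118-05-00001x", "N 19-29-12345"),
           "(?<!\\d)\\d{2}-\\d{2}-\\d{5}(?!\\d)")
## [1]  TRUE FALSE  TRUE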

dois <- intersect(sc.fund$DOI, wos.fund$DOI) %>% na.omit()

# actually the pattern is a bit more complex, we expect that 
# the symbols surrounding dd-dd-ddddd are not the digits.

sc_z <- sc.fund %>%  
  filter(DOI %in% dois) %>% 
  select(DOI, fun.text) %>% unique() %>% 
  mutate(Source="Scopus",
         fun.presence=!is.na(fun.text),
         rus.pattern = 
           str_detect(fun.text, "(?<!\\d)\\d{2}-\\d{2}-\\d{5}(?!\\d)")) %>% 
  group_by(Source, DOI) %>% 
  summarize(fun.text = paste0(fun.text, collapse=" "),
            fun.presence = sum(fun.presence, na.rm = T),
            rus.pattern =  sum(rus.pattern, na.rm = T)) %>%
  ungroup()  

wos_z <- wos.fund %>%  
  filter(DOI %in% dois) %>% 
  select(DOI, fun.text) %>% unique() %>% 
  mutate(Source="WoS",
         fun.presence=!is.na(fun.text),
         rus.pattern = 
           str_detect(fun.text, 
                      "(?<!\\d)\\d{2}-\\d{2}-\\d{5}(?!\\d)")) %>% 
  group_by(Source, DOI) %>% 
  summarize(fun.text = paste0(fun.text, collapse="||"),
            fun.presence = sum(fun.presence, na.rm = T),
            rus.pattern =  sum(rus.pattern, na.rm = T)) %>%
  ungroup()   

# now we can join the parts and calculate the ratios
rbind(sc_z, wos_z) %>% 
  group_by(Source) %>% 
  summarize(Publications=n(),
            `with_Funding_Text` = sum(fun.presence),
            `with_dd-dd-ddddd_pattern` =  sum(rus.pattern)) %>% 
  mutate(share_Funding_Text = 
           percent(`with_Funding_Text`/Publications, accuracy=1),
         share_pattern = 
           percent(`with_dd-dd-ddddd_pattern`/Publications, accuracy=1)) %>%
  select(1,2,3,5,4,6) %>% 
  datatable(escape=FALSE, rownames = FALSE, options = list(dom = 't')) %>%
  formatStyle(c(2:6), `text-align` = "center")

The table shows that, for these 1819 publications, the WoS records included 5 percentage points more publications with a non-empty “Funding Text” field than the corresponding Scopus records (56% vs. 51%), and 5 percentage points more publications with award numbers matching the pattern in question (36% vs. 31%). At least for the selected 16 journals, WoS finds more funding texts.

TEST 4. SENSITIVITY FOR FUNDER NAMES

Now it’s time to analyze the funding details. Let me elaborate a bit on why I think the quality of information in this field deserves special attention. When a researcher wants to find all the publications supported by Funder XYZ, the search query will most probably address both the “funding text” and “funding details” fields, and the search will retrieve all the documents containing the string “Funder XYZ”. Now let’s imagine that Clarivate or Elsevier decide to produce a large analytical report about global research funding. Which field are they likely to use - the Funding Text, containing irregular strings, or the Funding Details with granular funder names and award numbers? One may argue that these fellows have their reputations at stake, so for the report they will recalculate the data a few more times to ensure the best possible quality. Well, it may be so. But there are many other, more frightening possibilities, e.g. that these companies decide to incorporate the funding data into their analytical solutions (SciVal, InCites). Will those be built on the funding texts or on the funding items? Many more reports will be produced based on these data and used for who knows which decisions. Apologies if this is pretty obvious.

In order to benchmark the specificity of WoS and Scopus for the funder names, we need to count the number of recognized funder names for each publication (matched by DOI) and compare the results.

The Funding details in the WoS dataset (the FU field in the original WoS file) have the following structure: funder.name [award1, award2]. The data is consistent: e.g. if the field contains just an award number without a funder name, the number is still put in square brackets. So the funding items can easily be separated, but the awards also need to be unnested.
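A toy parse of one hypothetical FU value (illustrative funder and award numbers), using the same regular expressions as in the code below:

fu <- "Russian Science Foundation [18-17-00184, 19-05-00001]"
str_extract(fu, ".+(?= \\[)")
## [1] "Russian Science Foundation"
strsplit(str_extract(fu, "(?<=\\[).+(?=\\])"), split="\\,\\s")[[1]]
## [1] "18-17-00184" "19-05-00001"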

The Funding details in the Scopus dataset are structured like funder.name: award1, award2, but the items are not always as consistent as those in WoS. There are strings like “RFBR” and “183300576 mol_a” without any mark of whether they are a funder name or an award number. We qualify such strings in a simple way: if a string without a colon contains a digit, we consider it an award number; if it consists only of letters and other non-digits, we count it as a funder name.
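A toy illustration of this rule on three hypothetical strings (the actual code below implements the same logic by inserting a colon before or after the string):

x <- c("Russian Science Foundation: 18-17-00184", "RFBR", "183300576 mol_a")
ifelse(grepl(":\\s", x), "funder + awards",
       ifelse(grepl("[0-9]", x), "award number", "funder name"))
## [1] "funder + awards" "funder name"     "award number"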

wos.names <- wos.fund %>%
  filter(DOI %in% dois) %>% 
  select(src.title, ISSN, DOI, fun.details) %>% unique() %>% 
  mutate(funder.name = 
           sapply(str_extract(fun.details, ".+(?= \\[)"), 
                  function(x) toupper(unlist(x)))) %>% 
  mutate(funder.number = 
           sapply(str_extract(fun.details, "(?<=\\[).+(?=\\])"), 
                  function(x) unlist(x))) %>%
  mutate(funder.number = strsplit(funder.number, split="\\,\\s")) %>% 
  unnest(funder.number) 

sc.names <- sc.fund %>%
  filter(DOI %in% dois) %>% 
  select(src.title, ISSN, DOI, fun.details) %>% unique() %>% 
  # qualifying the strings without colon
  mutate(xx = ifelse(grepl(":\\s",fun.details),
                     fun.details,
                     # we add a colon before or after
                     ifelse(grepl("[0-9]",fun.details),
                            paste0(": ",fun.details),
                            paste0(fun.details,": ")))) %>% 
  # cleaning
  mutate(xx=ifelse(xx=="NA: ",NA,xx)) %>% 
  # separating the funder names and award numbers
  mutate(funder.name = 
           sapply(str_extract(xx, ".+(?=\\:\\s)"), 
                  function(x) toupper(unlist(x)))) %>% 
  mutate(funder.number = 
             sapply(str_extract(xx, "(?<=\\:\\s).+"), 
                    function(x) unlist(x))) %>%  
  # unnesting the award numbers
  mutate(funder.number = strsplit(funder.number, split="\\,\\s")) %>% 
  unnest(funder.number)

# summarizing - counting a number of {funder names} 

sc_n <- sc.names %>% group_by(src.title, ISSN, DOI) %>% 
  summarize(n=sum(!is.na(funder.name))) %>% ungroup() 

wos_n <- wos.names %>% group_by(src.title, ISSN, DOI) %>% 
  summarize(n=sum(!is.na(funder.name))) %>% ungroup()

nn <- sc_n %>% 
  left_join(wos_n, 
            by=c("src.title", "ISSN", "DOI"), 
            suffix = c(".Scopus", ".WoS")) %>% 
  mutate(status = 
           case_when(
            (n.Scopus>0&n.WoS==0) ~ "a",
            (n.Scopus>0&n.WoS<n.Scopus) ~ "b",
            (n.Scopus==n.WoS&n.WoS>0) ~ "c",
            (n.Scopus==n.WoS&n.WoS==0) ~ "c0",
            (n.Scopus>0&n.WoS>n.Scopus) ~ "d",
            (n.Scopus==0&n.WoS>0) ~ "e"))

nn %>% count(status) %>%
  mutate(n2=n/1819, label=percent(n2, accuracy=1)) %>% 
  ggplot() + 
  geom_bar(aes(x = status, y = n, fill=status),
           position = "dodge", stat="identity", size=0.25, color = "black")+
  geom_text(aes(x = status, y = n+10, label=label),
            fontface="bold", hjust=0)+
  coord_flip()+
  scale_y_continuous(expand = expand_scale(mult= c(0,0.2))) +
  scale_fill_manual(name=NULL,
                    values=c("#b35806", "#fdb863",
                             "#66bd63","#f7f7f7",
                             "#9e9ac8", "#54278f"),
                    labels=c("WoS lost all", "WoS lost some",
                              "WoS = Scopus", "No Funding Info",
                               "Scopus lost some", "Scopus lost all"))+
  labs(title = "HOW SPECIFIC ARE WOS AND SCOPUS TO FUNDER NAMES",
       subtitle = "16 JOURNALS, 1819 DOIS, PUBYEAR: 2018-2019",
       x = "", y = "PUBLICATIONS (COUNT)", 
       caption="Accessed: July 17, 2019") +
  guides(fill = guide_legend(reverse = TRUE))+
  mytheme+
  theme(legend.position = "bottom",
        legend.justification = c("right", "top"),
        strip.text = element_text(size=rel(0.6), hjust=0),
        strip.background = element_rect(fill="lightyellow", color=NA),
        panel.grid = element_blank(), 
        axis.text.y = element_blank(),
        axis.ticks.y = element_blank()) +
  ggsave(paste0(dir, "/wos_scopus_fund_details.png"), 
         width=20, height = 10, units="cm", dpi=300)

The chart shows that, in the array of 1819 publications, 46% had no funding details in either database and 32% had an equal (non-zero) number of {funder name - award} pairs in the funding details field. WoS identified more funding items than Scopus for 12% of publications (in 7% of all cases the corresponding Scopus fields lacked any items). WoS identified fewer funding items than Scopus for 10% of publications (in 3% of all cases Scopus found some items while WoS found nothing).

The cases where one database lacks any funding items while the other has some are easy to explain - we have already seen that WoS finds more funding texts than Scopus. This can be a result of ambiguous layouts and unclear language that prevent identification of the funding text.

The less trivial cases, in my opinion, are those where both databases have some funding items, but one identified more than the other. We can see that in such cases Scopus found a bit more than WoS - 7% vs. 5%. Is it a sign of a higher specificity of Scopus for the funder names & award numbers (i.e. an ability to distinguish them from plain text)? Or did Scopus create some extra (duplicated, fake, you name it) items?

Let me give you an example. I already mentioned 2 Russian funders - the Russian Foundation for Basic Research (RFBR) and the Russian Science Foundation (RSF). What if a publication contains the funder name in a non-conventional form, say, Russian Fund of Fundamental Investigations, which is a direct translation of RFBR’s Russian title? Unless this name variant has previously been added to the funder’s profile (i.e. associated with RFBR), automatic name recognition will fail to identify RFBR.

TEST 5. SPECIFICITY OF NAME RECOGNITION

We are going to extract the funder names associated with award numbers matching the pattern dd-dd-ddddd. Then we will use the individual award numbers to match the Scopus and WoS funder names and look at their diversity.

sc_r <- sc.names %>% 
  mutate(rus.pattern = str_detect(funder.number, 
                      "(?<!\\d)\\d{2}-\\d{2}-\\d{5}(?!\\d)")) %>%
  filter(rus.pattern==TRUE)  %>% 
  mutate(rus.award = sapply(
    str_extract(funder.number, "(?<!\\d)\\d{2}-\\d{2}-\\d{5}(?!\\d)"),
                function(x) unlist(x))) %>%
  mutate(funder.name=ifelse(is.na(funder.name),
                            "-----", funder.name))

wos_r <- wos.names %>% 
  mutate(rus.pattern = str_detect(funder.number, 
                      "(?<!\\d)\\d{2}-\\d{2}-\\d{5}(?!\\d)")) %>%
  filter(rus.pattern==TRUE)  %>% 
  mutate(rus.award = sapply(
    str_extract(funder.number, "(?<!\\d)\\d{2}-\\d{2}-\\d{5}(?!\\d)"),
                function(x) unlist(x))) %>%
  mutate(funder.name=ifelse(is.na(funder.name),
                            "-----", funder.name))

edges <- sc_r %>% 
  select(Scopus.name=funder.name, rus.award) %>% 
  full_join(wos_r %>% select(WoS.name=funder.name, rus.award)) %>% 
  select(1,3) %>%  na.omit() %>% 
  count(Scopus.name, WoS.name, name = "value")

write_excel_csv(edges, "rfbr_rsf_names.csv")
read_csv("rfbr_rsf_names.csv") %>% 
  arrange(desc(value)) %>% 
  setNames(c("Scopus.name", "WoS.name", "Number of Items")) %>% 
# JSPS (JAPAN SOCIETY FOR PROMOTION OF SCIENCE) is present 
# in the table, since their award numbers match the pattern. 
# Let's forget about them.
  filter(!grepl("JSPS|JAPAN", Scopus.name)) %>% 
  filter(!grepl("JSPS|JAPAN", WoS.name)) %>% 
  datatable(escape=FALSE,
          rownames = FALSE, filter = 'top', 
          options = list(pageLength = 10,
                         lengthMenu = c(5, 10, 20),
                         autoWidth = TRUE)) %>% 
  formatStyle(3, `text-align` = "center")

Russian Foundation for Basic Research is associated with the following text strings:

Russian Science Foundation is recognized in the following names:

Both Scopus and WoS contained unique name variants that were present in only one of the 2 databases, i.e. the other one did not recognize the text string as a funder name.

There were a few award numbers that in one database were associated with conventional funder names and lacked any association in the other.

There were 14 name variants in Scopus and 21 in WoS, suggesting that Scopus treats the names more creatively (aggressively?). For instance, where a record in WoS referred to “RSF PROJECT”, the corresponding record in Scopus contained “RUSSIAN SCIENCE FOUNDATION”. The drawback of such creativity (of the Scopus algorithms) was the errors emerging either from incorrect name associations or from attempts to decipher the abbreviations. Some errors emerged from inaccurate splitting of the complex funding texts:

A few other examples:

TEST 6. MISSING AWARDS

Here we compare the presence of unique {DOI - award number} pairs, where the award numbers match the dd-dd-ddddd pattern, in each database.

awards <- rbind(sc_r %>% select(DOI, rus.award),
                wos_r %>% select(DOI, rus.award)) %>% 
  unique() %>% 
  mutate(code=paste0(DOI, rus.award),
         scopus=code %in% paste0(sc_r$DOI, sc_r$rus.award),
         wos=code %in% paste0(wos_r$DOI, wos_r$rus.award)) %>% 
  mutate(ccc=
           case_when(scopus==TRUE & wos==TRUE ~ "exist in both databases",
                     scopus==FALSE & wos==TRUE ~ "exist only in WoS",
                     scopus==TRUE & wos==FALSE ~ "exist only in Scopus")) %>% 
  group_by(ccc) %>% 
  summarize(n=n()) %>% 
  mutate(label=paste0(percent(n/sum(.$n), accuracy=1)," ",ccc))

awards %>% ggplot()+
  geom_bar(aes(x = "1", y = n, fill=label),
           position = "fill", stat="identity", size=0.25, color = "black")+
  coord_flip()+
  scale_x_discrete(expand = expand_scale(mult= c(0,0)))+
  scale_y_continuous(labels=percent_format(accuracy=1), 
                     expand = expand_scale(mult= c(0,0)))+
  scale_fill_manual(name=NULL,
                    values=c("#b35806","#54278f", "#66bd63"))+
  labs(title = "GRANT AWARDS PRESENT IN WOS AND SCOPUS",
       subtitle = "16 JOURNALS, 1819 DOIS, PUBYEAR: 2018-2019",
       x = "", y = "{DOI - AWARD} PAIRS", 
       caption="Accessed: July 17, 2019") +
  guides(fill = guide_legend(reverse = TRUE))+
  mytheme+
  theme(
    plot.margin = margin(5,15,0,5),
    legend.position = "right",
        legend.justification = c("right", "top"),
        panel.grid = element_blank(), 
        axis.text.y = element_blank(),
        axis.ticks.y = element_blank()) +
  ggsave(paste0(dir, "/wos_scopus_fund_awards.png"), 
         width=20, height = 5, units="cm", dpi=300)

WoS captured more unique award numbers than Scopus.

LIMITATIONS

This report did not evaluate the completeness of the funding data. We did not scan the original full texts to evaluate how many acknowledgement sections were not captured by either index. We can only speculate that both databases may fail to capture funding information written in Russian. This could explain the low share of articles with funding information in journals like ZOOLOGICHESKII ZHURNAL.
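One possible quick check (a sketch only, not performed here) would be to count how many of the captured funding texts contain Cyrillic characters - if virtually none do, it would support the suspicion that Russian-language acknowledgements are rarely captured:

# share of captured funding texts containing Cyrillic characters (rough proxy)
mean(str_detect(na.omit(unique(wos.fund$fun.text)), "[\u0400-\u04FF]"))
mean(str_detect(na.omit(unique(sc.fund$fun.text)), "[\u0400-\u04FF]"))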

The observed ratios are valid for the 16 journals that were selected (in another study) to represent journals of Russian origin from various publishers and different subject areas. The results do not allow wider conclusions about the ratios in the WoS and Scopus databases as a whole.

CONCLUSIONS

  1. WoS indexed the journals’ content faster than Scopus. The total number of 2018-2019 publications in the 16 selected journals was 14% higher in WoS than in Scopus. The {WoS:Scopus} outperformance score was 11:1 (for the remaining 4 journals the databases contained equal numbers of documents).

  2. WoS captured more funding texts than Scopus. The total shares of publications with funding information across all 16 journals were comparable - 51.3% (WoS) vs. 50% (Scopus). When the datasets were limited to the set of 1819 unique DOIs present in both databases, the difference became larger - 56% (WoS) vs. 51% (Scopus). The shares of publications containing award numbers matching the selected pattern also favoured WoS: 36% vs. 31% (Scopus).

  3. Both indices failed to recognize all the funder names and award numbers (funding items), but Scopus failed a bit more often in our experiment: in the set of 1819 publications present in both Scopus and WoS, 12% had more funding items in WoS and 10% had more items in Scopus. WoS lost all the funding items for 3% of publications (i.e. the Scopus records for those publications had some funding details), while Scopus did so in 7% of cases.

  4. WoS appeared to treat the funder names in a more conservative and accurate way than Scopus, which resulted in a larger number of funder name variants in WoS. Apparently, Scopus merged the name variants and deciphered the abbreviations, which led to mistakes.

  5. In the set of 1819 publications (2018-2019) from 16 journals, WoS recognized 16% more {DOI - award} pairs than Scopus (here an award means a grant award whose number matches the {dd-dd-ddddd} pattern, where {d} is a digit).

Though this study has limited practical application, it offers a model and some code for assessing the funding information in research papers.

ACKNOWLEDGEMENTS

I am grateful to the researcher (who wishes to remain anonymous) for providing the WoS data for this study.

CITATION

Lutay, A. (2019, July 26). Analysis of funding information present in Scopus and Web of Science - a case of 16 journals of Russian origin (Version 1). figshare. <https://doi.org/10.6084/m9.figshare.9109394>

CONTACTS

Twitter

Figshare

REFERENCES

Hadley Wickham (2017). tidyverse: Easily Install and Load the ‘Tidyverse’. R package version 1.2.1. https://CRAN.R-project.org/package=tidyverse

Hadley Wickham (2018). scales: Scale Functions for Visualization. R package version 1.0.0. https://CRAN.R-project.org/package=scales

Winston Chang (2014). extrafont: Tools for using fonts. R package version 0.17. https://CRAN.R-project.org/package=extrafont

Yihui Xie (2019). knitr: A General-Purpose Package for Dynamic Report Generation in R. R package version 1.23.

Yihui Xie, Joe Cheng and Xianying Tan (2019). DT: A Wrapper of the JavaScript Library ‘DataTables’. R package version 0.7. https://CRAN.R-project.org/package=DT