Intro

Elsevier has recently released 2018 CiteScore values for the journals indexed in Scopus. The title list can be found here https://www.elsevier.com/solutions/scopus/resource-library. As our post-DORA world seems to be getting cured from the impact fever, there are some countries where the governments continue to give legal power to the quartile-based decisions. Sad, but true. Russian Federation is among those countries, and many scientists seem to believe that quartile-assessment can be a right choice since the national presence in the “predatory” titles remain substantial.

Therefore, I ventured to look at the changes in the CiteScore list, focusing on how the percentiles & quartiles of the titles have changed in 2018 (compared to 2017). This is not a piece of research, of course, just an excercise - so I am not going to publish anywhere except Rpubs.

Reading the List

Scopus title list comprises of 5 sheets, we are going to open the first one with the journal metrics and the last one with ASJC classifications. Let’s start with the latter sheet that actually comprises 2 tables that we need to merge.

sheets <- readxl::excel_sheets ("D:/data/Journal_info/ext_list_May_2019.xlsx")

asjclist<-readxl::read_xlsx("D:/data/Journal_info/ext_list_May_2019.xlsx",  sheet = "ASJC classification codes")

asjc_table <- asjclist %>%
  select(1,2) %>% drop_na() %>% setNames(c("asjc","asjcname")) %>% 
  mutate(super=as.numeric(paste0(substr(asjc,1,2),"00"))) %>% 
  left_join(asjclist %>% select(4,5) %>% drop_na() %>% setNames(c("super","supercategory"))) %>% 
  transmute_all(as.character)

head(asjc_table, 5)

Now back to the first one:

data_j <- readxl::read_xlsx("D:/data/Journal_info/ext_list_May_2019.xlsx",  
                            sheet = "Scopus Sources April 2019") %>% 
  select(1,2,5,6,8,9,10,11,16, 23,24,25) %>%   # only the important columns
  setNames(c("srcid", "title", "active", "discon", 
             "lang",  "cs2016", "cs2017", "cs2018", 
             "srctype", "publisher", "country", "asjc"))
head(data_j,5)

Let’s brush it up a little:

data_j[6:8]<-lapply(data_j[6:8],  function(x) as.numeric(gsub("NA",NA, x)))
# filtering out the titles having non-NA CiteScore in 2017 or 2018  
data_j1<- data_j %>% filter(!is.na(cs2017)|!is.na(cs2017))
NROW(data_j1)
## [1] 23130

Some titles have status “Inactive” and some of them are also “Discontinued”. Let’s have a look how many are actually discontinued…

data_j1 %>%  
  filter(is.na(cs2018)==TRUE & is.na(cs2017)==FALSE) %>% 
  group_by(active, discon) %>% 
  summarize(n=n_distinct(srcid)) %>% print()
## # A tibble: 3 x 3
## # Groups:   active [2]
##   active   discon                     n
##   <chr>    <chr>                  <int>
## 1 Active   <NA>                      21
## 2 Inactive <NA>                     776
## 3 Inactive Discontinued by Scopus   130

Only 130 titles are marked as Discontinued, leaving us with a question about the reasons that made other 776 titles inactive. I assume that they just naturally withered away. As for 21 titles that are still shown as Active, got 2017 CiteScore in 2017, but somehow left without 2018 CiteScore, there is no answer.

Making the list tidy

ASJC column lists all the subject categories that the journal is related to. Some journals are registered to many categories. The ASJC codes in the field are separated mostly with “;”. Mostly. So we need to make it more regular in order to summarize further.

data_j2<- data_j1 %>% 
  # extracting 4-digit sequences and re-annealing back with paste0, separated by semicolon
  mutate(asjc= sapply(asjc, function(x) paste0(unlist(str_extract_all(x, "[0-9]{4}")), collapse=";"))) %>% 
  mutate(asjc=strsplit(asjc, split=";")) %>% 
  unnest(asjc) %>% 
  mutate(asjc=str_trim(asjc)) %>% 
  # and I add ASJC table with supercategory (just to have it here for future)
  left_join(asjc_table %>% select(asjc, asjcname, super, supercategory))

Are we ready to group_by asjc? Well, worth of checking..

data_j2 %>% filter(!asjc %in% asjc_table$asjc) %>% select(title, asjc, asjcname)

I found it funny - either Scopus have some internal (classified) subject categories, or this is someone’s mistype (one of those that disseminate via Excel spreadsheets). I am lazy to check the real assignment of these titles, let’s get rid of them and build the first plot.

data_j2 %>% group_by(super, asjc, asjcname) %>% 
  filter(asjc %in% asjc_table$asjc) %>% 
  summarize(ntitles=n_distinct(srcid)) %>% ungroup() %>%  
  mutate(sc=paste0(super,"_",asjc)) %>% 
  ggplot(aes(x=sc, y=ntitles, fill=super))+ 
  geom_bar(stat="identity", color="grey50", size=0.1)+
  labs(x="", y="Number of titles in 2017-18)",
       title="Subject Categories ASJC")+ 
  theme_classic()+
  scale_y_continuous(breaks=pretty_breaks(n=5), expand=expand_scale(add=0), 
                     labels=format_format(big.mark = " ", decimal.mark = ",", scientific = FALSE))+
  scale_fill_discrete(labels=c("MULT", "AGRI" ,"ARTS", "BIOC" ,"BUSI" ,"CENG" ,"CHEM", "COMP" ,"DECI", "DENT", "EART" ,
                               "ECON", "ENER" ,"ENGI", "ENVI", "HEAL", "IMMU", "MATE", "MATH" ,"MEDI", "NEUR", "NURS", 
                               "PHAR" ,"PHYS" ,"PSYC", "SOCI","VETE"))+
  guides(fill=guide_legend(title="", byrow = TRUE,nrow=2, override.aes = list(size = 1, color="white")))+
  mytheme+
  theme(legend.position = "bottom", 
        axis.text.x=element_blank(),
        legend.text = element_text(size=rel(0.7)),
        legend.key.size = unit(3,"mm"),
        legend.margin = margin(0,0,0,0,"mm"),
        axis.ticks.x = element_blank())

The subject categories vary by number of titles. The size matters as the larger is the subject area, the lower is the effect made by the newcomers or the discontinued journals. If the subject area lists say 10 journals, then adding/removing 2 journals can change the ranking positions more significantly, then the comparable changes would do for a larger subject area (say with 150 titles).

Now let’s measure the changes and introduce some categories:

data_j3 <- data_j2 %>% 
  filter(asjc %in% asjc_table$asjc) %>% 
  group_by(asjc) %>% 
  add_count(asjcname) %>% 
  # using percent_rank to calculate the percentile values 
  #(only for the subject areas having > 4 titles)
  mutate(cs_ppc_17=ifelse(n>3,round(percent_rank(cs2017),2),NA), 
         cs_ppc_18=ifelse(n>3,round(percent_rank(cs2018),2),NA),
         # using ntile to distribute titles into 4 quartiles, 1st is the highest
         q2017 = 5 - ntile(cs_ppc_17,4), q2018 = 5 - ntile(cs_ppc_18,4),
         # change of CiteScore PPC
         cs_ppc_change= cs_ppc_18 - cs_ppc_17,
         # change of Quartile
         q_change=q2018-q2017) %>%
  ungroup() %>% 
  # introducing some categories to describe the changes further 
  mutate(ppc_status=case_when(
    is.na(cs2017)==FALSE & is.na(cs2018)==FALSE & cs_ppc_change >= 0.1 ~ "A.subst.growth",
    is.na(cs2017)==FALSE & is.na(cs2018)==FALSE & cs_ppc_change >= 0 & cs_ppc_change <0.1 ~ "B.growth",
    is.na(cs2017)==FALSE & is.na(cs2018)==FALSE & cs_ppc_change < 0 & cs_ppc_change >=-0.1 ~ "C.decrease",
    is.na(cs2017)==FALSE & is.na(cs2018)==FALSE & cs_ppc_change < -0.1 ~ "D.subst.drop",
    is.na(cs_ppc_18)==TRUE & is.na(cs_ppc_17)==TRUE & n<4 ~ "G.small_category",
    is.na(cs2018)==TRUE & is.na(cs2017)==FALSE ~ "E.exiled",
    is.na(cs2017)==TRUE & is.na(cs2018)==FALSE ~ "F.invited")) %>% 
  mutate(q_status=case_when(q_change<0 ~ "Q_moved_up",
                            q_change==0 ~ "Q_no_change",
                            q_change>0 ~ "Q_slipped_down",
                            is.na(q_change)==TRUE ~ "Q_yet_unknown"))
head(data_j3, 5)

Quartile Changes in the Subject Areas

Now we are ready to make a first aggregating table. We are going to calculate how many titles changed in 2018 its qurtile in the subject area - some moved up, some slipped down, some stayed where they were, and some are new or discontinued, so I put them in the last category named “Q_yet_unknown”. Let’s have a quick look on the general picture.

data_j3 %>% group_by(super, asjc, asjcname, q_status, n) %>% 
  summarize(ntitles=n_distinct(srcid)) %>% ungroup() %>%  
  mutate(sc=paste0(super,"_",asjc)) %>% 
  ggplot(aes(x=sc, y=ntitles, fill=q_status))+ 
  geom_bar(stat="identity", position="fill", color="grey50", size=0.2)+
  labs(x="", y="Number of titles in 2017-18)", title="Subject Categories ASJC")+ 
  scale_y_continuous(breaks=pretty_breaks(n=5), expand = expand_scale(add=0),
                     labels=percent_format(accuracy=1))+
  scale_fill_manual(name="Quartile 17->18 change", 
                    values=c('#8dd3c7','#ffffb3','#bebada','#fb8072'), 
                    labels=c("moved up", "remained unchanged", "slipped down", "new or discontinued titles"))+
  guides(fill=guide_legend(title="", byrow = TRUE,nrow=1, override.aes = list(size = 1, color="white")))+
  mytheme+
  theme(legend.position = "bottom", 
        plot.background = element_rect(fill=NA),
        panel.grid = element_blank(),
        axis.text.x=element_blank(),
        legend.text = element_text(size=rel(0.7)),
        legend.key.size = unit(.3,"cm"),
        legend.margin = margin(0,0,0,0,"mm"),
        axis.ticks.x = element_blank())

And I use a concept of “volatility” to compare the proportion of titles that moved anywhere (into the other quartiles, into or away from the list) to proportion of titles that remain where they were. Another metric is a direction that compare the polarity of changes.

table1<-data_j3 %>% group_by(super, asjc, asjcname, q_status, n) %>% 
  summarize(ntitles=n_distinct(srcid)) %>% ungroup() %>%  
  spread(q_status, ntitles) %>% arrange(desc(n)) %>% 
  mutate(volatility = round((Q_moved_up + Q_slipped_down + Q_yet_unknown)/n,2),
         direction= round((Q_moved_up - Q_slipped_down)/n,2))

datatable(table1, escape=FALSE,
          rownames = FALSE, extensions = 'Buttons',
          options = list(pageLength = 5,
                         lengthMenu = c(5, 10, 20),
                         dom = 'Bfrtip', buttons = c('csv')
                         ))

Let’s plot top-30 categories that have the most dramatic changes in quartiles:

table1 %>% arrange(desc(volatility)) %>% drop_na(volatility) %>% head(30) %>% 
  ggplot(aes(x=reorder(asjc, volatility), y=volatility, fill=direction>0))+ 
  geom_bar(stat="identity", color="grey50", size=0.2)+
  geom_text(aes(x=reorder(asjc, volatility), y=0.01, label=asjcname), size=3, hjust=0,vjust=0.5, fontface="bold")+
  labs(x="", y="Proportion of Titles in Subject Area\nthat changed its CiteScore Quartile", 
       title="Top-30 Subject Areas with the Highest CiteScore Volatility",
       subtitle="(volatily = change of the CiteScore Quartiles)")+
  coord_flip()+
  scale_y_continuous(breaks=pretty_breaks(n=5), expand = expand_scale(add=c(0,0.01)),
                     labels=percent_format(accuracy=1))+
  scale_fill_manual(name="CHANGE", limits=c(TRUE, FALSE), 
                    values=c('coral','lightblue'), 
                    labels=c("more titles\nmoved up", "more titles\nslipped down"))+
  guides(fill=guide_legend(title="",ncol=1, label.position = "bottom",
                           override.aes = list(size = 1, color="white")))+
  theme_bw()+mytheme+
  theme(legend.position = "right", 
        axis.text.y=element_blank(),
        axis.ticks.y=element_blank(),
        legend.key=element_blank(),
        legend.key.size = unit(1, "cm"),
        axis.ticks.x = element_blank())

Apparently, some Russian researchers who publish in the Classics journals have high chances to be disappointed soon, as 50% of titles in Classics changed its quartiles in 2018.

Percentile Changes in the Subject Areas

Change of the CiteScore Percentile can be significant but not lead to the change of quartile (e.g. from 52 to 72 percentile). Therefore it is also interesting to see on the dynamics of CiteScore values within the category. As in the previous case we a volatility, here it accounts for a proportion of the titles whose CiteScore Percentile changed more than by 10 positions. Direction is again the difference between number of titles moved up and that slipped down.

table2 <- data_j3 %>% group_by(super, asjc, asjcname, ppc_status, n) %>% 
  summarize(ntitles=n_distinct(srcid)) %>% ungroup() %>%
  spread(ppc_status, ntitles, fill=0) %>% arrange(desc(n)) %>% 
  mutate(volatility = round((A.subst.growth + D.subst.drop)/n,2), 
         direction = round((A.subst.growth + B.growth - C.decrease - D.subst.drop)/n,2)) %>% 
  arrange(desc(volatility))

datatable(table2 %>% select(-super), escape=FALSE,
          rownames = FALSE, extensions = 'Buttons',
          options = list(pageLength = 5,
                         lengthMenu = c(5, 10, 20),
                         dom = 'Bfrtip', buttons = c('csv')
                         ))

Let’s plot top-30 categories that have the most dramatic changes in :

table2 %>% arrange(desc(volatility)) %>% drop_na(volatility) %>% head(30) %>% 
  ggplot(aes(x=reorder(asjc, volatility), y=volatility/100, fill=direction>0))+ 
  geom_bar(stat="identity", color="grey50", size=0.2)+
  geom_text(aes(x=reorder(asjc, volatility), y=0.0001, label=asjcname), size=3, hjust=0,vjust=0.5, fontface="bold")+
  labs(x="", y="Proportion of Titles in Subject Area\nwith CiteScore Percentile changed by 10% or more", 
       title="Top-30 Subject Areas with the Highest CiteScore Volatility",
       subtitle="(volatily = change of the CiteScore Percentile by 10%)")+
  coord_flip()+
  scale_y_continuous(breaks=pretty_breaks(n=5), expand = expand_scale(add=c(0,0.01)),
                     labels=percent_format(accuracy=1))+
  scale_fill_manual(name="CHANGE", limits=c(TRUE, FALSE), 
                    values=c('coral','lightblue'), 
                    labels=c("more titles\nmoved up", "more titles\nslipped down"))+
  guides(fill=guide_legend(title="",ncol=1, label.position = "bottom",
                           override.aes = list(size = 1, color="white")))+
  theme_bw()+mytheme+
  theme(legend.position = "right", 
        axis.text.y=element_blank(),
        axis.ticks.y=element_blank(),
        legend.key=element_blank(),
        legend.key.size = unit(1, "cm"),
        axis.ticks.x = element_blank())

And the last picture, proving that the changes look more dramatic in the small subject areas.

library(ggrepel)
table2 %>% ggplot+ 
  geom_point(aes(x=volatility, y=direction, size=n, fill=super), alpha=0.5, shape=21)+
  geom_hline(yintercept = 0)+
  geom_text_repel(inherit.aes = FALSE, size=3,
                  data=table2 %>% filter(abs(volatility)>0.45|abs(direction)>0.45), 
                  aes(x=volatility, y=direction, label= asjcname))+
  scale_x_continuous(labels=percent_format(accuracy=1))+
  scale_y_continuous(labels=percent_format(accuracy=1))+
  scale_size_continuous(range=c(1,10), name="Number of titles")+
  scale_fill_discrete(name="Subject Category",
                      labels=c("MULT", "AGRI" ,"ARTS", "BIOC" ,"BUSI" ,"CENG" ,"CHEM", "COMP" ,"DECI", "DENT", "EART" ,
                               "ECON", "ENER" ,"ENGI", "ENVI", "HEAL", "IMMU", "MATE", "MATH" ,"MEDI", "NEUR", "NURS", 
                               "PHAR" ,"PHYS" ,"PSYC", "SOCI","VETE"))+
  labs(y="Growth minus Decrease\n[Proportion from number of titles]", 
       x="Proportion of Titles in Subject Area\nwith CiteScore Percentile changed by 10% or more", 
       title="Volatility of the CiteScore Percentile in ASJC subject areas",
       subtitle="(volatily = change of the CiteScore Percentile by 10%)")+
  guides(fill=guide_legend(ncol=2, override.aes = list(size = 3, color="white")),
         size=guide_legend(ncol=1))+
  theme_bw()+mytheme+
  theme(legend.position = "right")

https://twitter.com/Waydze1