library(tidyverse)
library(knitr)
library(aRxiv)
library(kableExtra)
The arXiv website is one of the top electronic repositories for academic research papers in multiple fields: computer science, physics, mathematics, and statistics. Researchers submit papers electronically, and each submission is catalogued in the arXiv database. By analyzing how actively a researcher submits papers on data science topics to arXiv, we obtain an objective, quantifiable, and relevant measure of their thought leadership.
In the next section, we describe the data collection process, its limitations, and its outputs. After showing how the raw files are processed, we wrangle the consolidated data into a usable form. We then present rankings of the top leaders along with descriptive statistics of the papers, and conclude with some interpretive remarks.
We obtain detailed research paper submission data through the R package aRxiv, which retrieves metadata from the arXiv API. The API allows us to query papers by useful criteria such as submission date range and subject classification. The subject classifications used below are native arXiv categories.
Our 7 subject classification categories are the same as those chosen by the AI Index 2018 report (AIIndex.org) in its data collection methodology. Because we discuss the arXiv API at length in another section of this project (on time trends in research), we give only a brief summary here.
In this section, we identify the specific steps relevant to counting papers per author. Obtaining the authors and titles of papers requires downloading the full record of each paper submission. Downloading this raw data took about an hour, through a series of trial-and-error batch scripts, because of two issues: the API returns unreliable results for very large queries, and the arXiv server throttles frequent requests. By defining granular queries, one per category and year, we limited most API requests to under 5,000 records and successfully gathered all paper records, as the short query sketch below illustrates.
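To illustrate the query format, here is a minimal sketch of a single category-and-year count request using aRxiv's arxiv_count (the category and year are examples; the full download loop appears later in this section):
# Count stat.ML submissions in a one-year window; the query string mirrors
# the one built in the download loop below.
# -----------------------------------------------------------------------------
example_qry = paste0("cat:", "stat.ML", " AND submittedDate:[", 2015, " TO ", 2016, "]")
arxiv_count(example_qry)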
The downloads yielded 70 raw files, one for each category-year pair. We combined them into a single file in two steps: first we aggregated all years into one file per category, and then we combined the seven category files into a single consolidated file.
The big file had 7 columns:

* id: unique identifier of the article
* submitted: date/time of submission
* updated: date/time of the last revision submitted
* title: name of the paper
* authors: pipe-delimited list of co-authors of the paper
* primary_category: the category used for the query
* categories: pipe-delimited list of alternate categories
The most important step in producing a single consolidated records file was eliminating one unnecessary field: the abstract. Each paper's record includes its abstract, which for most papers accounts for about 90 percent of the record size. Given the large file sizes, dropping the abstract was necessary for all records to fit into memory on our PC.
The final result is a flat file with 57193 records called output_all_subjects.csv.
The code below downloads the required data and was described in the previous section. Because the arXiv server API may produce variable results or throttle access, we show the code but do not run it; this is controlled by setting eval=FALSE in the relevant code chunks.
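For reference, the option is set in the chunk header of the R Markdown source; a minimal sketch (the chunk label download_arxiv_data is illustrative, not necessarily the label used in our source):
```{r download_arxiv_data, eval=FALSE}
# code in this chunk is shown in the rendered report but never executed
```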
# A list of data science categories and years to query
# -----------------------------------------------------------------------------
ds_categories   = c("stat.ML", "cs.AI", "cs.CV", "cs.LG", "cs.RO", "cs.CL", "cs.NE")
ds_descriptions = c("Stat Machine Learning", "Artificial Intelligence", "Computer Vision",
                    "Machine Learning", "Robotics", "Computation and Language",
                    "Neural and Evolutionary Computing")
subject_names   = data.frame(categories = ds_categories,
                             desc = ds_descriptions, stringsAsFactors = FALSE)
years_list      = c(2009:2018)

# Set up an empty dataframe with one row per year and one column per data
# science topic. Values will store paper counts in arXiv by year and topic.
# -----------------------------------------------------------------------------
info = data.frame(matrix(data = 0,
                         nrow = length(years_list),
                         ncol = length(ds_categories),
                         dimnames = list(as.character(years_list), ds_categories)))

# Query the arXiv server for paper counts and full records:
#   Outer loop is over subjects.
#   Inner loop is over years.
# -----------------------------------------------------------------------------
for (subject in subject_names$categories)
{
  for (y in years_list)
  {
    qry = paste0("cat:", subject, " AND submittedDate:[",
                 y,
                 " TO ",
                 y + 1, "]")
    qry_count   = arxiv_count(qry)
    qry_details = arxiv_search(qry, batchsize = 100, limit = 11000, start = 0)
    info[as.character(y), subject] = qry_count
    print(paste(qry, " ", qry_count, "\n"))
    output_filename = paste0(subject, "_", y, "_", "results.csv")
    write.csv(qry_details, file = output_filename)
    print(paste0("Wrote file: ", output_filename, " ", Sys.time()))
    # Sleeping is essential to throttle API load on the arXiv server.
    # ---------------------------------------------------------------------------
    Sys.sleep(5)
  }
}
print("Retrieval of arXiv query records is now completed.")
# Combine the ten yearly files for each subject into one subject-level file.
# -----------------------------------------------------------------------------
for (j in seq_along(ds_categories))
{
  subject  = ds_categories[j]
  outputdf = list()
  my_files = paste0(subject, "_", years_list, "_", "results.csv")
  for (i in seq_along(my_files))
  {
    fulldata <- read_csv(file = my_files[i])
    print(paste0("Loaded ", i, " ", my_files[i]))
    # Strip out the abstract, which takes up most of the file space.
    # ---------------------------------------------------------------------------
    tempdata <- fulldata %>%
      select(id, submitted, updated, title, authors, primary_category, categories)
    outputdf[[i]] = tempdata
    Sys.sleep(1)
  }
  # Bind the year files for one subject into one tibble and then
  # dump it to a subject-specific file.
  # -----------------------------------------------------------------------------
  big_data = bind_rows(outputdf)
  output_big_subject = paste0("bigdata_", subject, ".csv")
  write_csv(big_data, output_big_subject)
  print(paste0("Wrote file ", output_big_subject, " to disk ", Sys.time()))
}
my_files = paste0("bigdata_", ds_categories, ".csv")
outputdf = list()
for (j in seq_along(my_files))
{
  fulldata <- read_csv(file = my_files[j])
  print(paste0("Loaded ", j, " ", my_files[j]))
  outputdf[[j]] = fulldata
  Sys.sleep(1)
}
# Row-bind the list of dataframes into one big tibble with a dplyr one-liner.
# -----------------------------------------------------------------------------
big_data = bind_rows(outputdf)
output_all_subjects = "output_all_subjects.csv"
write_csv(big_data, output_all_subjects)
print(paste0("Wrote file ", output_all_subjects, " to disk ", Sys.time()))
The entire analysis in this section depends only on loading the consolidated records file in the next code chunk. We illustrate its contents with a few records below.
big_paper_set = read_csv("output_all_subjects.csv")
knitr::kable(head(big_paper_set, 4),
             caption = "Representative Records from the Paper Data Set") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
| id | submitted | updated | title | authors | primary_category | categories |
|---|---|---|---|---|---|---|
| 0901.0356v1 | 2009-01-05 06:37:01 | 2009-01-05 06:37:01 | Information, Divergence and Risk for Binary Experiments | Mark D. Reid|Robert C. Williamson | stat.ML | stat.ML|math.ST|stat.TH |
| 0901.1365v1 | 2009-01-10 12:12:31 | 2009-01-10 12:12:31 | Differential Privacy with Compression | Shuheng Zhou|Katrina Ligett|Larry Wasserman | stat.ML | stat.ML|math.ST|stat.TH |
| 0901.1504v2 | 2009-01-12 05:02:18 | 2009-10-13 03:01:23 | A D.C. Programming Approach to the Sparse Generalized Eigenvalue Problem | Bharath Sriperumbudur|David Torres|Gert Lanckriet | stat.ML | stat.ML|stat.ME |
| 0901.2044v2 | 2009-01-14 15:34:13 | 2010-10-21 13:30:14 | SPADES and mixture models | Florentina Bunea|Alexandre B. Tsybakov|Marten H. Wegkamp|Adrian Barbu | math.ST | math.ST|stat.ML|stat.TH |
Next, we remove duplicate records from the raw data set. Duplicates arise because a paper may be cross-listed under two or more of our seven subject categories and is therefore retrieved by more than one query; for example, a paper may fall under both Statistical Machine Learning (stat.ML) and Computer Vision (cs.CV). De-duplication removes roughly 12,000 records.
# Remove duplicate records, keeping all other columns (.keep_all = TRUE).
# -----------------------------------------------------------------------------------
big_paper_clean <- ( big_paper_set %>% distinct( id, authors, .keep_all = TRUE))
nrow(big_paper_clean)
## [1] 45150
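As a quick arithmetic check of the figure quoted above, the raw set has 57193 rows and the de-duplicated set has 45150, a difference of 12043 records; this can be confirmed directly:
# Number of duplicate rows removed by distinct().
nrow(big_paper_set) - nrow(big_paper_clean)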
paper_authors = big_paper_clean$authors
author_names = str_split(paper_authors, "\\|") # separates all the authors
# After unlisting, the co-authors of each paper appear consecutively,
# preceded by the authors of all earlier papers.
# ---------------------------------------------------------------------
authors_unlisted = unlist(author_names)
num_author_paper_tuple = length(authors_unlisted)
# Index j corresponds to the j-th paper in big_paper_clean
# Value at index j corresponds to the number of co-authors in paper j
# ----------------------------------------------------------------------
vec_coauthor_counts = unlist( lapply(author_names, length ) )
paper_author_map = tibble( id = character(num_author_paper_tuple), author = character(num_author_paper_tuple) )
idx_unlisted = 0
The following code chunk maps papers to authors in a one-to-many relationship. Because the mapping loop is slow (it takes over 10 minutes to run), we save the results to a flat file and set eval=FALSE; in the next step, the data is reloaded from that file into a dataframe for analysis.
for (id_idx in seq_along(big_paper_clean$id))
{
  num_coauthors = vec_coauthor_counts[id_idx]
  for (s in 1:num_coauthors)
  {
    paper_author_map$id[idx_unlisted + s]     = big_paper_clean$id[id_idx]
    paper_author_map$author[idx_unlisted + s] = authors_unlisted[idx_unlisted + s]
  }
  idx_unlisted = idx_unlisted + num_coauthors
  if (id_idx %% 100 == 0)
  {
    print(paste0(" idx = ", id_idx))
  }
}
write_csv(paper_author_map, "paper_author_map.csv")
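As an aside, the same id-to-author mapping can be built without the explicit loop. The sketch below is not the method used above, but it should produce an equivalent table in a few seconds using tidyr::separate_rows (tidyr is loaded as part of the tidyverse):
# Vectorized alternative: expand the pipe-delimited authors column so that
# each (paper id, author) pair becomes its own row.
# -----------------------------------------------------------------------------
paper_author_map_fast <- big_paper_clean %>%
  separate_rows(authors, sep = "\\|") %>%
  select(id, author = authors)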
paper_author_map = read_csv("paper_author_map.csv")
by_author <- group_by( paper_author_map , author )
rankings <- summarize( by_author, numPapers = n() ) %>% arrange( desc( numPapers))
knitr::kable(head(rankings, 30) , caption = "Top 30 Authors by Data Science Paper Counts (2009-2018)")
| author | numPapers |
|---|---|
| Yoshua Bengio | 174 |
| Sergey Levine | 105 |
| Pieter Abbeel | 100 |
| Michael I. Jordan | 98 |
| Uwe Aickelin | 97 |
| Chunhua Shen | 89 |
| Francis Bach | 80 |
| Kyunghyun Cho | 78 |
| Toby Walsh | 78 |
| Zoubin Ghahramani | 74 |
| Masashi Sugiyama | 73 |
| Eric P. Xing | 72 |
| Shie Mannor | 72 |
| Damien Chablat | 69 |
| Max Welling | 69 |
| Ruslan Salakhutdinov | 69 |
| Aaron Courville | 65 |
| Andreas Krause | 62 |
| Trevor Darrell | 62 |
| Lawrence Carin | 61 |
| Chris Dyer | 59 |
| Mita Nasipuri | 59 |
| Nathan Srebro | 59 |
| Roland Siegwart | 58 |
| Tong Zhang | 56 |
| Yann LeCun | 56 |
| Nando de Freitas | 54 |
| Bernhard Schölkopf | 53 |
| Marcus Hutter | 52 |
| Martin J. Wainwright | 52 |
summary( rankings)
## author numPapers
## Length:67625 Min. : 1.0
## Class :character 1st Qu.: 1.0
## Mode :character Median : 1.0
## Mean : 2.2
## 3rd Qu.: 2.0
## Max. :174.0
We conclude that the most influential data scientist by paper count is Yoshua Bengio with 174 papers. He is noted for his expertise in deep learning, along with Geoffrey Hinton and Yann LeCun. By comparison, other thought leaders mentioned earlier, such as Kira Radinsky, have written only 4 papers. We also see that the mean number of papers per author is 2.2, with a median of 1 paper. The distribution of papers per publishing researcher is therefore highly right-skewed.
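To see the skew more directly, we can inspect the share of one-paper authors and the upper quantiles of the papers-per-author distribution; a quick sketch using the rankings table above:
# Share of authors with exactly one paper, and the upper tail of the
# papers-per-author distribution.
# -----------------------------------------------------------------------------
mean(rankings$numPapers == 1)
quantile(rankings$numPapers, probs = c(0.50, 0.90, 0.99, 1.00))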
We conclude that thought leadership within academic research does not equate to business thought leadership. However, without the conceptual innovations made possible by academia, the application of these ideas to business would be impossible.