R Packages Used

We used the following packages for data analysis and visualization.

library("dplyr")
library("RCurl")
library("XML")
library("xml2")
library("jsonlite")
library("ggplot2")
library("DT")
library("kableExtra")
library("data.table")
library("tidyr")
library("lubridate")
#library("XLConnectJars")
#library("XLConnect")
library("stringr")
library("formattable")
library("aRxiv")
library("tidyverse")
library("rvest")
library("textrank")
library("lattice")
library("igraph")
library("ggraph")
library("wordcloud")
library("curl")
library("treemap")
options(scipen = 999)

Collaboration and Team Members

Alexander Ng, Arun Reddy, Henry Otuadinma, Jagdish Chhabria

Our team consists of four members. We collaborated with each other using Webex, Slack and GitHub.

1. Thought Leadership in Data Science

Reading this report in HTML

Each section is split across two tabs. Please read both tabs of each section; otherwise the paper will be confusing.

Overview

DATA SCIENCE THOUGHT LEADERSHIP

Project Objective: The objective of this project is to answer the following questions: 1) Who are today’s “thought leaders” in data science? 2) What are the topics that data scientists care most about? 3) How do these change over time, and across geographical location?

Let’s start by defining the terms: Thought Leader and Data Science.

A thought leader is an individual, organization or nation state that is recognized as an authority in a specialized field and whose expertise is sought and often rewarded. Thought leaders are trusted sources who move and inspire people with innovative ideas, turn ideas into reality, and know and show how to replicate their success. They are commonly asked to speak at public events, conferences or webinars to share their insight with a relevant audience. The Oxford English Dictionary gives as its first citation for the phrase an 1887 description of Henry Ward Beecher as “one of the great thought-leaders in America.”

Data science (DS) is a multi-disciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from structured and unstructured data. Data science encompasses the fields of data mining and big data. It is a “concept to unify statistics, data analysis, machine learning and their related methods” in order to “understand and analyze actual phenomena” with data. It employs techniques and theories drawn from many fields within the context of mathematics, statistics, information science, and computer science. Data science is closely related to fields such as Machine Learning (ML), Artificial Intelligence (AI) and Computer Visualization.

There are some inherent contradictions associated with thought leadership. Thinking is supposed to be an individual activity, in which one relies on one’s own logic and intelligence to form an understanding or opinion and to make choices, so it is paradoxical to anoint external thoughts as the “leader” or best way to think about a subject. Having said that, the real difficulty arises when trying to decide who or what makes someone a thought leader in any given field. This issue is further compounded if the field in question is a rapidly changing, complex domain such as Data Science. There are no readily available metrics (for example, a Nobel Prize in Data Science) for determining thought leadership.

Outline of paper structure and summary of findings

Given this, we’ve adopted the following approach and structure for this project:

  1. People as Thought Leaders
    1. Based on popular metrics such as number of followers, blogs, tweets, web lists.
    2. Based on academic research papers and publications presented at industry conferences.
  2. Universities as Thought Leaders
    1. Based on faculty research published.
    2. Based on academic enrolment in courses in Data Science and related subjects.
  3. Countries as Thought Leaders
    1. Based on publications.

This report is organized by the entity considered as a thought leader. We begin by analyzing the role of individuals as thought leaders. One section considers individuals based on popular acclaim, while the following section considers tangible research metrics such as publication statistics. Next, we examine universities, possibly the preeminent type of organization for this type of study. This is followed by our evaluation of nations and geographical regions. Then, we focus on trends in research topics, looking at which subtopics of data science have waxed and waned. The final section concludes.

We also attempt to show how one could determine the sub-topics that the individual thought leaders under 1.1 are most interested in, by taking a deeper dive into scraping and parsing material available on the web.

Strengths and weaknesses of our and other approaches

Our methods and research relied on gathering information from research publications, trends in topics of interest, geographic trends, academic metrics and popular narratives. Most of the methods and analyses are based solely on either social media or university research, which might not give a complete 360-degree view. For instance, social media analysis can tell us who is famous and who has the largest following, but it does not establish that the person or firm in question is a thought leader. Nikola Tesla, a pioneer in his field, had no social media presence, yet his work touched human existence.

3. Thought Leadership Through Research

Introduction and Analysis

Measuring Thought Leadership Through Research Paper Counts

The arXiv website is one of the leading electronic repositories for academic research in multiple fields, including computer science, physics, mathematics and statistics. Researchers submit papers electronically, and the papers are catalogued in the arXiv database. By analyzing how actively a researcher submits papers on data science topics to arXiv, we get an objective, quantifiable, and relevant measure of their thought leadership.

In the next section, we describe the data collection process, its limitations and its outputs. After showing how the raw files are processed, we wrangle the consolidated data into usable form. Then we present rankings of the top leaders and descriptive statistics of the papers, and conclude with some interpretative remarks.

Data Collection Process

We obtain detailed paper submission metadata through the R package aRxiv, an interface to the arXiv API. This API allows us to query papers based on useful criteria such as:

  • submission date
  • authors
  • subject classification (self-described by authors)
  • title of the papers

The date range and subject classifications used below are native arXiv query fields and categories.
Our 7 subject classification categories are the same as those chosen by the AIIndex.org 2018 paper in its data collection methodology (AIIndex 2018). Because we discuss the arXiv API at length in another section of this project on time trends in research, we give only a brief summary here.

In this section, we identify the specific steps relevant to author paper counts.

To obtain the authors and titles of papers, we must download the full record of each paper submission. Downloading this raw data took 1 hour, through a series of trial-and-error batch scripts, because of two issues:

  1. The server limits the number of records returned if the count is too high (over 15000 per pull).
  2. The server disables the requester’s API access if numerous requests are submitted in parallel or within a short time.

By defining granular queries, we limited most API requests to under 5000 records and successfully gathered all paper records.
This yielded 70 raw files by year and category. We combined them into a single big file in two steps: we aggregated all years into one category file, and all category files into a single big file.
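
One way to define more granular queries is to split each year into monthly date ranges. The sketch below is illustrative only: `monthly_queries` is a hypothetical helper that is not part of our download scripts, and it assumes the `YYYYMMDDHHMM` timestamp format for `submittedDate` ranges described in the arXiv API user manual. It only builds query strings; each string could then be passed to `arxiv_count()` or `arxiv_search()` with a pause between calls.

```r
# Hypothetical helper (an assumption, not the code used for this report):
# build 12 monthly arXiv query strings for one category and one year.
monthly_queries <- function(category, year) {
  # First day of each month in the year
  starts <- seq(as.Date(sprintf("%d-01-01", year)), by = "month", length.out = 12)
  # Last day of each month = first day of the following month minus one day
  ends <- seq(as.Date(sprintf("%d-02-01", year)), by = "month", length.out = 12) - 1
  sprintf("cat:%s AND submittedDate:[%s0000 TO %s2359]",
          category, format(starts, "%Y%m%d"), format(ends, "%Y%m%d"))
}

qs <- monthly_queries("cs.LG", 2018)
print(qs[1:2])
```

Smaller date windows keep each request well under the server limits, at the cost of more requests, so the sleep between calls matters even more.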

The big file had 7 columns:

  • id (unique identifier of the article)
  • submitted (date/time of submission)
  • updated (date/time of last revision submitted)
  • title (name of the paper)
  • authors (a pipe-delimited list of co-authors of the paper)
  • primary_category (used for the query)
  • categories (pipe-delimited list of alternate categories)

The most important step in producing a single consolidated records file was eliminating an unnecessary field: the abstract. Each paper’s record includes its abstract, which for most papers represents 90 percent of the record size. Dropping it was needed to allow all records to fit into memory on our PC.

The final result is a flat file with 57193 records called output_all_subjects.csv.

Code to download and merge data files

The code below downloads the required data as described in the previous section. Because the arXiv server API may produce variable results or throttle access, we show but do not run the code block below. This is controlled by setting eval=FALSE in the relevant code chunks.

#####  A list of categories and years
#####  -----------------------------------------------------------------------------
ds_categories = c("stat.ML", "cs.AI" , "cs.CV", "cs.LG", "cs.RO", "cs.CL", "cs.NE")
ds_descriptions = c("Stat Machine Learning", "Artificial Intelligence" , "Computer Vision",
                    "Computer Learning", "Robotics", "Computation and Language" ,
                      "Neural and Evolutionary Computing")
subject_names = data.frame( categories = ds_categories ,
                            desc = ds_descriptions, stringsAsFactors = FALSE)
years_list = c( 2009:2018 )
#####  Set up an empty dataframe of years range for row and data science
#####  topics for columns.   Values will store paper counts in arXiv by year and topic.
##### ----------------------------------------------------------------------------------
info = data.frame (matrix( data = 0, 
                           nrow = length(years_list), 
                           ncol = length(ds_categories) ,
                           dimnames = list( as.character( years_list ), ds_categories ) ) )

#####   Query the arXiv server for paper counts:  
#####   Outer loop is on subjects
#####  Inner loop is on years.
#####  -------------------------------------------------------------------------
for(subject in subject_names$categories )
{
  for( y in years_list )
  {
    
    
    qry = paste0("cat:", subject, " AND submittedDate:[", 
                 y, 
                 " TO ", 
                 y+1, "]")
    
    qry_count = arxiv_count(qry)
    qry_details = arxiv_search(qry, batchsize = 100, limit = 11000, start = 0 )
    
    info[as.character(y), subject] = qry_count
    cat(qry, " ", qry_count, "\n")   # cat() interprets "\n" as a newline; print() would show it literally
    
    output_filename = paste0(subject, "_", y, "_", "results.csv")
    
    write.csv(qry_details, file = output_filename)
    
    print(paste0("Wrote file: ", output_filename, " ", Sys.time() ) )
    
    #####  Sleeping is essential to throttle API load on the arXiv server.
    #####  ------------------------------------------------------------------
    Sys.sleep(5)
  }
}
print("Retrieval of arXiv query records is now completed.")
for(j in seq_along(ds_categories ) )
{
    subject = ds_categories[j]  
    outputdf = list( )   
    
    my_files = paste0(subject, "_", years_list, "_", "results.csv")
    
    for( i in  seq_along(my_files) )
    {
       fulldata <- read_csv(file = my_files[i])
       print(paste0( "Loaded ", i, " ", my_files[i] ) )
       
       #####   Strip out the abstract which takes up most file space.
       #####  ------------------------------------------------------------------------
       fulldata %>% select( id, submitted, updated, title, authors, primary_category, categories) -> tempdata
       
       outputdf[[i]] = tempdata    
       
       Sys.sleep(1)
    }
    
    #####   Write all the year files for one subject to one tibble and then 
    #####   dump to one subject specific file
    #####  -----------------------------------------
    big_data = bind_rows(outputdf)
    
    output_big_subject = paste0("bigdata_", subject, ".csv")
  
    write_csv(big_data, output_big_subject )
    
    print(paste0( "Wrote file ", output_big_subject, " to disk ", Sys.time() ) )
}
my_files = paste0("bigdata_", ds_categories, ".csv")
  
outputdf = list()
for(j in seq_along(my_files ) )
{
    fulldata <- read_csv(file = my_files[j])
    print(paste0( "Loaded ", j, " ", my_files[j] ) )
    outputdf[[j]] = fulldata    
    Sys.sleep(1)
}
  
#####  We row-bind the list of dataframes into one big one using
#####  a nice one-liner in dplyr.   The result is one big tibble.
##### ---------------------------------------------------------------------
big_data = bind_rows(outputdf)
    
output_all_subjects = "output_all_subjects.csv"
    
write_csv(big_data, output_all_subjects )
    
print(paste0( "Wrote file ", output_all_subjects, " to disk ", Sys.time() ) )

Wrangling the data

The entire analysis in this section depends only on loading the raw files in the next code chunk. We illustrate the content with a few records below.

big_paper_set = read_csv("https://raw.githubusercontent.com/completegraph/DATA607PROJ3/master/Code/output_all_subjects.csv")
## Parsed with column specification:
## cols(
##   id = col_character(),
##   submitted = col_datetime(format = ""),
##   updated = col_datetime(format = ""),
##   title = col_character(),
##   authors = col_character(),
##   primary_category = col_character(),
##   categories = col_character()
## )
knitr::kable(head(big_paper_set, 4) ,  
             caption = "Representative Records from the Paper Records" ) %>%
       kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Representative Records from the Paper Records
id submitted updated title authors primary_category categories
0901.0356v1 2009-01-05 06:37:01 2009-01-05 06:37:01 Information, Divergence and Risk for Binary Experiments Mark D. Reid|Robert C. Williamson stat.ML stat.ML|math.ST|stat.TH
0901.1365v1 2009-01-10 12:12:31 2009-01-10 12:12:31 Differential Privacy with Compression Shuheng Zhou|Katrina Ligett|Larry Wasserman stat.ML stat.ML|math.ST|stat.TH
0901.1504v2 2009-01-12 05:02:18 2009-10-13 03:01:23 A D.C. Programming Approach to the Sparse Generalized Eigenvalue Problem Bharath Sriperumbudur|David Torres|Gert Lanckriet stat.ML stat.ML|stat.ME
0901.2044v2 2009-01-14 15:34:13 2010-10-21 13:30:14 SPADES and mixture models Florentina Bunea|Alexandre B. Tsybakov|Marten H. Wegkamp|Adrian Barbu math.ST math.ST|stat.ML|stat.TH

Next, we remove duplicate records in the raw data set. Duplicate records arise because a paper may be classified as matching two or more computer science categories. For example, a paper may fall into Statistical Machine learning (stat.ML) and Computer Vision (cs.CV). This removes roughly 12000 duplicate records.

#####  Remove duplicate records and retain all columns.
#####  -----------------------------------------------------------------------------------
big_paper_clean <- ( big_paper_set %>% distinct( id, authors, .keep_all = TRUE))
nrow(big_paper_clean)
## [1] 45150
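
To illustrate what the de-duplication step does, here is a minimal sketch on made-up records (the ids, authors and the `query_category` column are hypothetical). A paper returned by two category queries appears twice, and `distinct()` with `.keep_all = TRUE` keeps only the first occurrence together with all of its columns.

```r
library(dplyr)

# Made-up records: the first paper was returned by two category queries.
toy <- tibble::tibble(
  id             = c("0901.0001v1", "0901.0001v1", "0901.0002v1"),
  authors        = c("A|B", "A|B", "C"),
  query_category = c("stat.ML", "cs.CV", "cs.LG")
)

# Keep one row per (id, authors) pair; the first matching row wins.
toy_clean <- distinct(toy, id, authors, .keep_all = TRUE)
print(toy_clean)
```

Because the first occurrence wins, the surviving row reflects whichever category file was read first, which is harmless here since we only count papers per author.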
paper_authors = big_paper_clean$authors
author_names  = str_split(paper_authors, "\\|")  # separates all the authors
#####  The coauthors of a paper are listed consecutively, preceded by all authors
#####  of earlier papers.
#####  ---------------------------------------------------------------------
authors_unlisted = unlist(author_names)
num_author_paper_tuple = length(authors_unlisted)
#####  Index j corresponds to the j-th paper in big_paper_clean
#####  Value at index j corresponds to the number of co-authors in paper j
#####  ----------------------------------------------------------------------
vec_coauthor_counts = unlist( lapply(author_names, length ) )
paper_author_map = tibble( id  = character(num_author_paper_tuple), author = character(num_author_paper_tuple) )
idx_unlisted = 0

The following code chunk maps papers to authors in a one-to-many relationship. Because generating the mapping is slow (over 10 minutes), we save the results to a flat file and set eval=FALSE. In the next step, the data is reloaded from the file into a dataframe for analysis.

for( id_idx in  1:length(big_paper_clean$id)  )
{
     num_coauthors = vec_coauthor_counts[id_idx]
  
     for(s in 1:num_coauthors)
     {
          paper_author_map$id[ idx_unlisted + s ] = big_paper_clean$id[ id_idx]  
          paper_author_map$author[ idx_unlisted + s ] = authors_unlisted [ idx_unlisted + s]
      
     }
     idx_unlisted = idx_unlisted + num_coauthors
     if( id_idx %% 100 == 0 )
     {
         print(paste0(" idx = ", id_idx))
     }
}
write_csv(paper_author_map, "paper_author_map.csv")
paper_author_map = read_csv("https://raw.githubusercontent.com/completegraph/DATA607PROJ3/master/Code/paper_author_map.csv")
## Parsed with column specification:
## cols(
##   id = col_character(),
##   author = col_character()
## )
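
The element-by-element loop above is slow mainly because it assigns into the tibble one cell at a time. As a possible alternative, the same one-to-many mapping can be built in a single vectorized step by repeating each paper id once per co-author. The sketch below uses made-up ids and names and is not the code we ran to produce paper_author_map.csv.

```r
# Made-up inputs standing in for big_paper_clean$id and big_paper_clean$authors.
toy_ids     <- c("p1", "p2")
toy_authors <- c("Ann Lee|Bo Chen", "Ann Lee")

# Split the pipe-delimited author strings; fixed = TRUE treats "|" literally.
split_names <- strsplit(toy_authors, "|", fixed = TRUE)

# Repeat each id once per co-author and flatten the author lists.
toy_map <- data.frame(
  id     = rep(toy_ids, times = lengths(split_names)),
  author = unlist(split_names),
  stringsAsFactors = FALSE
)
print(toy_map)
```

Applied to the real data, the same idea would use `rep(big_paper_clean$id, times = vec_coauthor_counts)` together with `authors_unlisted`, both of which are already computed earlier in this section.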

Summary and Conclusions

by_author <- group_by( paper_author_map , author )
rankings <- summarize( by_author, numPapers = n() ) %>% arrange( desc( numPapers))
knitr::kable(head(rankings, 30) , caption = "Top 30 Authors by Data Science Paper Counts (2009-2018)")
Top 30 Authors by Data Science Paper Counts (2009-2018)
author numPapers
Yoshua Bengio 174
Sergey Levine 105
Pieter Abbeel 100
Michael I. Jordan 98
Uwe Aickelin 97
Chunhua Shen 89
Francis Bach 80
Kyunghyun Cho 78
Toby Walsh 78
Zoubin Ghahramani 74
Masashi Sugiyama 73
Eric P. Xing 72
Shie Mannor 72
Damien Chablat 69
Max Welling 69
Ruslan Salakhutdinov 69
Aaron Courville 65
Andreas Krause 62
Trevor Darrell 62
Lawrence Carin 61
Chris Dyer 59
Mita Nasipuri 59
Nathan Srebro 59
Roland Siegwart 58
Tong Zhang 56
Yann LeCun 56
Nando de Freitas 54
Bernhard Schölkopf 53
Marcus Hutter 52
Martin J. Wainwright 52
summary( rankings)
##     author            numPapers    
##  Length:67625       Min.   :  1.0  
##  Class :character   1st Qu.:  1.0  
##  Mode  :character   Median :  1.0  
##                     Mean   :  2.2  
##                     3rd Qu.:  2.0  
##                     Max.   :174.0

We conclude that the most influential data scientist by paper count is Yoshua Bengio with 174 papers. He is noted for his expertise in deep learning along with Geoffrey Hinton and Yann LeCun. By comparison, other thought leaders mentioned earlier, such as Kira Radinsky, have written only 4 papers. We also see that the average number of papers per author is 2.2, with a median of 1 paper. Thus, the distribution of papers per author is highly skewed to the right.
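
A mean well above the median is the usual signature of right skew. The sketch below shows the check on a made-up set of five authors, chosen so that the mean and median match the figures above; the numbers are illustrative and are not taken from the rankings table.

```r
# Made-up paper counts for five hypothetical authors.
toy_rankings <- data.frame(
  author    = c("A", "B", "C", "D", "E"),
  numPapers = c(6, 2, 1, 1, 1)
)

mean_papers   <- mean(toy_rankings$numPapers)       # pulled up by the one prolific author
median_papers <- median(toy_rankings$numPapers)     # the typical author
share_single  <- mean(toy_rankings$numPapers == 1)  # fraction of authors with one paper

print(c(mean = mean_papers, median = median_papers, share_single = share_single))
```

Running the same three summaries on `rankings$numPapers` would show how much of the author population has contributed only a single paper.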

We conclude that thought leadership within the field of academic research does not equate to business thought leadership. However, without conceptual innovations made possible by academia, the application of these ideas to business is impossible.


6. Academia: Top Ranked Universities in AI, ML or DS

Introduction and Analysis

The section below loads the required packages and input data files containing the following details of research papers in Artificial Intelligence (AI), Data Science (DS), Machine Learning (ML) and Visualization (VI): the names of the faculty members who authored these papers, the universities they are affiliated with, the conferences they presented at, and the year of publication. The conferences selected were restricted to those in the above fields only.

fileurl<-"https://raw.githubusercontent.com/completegraph/DATA607PROJ3/master/Code/author-info-ds.csv"
ds.authors<-read.csv(file=fileurl, header=TRUE, na.strings = "NA", stringsAsFactors = TRUE)
#ds.authors<-fread(fileurl, header=TRUE, na.strings = "NA", stringsAsFactors = FALSE)
#ds.authors

The section below filters the data for the 2 main columns of interest: University and Adjusted Count. The adjusted count is a score that measures contribution by authors based on joint ownership with other authors. The detailed methodology is available at http://csrankings.org/#/index?all

The analysis below aims to determine which university across the globe can be deemed a thought leader in the Data Science, Artificial Intelligence, Machine Learning and Visualization areas, based on the research papers contributed by its faculty members.

The following section selects the required columns and calculates an aggregate of the adjusted count by university. It then renames the columns and derives the top 10 universities based on this adjusted count metric.

paper.count<-ds.authors%>%select(university,adjustedcount)
summary.paper.count<-aggregate(. ~ university, data = paper.count, sum)%>%setorder(-adjustedcount)
summary.paper.count$adjustedcount<-round(summary.paper.count$adjustedcount,2)
colnames(summary.paper.count)<-c("University", "Research_Adj_Count")
#summary.paper.count
top10<-summary.paper.count[1:10,]
top10
##                                University Research_Adj_Count
## 26             Carnegie Mellon University             515.84
## 254   University of California - Berkeley             250.29
## 352                   University of Tokyo             239.10
## 65        Georgia Institute of Technology             204.46
## 122 Massachusetts Institute of Technology             200.75
## 338     University of Southern California             198.27
## 328            University of Pennsylvania             194.29
## 242                 University of Alberta             173.05
## 212                             TU Munich             172.86
## 38                     Cornell University             167.04
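
To make the adjusted count metric concrete, the sketch below assumes the CSRankings convention that each paper counts once, with credit split evenly among its co-authors. The papers and universities are made up, and the input file used above already contains the adjusted counts, so this step is shown only for illustration.

```r
# Made-up author-paper records: paper A has 2 authors, B has 1, C has 3.
toy <- data.frame(
  paper      = c("A", "A", "B", "C", "C", "C"),
  university = c("U1", "U2", "U1", "U1", "U2", "U3"),
  stringsAsFactors = FALSE
)

# Assumed rule: each author of an n-author paper receives 1/n credit.
toy$adjustedcount <- 1 / ave(seq_along(toy$paper), toy$paper, FUN = length)

# Sum the credit by university and sort in descending order.
toy_summary <- aggregate(adjustedcount ~ university, data = toy, FUN = sum)
toy_summary <- toy_summary[order(-toy_summary$adjustedcount), ]
print(toy_summary)
```

Under this rule a university is rewarded for many papers with few co-authors, so large collaborations contribute less per author than a raw paper count would suggest.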

The following section generates a barplot showing the top 10 universities by adjusted count of research papers. It shows that Carnegie Mellon sits at the top of the stack by a big margin, with roughly twice the adjusted count of second-placed Berkeley. It can therefore be considered the prime thought leader from an institutional perspective.

As a topic for further research, it is notable that a big name like Stanford University is missing from the top 10 universities. We suspect that this could be on account of factors like departmental affiliations of faculty members and their choices on whether to present their research at pure Data Science type conferences vis-a-vis other conferences geared towards other domains such as Statistics or Economics. Also, it’s likely that if the focus is extended to all of Computer Science instead of a narrower selection of AI, DS, ML etc, then universities like Stanford may show a more significant presence while perhaps the universities in the top 10 are more exclusively focusing on AI, ML and DS research.

library(ggplot2)
library(RColorBrewer)
ggplot(top10, aes(x = reorder(University, Research_Adj_Count), y = Research_Adj_Count, fill = University)) +
  geom_bar(stat = "identity", color = "black") +
  coord_flip() +
  theme(legend.position = "none") +
  ylab("Adjusted Research Paper Count") +
  xlab("Universities as Thought Leaders")

Percentage of AI and ML course enrollments in US Universities at the Undergraduate Level

fileurl2<-"https://raw.githubusercontent.com/completegraph/DATA607PROJ3/master/Code/USAI-MLUndergradEnrolmentPercentage.csv"
undergrad<-read.csv(fileurl2)
colnames(undergrad)<-c("University", "Domain", "2010", "2011", "2012", "2013", "2014", "2015", "2016", "2017")
#undergrad
undergrad.long<-gather(undergrad, key="Year", value=Percent_Share, 3:10)
undergrad.long$UnivDomain=paste(undergrad.long$University, undergrad.long$Domain, sep="-")
undergrad.long$Percent_Share=round(undergrad.long$Percent_Share*100,2)
undergrad.long
##    University Domain Year Percent_Share  UnivDomain
## 1    Berkeley     AI 2010             2 Berkeley-AI
## 2    Stanford     AI 2010             3 Stanford-AI
## 3        UIUC     AI 2010             1     UIUC-AI
## 4          UW     AI 2010             0       UW-AI
## 5    Berkeley     ML 2010             0 Berkeley-ML
## 6    Stanford     ML 2010             5 Stanford-ML
## 7        UIUC     ML 2010             0     UIUC-ML
## 8          UW     ML 2010             0       UW-ML
## 9    Berkeley     AI 2011             2 Berkeley-AI
## 10   Stanford     AI 2011             3 Stanford-AI
## 11       UIUC     AI 2011             0     UIUC-AI
## 12         UW     AI 2011             0       UW-AI
## 13   Berkeley     ML 2011             0 Berkeley-ML
## 14   Stanford     ML 2011             5 Stanford-ML
## 15       UIUC     ML 2011             0     UIUC-ML
## 16         UW     ML 2011             0       UW-ML
## 17   Berkeley     AI 2012             3 Berkeley-AI
## 18   Stanford     AI 2012             3 Stanford-AI
## 19       UIUC     AI 2012             1     UIUC-AI
## 20         UW     AI 2012             0       UW-AI
## 21   Berkeley     ML 2012             0 Berkeley-ML
## 22   Stanford     ML 2012             8 Stanford-ML
## 23       UIUC     ML 2012             0     UIUC-ML
## 24         UW     ML 2012             0       UW-ML
## 25   Berkeley     AI 2013             3 Berkeley-AI
## 26   Stanford     AI 2013             3 Stanford-AI
## 27       UIUC     AI 2013             1     UIUC-AI
## 28         UW     AI 2013             1       UW-AI
## 29   Berkeley     ML 2013             1 Berkeley-ML
## 30   Stanford     ML 2013            10 Stanford-ML
## 31       UIUC     ML 2013             0     UIUC-ML
## 32         UW     ML 2013             0       UW-ML
## 33   Berkeley     AI 2014             4 Berkeley-AI
## 34   Stanford     AI 2014             5 Stanford-AI
## 35       UIUC     AI 2014             1     UIUC-AI
## 36         UW     AI 2014             1       UW-AI
## 37   Berkeley     ML 2014             1 Berkeley-ML
## 38   Stanford     ML 2014            11 Stanford-ML
## 39       UIUC     ML 2014             1     UIUC-ML
## 40         UW     ML 2014             1       UW-ML
## 41   Berkeley     AI 2015             4 Berkeley-AI
## 42   Stanford     AI 2015             8 Stanford-AI
## 43       UIUC     AI 2015             1     UIUC-AI
## 44         UW     AI 2015             1       UW-AI
## 45   Berkeley     ML 2015             3 Berkeley-ML
## 46   Stanford     ML 2015            13 Stanford-ML
## 47       UIUC     ML 2015             0     UIUC-ML
## 48         UW     ML 2015             1       UW-ML
## 49   Berkeley     AI 2016             3 Berkeley-AI
## 50   Stanford     AI 2016             9 Stanford-AI
## 51       UIUC     AI 2016             1     UIUC-AI
## 52         UW     AI 2016             1       UW-AI
## 53   Berkeley     ML 2016             3 Berkeley-ML
## 54   Stanford     ML 2016             9 Stanford-ML
## 55       UIUC     ML 2016             1     UIUC-ML
## 56         UW     ML 2016             1       UW-ML
## 57   Berkeley     AI 2017             4 Berkeley-AI
## 58   Stanford     AI 2017            12 Stanford-AI
## 59       UIUC     AI 2017             3     UIUC-AI
## 60         UW     AI 2017             2       UW-AI
## 61   Berkeley     ML 2017             2 Berkeley-ML
## 62   Stanford     ML 2017            13 Stanford-ML
## 63       UIUC     ML 2017             2     UIUC-ML
## 64         UW     ML 2017             1       UW-ML
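
The wide-to-long reshaping above uses `gather()`. Recent versions of tidyr (1.0.0 and later) also provide `pivot_longer()` for the same task. The sketch below shows the equivalent call on a made-up two-row table; it assumes a tidyr version that includes `pivot_longer()` and is not the code used to produce the output above.

```r
library(tidyr)

# Made-up wide table: one row per region, one column per year.
toy_wide <- data.frame(
  Region = c("China", "Europe"),
  `1998` = c(0.08, 0.35),
  `1999` = c(0.06, 0.35),
  check.names = FALSE
)

# Every column except Region becomes a (Year, Percentage_Share) pair.
toy_long <- pivot_longer(toy_wide,
                         cols      = -Region,
                         names_to  = "Year",
                         values_to = "Percentage_Share")
print(toy_long)
```

Note that `pivot_longer()` returns rows grouped by the original row (all years for China, then all years for Europe), whereas `gather()` stacks them column by column, so the row order differs even though the content is the same.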

The following graph shows undergraduate enrolment in AI and ML courses at four selected US universities over the 2010-2017 period. Enrolment has been trending up over the past few years at these universities, which we take as indicative of the trend across US universities more broadly.

library(ggplot2)
ggplot(undergrad.long, aes(x = Year, y = Percent_Share, group = UnivDomain, colour = UnivDomain)) +
  geom_line() +
  xlab("Years") +
  ylab("Percent of Total")

Regions and Countries as Thought Leaders

The following section collects the inputs: regional percentage share of Artificial Intelligence publications over the 1998-2017 period. The data is loaded and tidied from a wide format to a long format, setting it up for further analysis.

fileurl<-"https://raw.githubusercontent.com/completegraph/DATA607PROJ3/master/Code/RegionalShareofAIPublications.csv"
regional.ai<-read.csv(fileurl)
colnames(regional.ai)<-c("Region","1998","1999","2000","2001","2002","2003","2004","2005","2006","2007","2008","2009","2010","2011", "2012","2013","2014","2015","2016","2017")
#regional.ai
regional.ai.long<-gather(regional.ai, key="Year", value=Percentage_Share, 2:21)
regional.ai.long
##           Region Year Percentage_Share
## 1          China 1998             0.08
## 2  United States 1998             0.28
## 3         Europe 1998             0.35
## 4  Rest of World 1998             0.28
## 5          China 1999             0.06
## 6  United States 1999             0.27
## 7         Europe 1999             0.35
## 8  Rest of World 1999             0.32
## 9          China 2000             0.09
## 10 United States 2000             0.25
## 11        Europe 2000             0.35
## 12 Rest of World 2000             0.31
## 13         China 2001             0.10
## 14 United States 2001             0.23
## 15        Europe 2001             0.36
## 16 Rest of World 2001             0.31
## 17         China 2002             0.12
## 18 United States 2002             0.24
## 19        Europe 2002             0.34
## 20 Rest of World 2002             0.30
## 21         China 2003             0.14
## 22 United States 2003             0.23
## 23        Europe 2003             0.35
## 24 Rest of World 2003             0.28
## 25         China 2004             0.16
## 26 United States 2004             0.22
## 27        Europe 2004             0.33
## 28 Rest of World 2004             0.28
## 29         China 2005             0.18
## 30 United States 2005             0.22
## 31        Europe 2005             0.32
## 32 Rest of World 2005             0.27
## 33         China 2006             0.20
## 34 United States 2006             0.19
## 35        Europe 2006             0.33
## 36 Rest of World 2006             0.28
## 37         China 2007             0.19
## 38 United States 2007             0.18
## 39        Europe 2007             0.33
## 40 Rest of World 2007             0.30
## 41         China 2008             0.27
## 42 United States 2008             0.15
## 43        Europe 2008             0.30
## 44 Rest of World 2008             0.28
## 45         China 2009             0.29
## 46 United States 2009             0.14
## 47        Europe 2009             0.31
## 48 Rest of World 2009             0.26
## 49         China 2010             0.27
## 50 United States 2010             0.14
## 51        Europe 2010             0.31
## 52 Rest of World 2010             0.28
## 53         China 2011             0.25
## 54 United States 2011             0.15
## 55        Europe 2011             0.31
## 56 Rest of World 2011             0.29
## 57         China 2012             0.24
## 58 United States 2012             0.15
## 59        Europe 2012             0.31
## 60 Rest of World 2012             0.30
## 61         China 2013             0.24
## 62 United States 2013             0.15
## 63        Europe 2013             0.32
## 64 Rest of World 2013             0.29
## 65         China 2014             0.24
## 66 United States 2014             0.15
## 67        Europe 2014             0.31
## 68 Rest of World 2014             0.29
## 69         China 2015             0.23
## 70 United States 2015             0.18
## 71        Europe 2015             0.32
## 72 Rest of World 2015             0.28
## 73         China 2016             0.24
## 74 United States 2016             0.17
## 75        Europe 2016             0.30
## 76 Rest of World 2016             0.30
## 77         China 2017             0.25
## 78 United States 2017             0.17
## 79        Europe 2017             0.28
## 80 Rest of World 2017             0.30

Countries as Thought Leaders

The following section shows a regional breakdown of AI papers published on Scopus for the years 1998-2017. The source of this data is Elsevier. The broad regional categories are China, the United States, Europe and Rest of World (RoW). Based on this, it can be seen that Europe is the leading contributor to papers and publications in this domain over the years, followed closely by RoW. China can be seen steadily increasing its share of research publications in this area. Based on this metric, Europe can be considered the Thought Leader from a regional perspective.

ggplot(regional.ai.long, aes(x=Year, y=Percentage_Share, fill=Region)) +
geom_bar(stat="identity", colour="black") +
guides(fill=guide_legend(reverse=TRUE)) +
scale_fill_brewer(palette="Pastel1") + theme(text = element_text(size=11),axis.text.x = element_text(angle=90, hjust=1))
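
As a numeric cross-check on the regional ranking, one could average each region's share across years. The sketch below does this for only the first and last years, using the 1998 and 2017 values from the output printed earlier in this section; a fuller check would apply the same two lines to `regional.ai.long`.

```r
# 1998 and 2017 shares copied from the printed regional output.
toy_regional <- data.frame(
  Region = rep(c("China", "United States", "Europe", "Rest of World"), times = 2),
  Year   = rep(c("1998", "2017"), each = 4),
  Percentage_Share = c(0.08, 0.28, 0.35, 0.28,
                       0.25, 0.17, 0.28, 0.30),
  stringsAsFactors = FALSE
)

# Average share per region, sorted from largest to smallest.
mean_share <- aggregate(Percentage_Share ~ Region, data = toy_regional, FUN = mean)
mean_share <- mean_share[order(-mean_share$Percentage_Share), ]
print(mean_share)
```

A two-year average hides the year-by-year movement visible in the stacked bar chart, so it should be read alongside the plot rather than in place of it.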

Summary and Conclusion

  1. Machine Learning, Neural Networks and Computer Visualization are showing very rapid growth in research publications.
  2. Undergraduate enrolment in AI and ML courses at selected US universities trended up over the 2010-2017 period, which we take as indicative of the trend across US universities more broadly.
  3. Europe is the leading contributor to papers and publications in this domain over the years, followed closely by RoW. China can be seen steadily increasing its share of research publications in this area. Based on this metric, Europe can be considered the Thought Leader from a regional perspective.
  4. Carnegie Mellon sits at the top of the stack by a big margin, so it can be considered the prime thought leader from an institutional perspective. A big name like Stanford University is missing from the top 10 universities. We suspect that this could be on account of factors like departmental affiliations of faculty members and their choices on whether to present their research at pure Data Science type conferences vis-a-vis other conferences geared towards other domains such as Statistics or Economics. Also, it’s likely that if the focus is extended to all of Computer Science instead of a narrower selection of AI, DS, ML etc, then universities like Stanford may show a more significant presence, while perhaps the universities in the top 10 are more exclusively focusing on AI, ML and DS research.

7. Conclusion

Based on popular narrative, Andrew Ng and Kira Radinsky can be considered two of the primary thought leaders in Data Science. Over the past few years, they have shifted their focus from AI to more specific areas of specialisation such as deep learning, predictive analytics, and speech recognition.

We conclude that the most influential data scientist by paper count is Yoshua Bengio with 174 papers. He is noted for his expertise in deep learning along with Geoffrey Hinton and Yann LeCun. We also conclude that thought leadership within the field of academic research does not equate to business thought leadership. However, without conceptual innovations made possible by academia, the application of these ideas to business is impossible.

Regarding sub-topics being researched, Computer / Machine Learning and Computer Vision have grown in popularity at the cost of pure AI, while Neural Computing and Robotics have remained largely stable.

When it comes to countries and regions, the USA leads in terms of patents in DS/AI/ML, while Europe leads in terms of generating research publications. China is catching up rapidly - for example, Tsinghua University led the way in terms of research papers in this domain during 2018. Universities in the USA are showing a steady increase in enrolment in AI/ML courses, and Carnegie Mellon continues to lead the way in research in this domain.

We relied on a variety of sources for the data used in our analysis, such as the AI Index Report for 2018, the CS Rankings, the arXiv and ACM websites, as well as personal websites of individuals such as Andrew Ng. In terms of future research, we think that understanding what is driving the changes in topics of interest for popular data science professionals as well as academia would be a good area to focus on.