Project 3 - Data 607

R Packages Used

This assignment uses the following packages for data analysis and visualization.

library("dplyr")
library("RCurl")
library("XML")
library("xml2")
library("jsonlite")
library("ggplot2")
library("DT")
library("kableExtra")
library("data.table")
library("tidyr")
library("lubridate")
#library("XLConnectJars")
#library("XLConnect")
library("stringr")
library("formattable")
library("aRxiv")
library("tidyverse")
library("rvest")
library("textrank")
library("lattice")
library("igraph")
library("ggraph")
library("wordcloud")
library("curl")
library("treemap")
options(scipen = 999)

Collaboration and Team Members

Alexander Ng, Arun Reddy, Henry Otuadinma, Jagdish Chhabria

Our team consists of four members. We collaborated using Webex, Slack and GitHub.

1. Thought Leadership in Data Science

Reading this report in HTML

Each section is split across two tabs. Read both tabs of a section; either one alone gives an incomplete picture.

Overview

DATA SCIENCE THOUGHT LEADERSHIP

Project Objective: The objective of this project is to answer the following questions: 1) Who are today’s “thought leaders” in data science? 2) What are the topics that data scientists care most about? 3) How do these change over time, and across geographical location?

Let’s start by defining the terms: Thought Leader and Data Science.

A thought leader is an individual, organization or nation state that is recognized as an authority in a specialized field and whose expertise is sought and often rewarded. They are trusted sources who move and inspire people with innovative ideas, turn ideas into reality, and know and show how to replicate their success. Thought leaders are commonly asked to speak at public events, conferences or webinars to share their insight with a relevant audience. The Oxford English Dictionary gives as its first citation for the phrase an 1887 description of Henry Ward Beecher as “one of the great thought-leaders in America.”

Data science (DS) is a multi-disciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from structured and unstructured data. Data science encompasses the fields of data mining and big data. Data science is a “concept to unify statistics, data analysis, machine learning and their related methods” in order to “understand and analyze actual phenomena” with data. It employs techniques and theories drawn from many fields within the context of mathematics, statistics, information science, and computer science. Data science is closely related to fields such as Machine Learning (ML), Artificial Intelligence (AI) and Computer Visualization.

There are some inherent contradictions associated with thought leadership. Thinking is supposed to be an individual activity, in which one relies on one’s own logic and intelligence to form an understanding and an opinion and to make choices, so it is paradoxical to anoint external thoughts as the “leader” or best way to think about a subject. Having said that, the real difficulty arises when trying to decide who or what makes someone a thought leader in any given field. This issue is further compounded if the field in question is a rapidly changing, complex domain such as Data Science. There are no readily available metrics (for example, a Nobel Prize in Data Science) that can be used to determine thought leadership.

Outline of paper structure and summary of findings

Given this, we’ve adopted the following approach and structure for this project:

People as Thought Leaders
1. Based on popular metrics such as number of followers, blogs, tweets, web lists.
2. Based on academic research papers and publications presented at industry conferences.
Universities as Thought Leaders
1. Based on faculty research published.
2. Based on academic enrolment in courses in Data Science and related subjects.
Countries as Thought Leaders
1. Based on publications.

This report is organized by the entity considered as a thought leader. We begin by analyzing the role of individuals as thought leaders. One section considers individuals based on popular acclaim while the following section considers tangible research metrics such as publication statistics. Next, we examine universities - possibly the preeminent type of organization - for this type of study. This is followed by our evaluation of nations and geographical regions. Then, we focus on trends in research topics looking at which subtopics of data science have waxed and waned. The final section concludes.

We also attempt to show how one could determine the sub-topics that the individual thought leaders identified through popular metrics are most interested in, by scraping and parsing the material available about their work.

Strengths and weaknesses of our and other approaches

Our methods and research relied on gathering information from research publications, trends in topics of interest, geographic trends, academic metrics and popular narratives. Most of the methods and analyses are based solely on either social media or university research, which might not give a complete 360-degree view. For instance, social media analysis can tell us who is famous and who has the largest following, but it does not establish that the person or firm in question is a thought leader. Nikola Tesla, a pioneer in his field, had no social media presence, yet his work touched human existence.

2. People Who Are Thought Leaders Based on Popular Narrative

Introduction and Analysis

We searched for top influencers and thought leaders in data science. This yielded many names, which we narrowed down to the 10 we think have the greatest influence across different areas of research and interest in data science.

For deeper insight, we chose to focus on two of them: Andrew Ng and Kira Radinsky. We retrieved Andrew Ng’s publications from arXiv, and scraped the Association for Computing Machinery (ACM) website for Kira Radinsky’s publications; both yielded useful material for the study. We extracted the abstracts from their publications to see what topics and areas interest them.

We curated a list of the top 10 thought leaders and wrote the list to a CSV file.

thoughtleaders <- read.csv('https://raw.githubusercontent.com/henryvalentine/MSDS2019/master/Classes/DATA%20607/Projects/Project3/Data_Science_Thought_Leaders/thought_leaders.csv', header = TRUE)

datatable(thoughtleaders, colnames= c("Name", "Occupation", "Link"), class = 'cell-border stripe', options = list(
  initComplete = JS(
    "function(settings, json) {",
    "$(this.api().table().header()).css({'background-color': '#1f77b4', 'color': '#fff', 'text-align': 'center !important'});",
    "$(this.api().table().body()).css({'color': '#000', 'text-align': 'center !important'});",
    "}")
))

We compiled a list of tags that we believe feature often in the topics that interest them and saved it to a CSV file. These tags are used to filter relevant keywords from their publications.

#read the keyword tags from csv
tag_ex <- read.csv('https://raw.githubusercontent.com/henryvalentine/MSDS2019/master/Classes/DATA%20607/Projects/Project3/Data_Science_Thought_Leaders/keyword_tags.csv', header = TRUE)
tag_ex <- as.character(tag_ex$x)
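
As a rough illustration of the filtering idea, the sketch below pulls every occurrence of each tag out of a cleaned abstract. It uses base R only (gregexpr and regmatches) rather than the stringr calls used in this report, so that it runs without extra packages. The tags, the sample abstract and the helper extract_tags are hypothetical and are not taken from keyword_tags.csv.

```r
# Hypothetical tags and abstract, for illustration only
tags <- c("deep learning", "neural network", "speech recognition")
abstract <- "We train a Deep Learning model.\nThe neural network improves speech recognition; deep learning helps."

# Normalise whitespace and case before matching, as the report does
clean <- tolower(gsub("[\r\n\t]+", " ", abstract))

# Return every literal occurrence of every tag found in the text
extract_tags <- function(text, tags) {
  hits <- lapply(tags, function(tag) {
    regmatches(text, gregexpr(tag, text, fixed = TRUE))[[1]]
  })
  unlist(hits)
}

found <- extract_tags(clean, tags)
no_match <- extract_tags("an abstract about something else", tags)
print(table(found))
```

An abstract that contains none of the tags yields an empty vector, so any loop over the result should iterate with seq_along() rather than 1:length().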

I. Andrew Y. Ng

His personal website

His publications were sourced from the arXiv API

#These queries returned different results
aNgArxiv = arxiv_search('au: "Andrew Ng"') 
aNgArxiv1 = arxiv_search('au: "Andrew Y. Ng"')

Combine Andrew Ng’s data

aNgDf <- rbind(aNgArxiv, aNgArxiv1)
# Removing the first row because the paper was later withdrawn for corrections
aNgDf = aNgDf[-1,]
row.names(aNgDf) <- NULL

submitted = str_extract(aNgDf$submitted, '\\d+')
anNg <- aNgDf %>% select(title, authors)
anNg['submitted'] <- submitted

datatable(head(anNg), colnames= c("Title", "Author(s)", "Date"), class = 'cell-border stripe', options = list(
  initComplete = JS(
    "function(settings, json) {",
    "$(this.api().table().header()).css({'background-color': '#1f77b4', 'color': '#fff', 'text-align': 'center !important'});",
    "$(this.api().table().body()).css({'color': '#000', 'text-align': 'center !important'});",
    "}")
))

ta <- vector()
abT <- vector()
ayn <- vector()

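for(i in seq_len(nrow(aNgDf)))
{
  row <- aNgDf[i,]
  
  # Normalise whitespace and case, then collect every tag that occurs in the abstract
  k <- row$abstract %>% str_replace_all('\n', ' ')%>%str_replace_all('\t', ' ')%>%str_replace_all('\r', '')%>%str_trim(side='both')%>%tolower()%>% str_extract_all(tag_ex)%>%unlist()
  
  # seq_along() skips abstracts with no matching tag; 1:length(k) would iterate over 1 and 0
  for(j in seq_along(k))
  {
    ta <- c(ta, row$title)
    ayn <- c(ayn, as.numeric(str_extract(row$submitted, "\\d+")))
    abT <- c(abT, k[j])
  }
}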

df <- data.frame(title=ta, year=ayn, keyword=abT)

write to csv

write.csv(df, "an_Ng.csv", row.names=FALSE)

#read from csv
an_df <- read.csv('https://raw.githubusercontent.com/henryvalentine/MSDS2019/master/Classes/DATA%20607/Projects/Project3/Data_Science_Thought_Leaders/an_Ng.csv', header = TRUE)

aNkeywords <-an_df%>% select(year, keyword)%>% group_by(keyword, year) %>% mutate(frequency = n())%>%unique()

sort keywords in descending order

aNkw <- aNkeywords[order(-aNkeywords$frequency),, drop=FALSE]

datatable(head(aNkw), colnames= c("Year", "Keyword", "Frequency"), class = 'cell-border stripe', options = list(
  initComplete = JS(
    "function(settings, json) {",
    "$(this.api().table().header()).css({'background-color': '#1f77b4', 'color': '#fff', 'text-align': 'center !important'});",
    "$(this.api().table().body()).css({'color': '#000', 'text-align': 'center !important'});",
    "}")
))
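
The frequency table above counts how often each keyword appears in each year. A minimal sketch of the same counting step on toy data follows; it uses base R aggregate() instead of the dplyr pipeline used in the report, and the data frame kw_toy is hypothetical.

```r
# Hypothetical keyword occurrences: one row per (paper, keyword) hit
kw_toy <- data.frame(
  year    = c(2015, 2015, 2016, 2016, 2016),
  keyword = c("ai", "ai", "ai", "deep learning", "deep learning"),
  stringsAsFactors = FALSE
)

# Count rows per keyword and year by summing a column of ones
freq <- aggregate(frequency ~ keyword + year,
                  data = transform(kw_toy, frequency = 1L),
                  FUN  = sum)

# Most frequent keyword-year pairs first
freq <- freq[order(-freq$frequency), ]
print(freq)
```

Each keyword-year pair appears once in the result, which is what the group_by, mutate and unique chain produces as well.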

Top 20 keywords in Andrew Ng’s publications over the years

dplt <- ggplot(data=head(aNkw, 20), aes(x = year, y=frequency, fill = keyword)) +
  geom_bar(position="fill", stat = "identity") + 
  ggtitle("top Keywords in Andrew Ng's publications over the years") +
   xlab("Year")+
  theme(plot.title = element_text(lineheight = .8, face = "bold"))
 dplt + theme(legend.position="right")

Top 20 keywords in Andrew Ng’s publications without considering the years

topKeyW <-as.data.frame(table(abT), stringsAsFactors = FALSE)
names(topKeyW)<-c("keyword","frequency")
# Sort by frequency so that head() returns the 20 most frequent keywords
topKeyW <- topKeyW[order(-topKeyW$frequency), ]

# A constant fill colour belongs outside aes(); inside aes() it is treated as a data value
dplt <- ggplot(data=head(topKeyW, 20), aes(x = reorder(keyword, frequency), y=frequency)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  xlab("Keyword")+
  ylab("Frequency")+
  ggtitle("Andrew Ng's top Keywords without the years") +
  theme(plot.title = element_text(lineheight = .8, face = "bold")) +
  theme(axis.text.x = element_text(angle = 90, vjust = .5, size = 9))+ coord_flip()
dplt + theme(legend.position="none")

II. Kira Radinsky

Her personal website

Her publications were sourced by searching for her name on the Association for Computing Machinery website

# Hand-picked links according to relevance
kradlinks <- c('citation.cfm?id=2491802', 'citation.cfm?id=2187918', 'citation.cfm?id=2493181', 'citation.cfm?id=2491802', 'citation.cfm?id=2187918', 'citation.cfm?id=2493181', 'citation.cfm?id=2433500', 'citation.cfm?id=2433448', 'citation.cfm?id=1963455', 'citation.cfm?id=2187958', 'citation.cfm?id=3192292', 'citation.cfm?id=2433431', 'citation.cfm?id=1487070', 'citation.cfm?id=3219882', 'citation.cfm?id=2348364', 'citation.cfm?id=3096469', 'citation.cfm?id=1935850', 'citation.cfm?id=2422275')

kradTitles <- vector()
kradAbstracts <- vector()
kradYears <- vector()

Make a search on http://dl.acm.org and pull links

khtms <- tryCatch(html_nodes(read_html(curl('https://dl.acm.org/results.cfm?within=owners.owner%3DHOSTED&srt=_score&query=Kira+Radinsky&Go.x=0&Go.y=0', handle = new_handle("useragent" = "Mozilla/5.0"))), 'div.details'), 
         error = function(e){list(result = NA, error = e)})

The above search returned a lot of links but they need to be filtered to get the relevant ones

for(i in seq_along(khtms))
{
  href <- html_attr(html_nodes(khtms[i], 'div.title a'), 'href')
  
  # Keep only results with a single title link that is in the hand-picked list
  if(length(href) == 1 && href %in% kradlinks)
  {
    
    kradTitles <- c(kradTitles, khtms[i]%>%html_nodes('div.title a')%>% html_text()%>% str_replace_all('\n', '')%>%str_replace_all('\t', '')%>%str_replace_all('\r', '')%>%str_trim(side='both')%>%tolower())
    
    kradYears <- c(kradYears, khtms[i]%>%html_nodes('span.publicationDate')%>% html_text()%>% str_replace_all('\n', '')%>%str_replace_all('\t', '')%>%str_replace_all('\r', '')%>%str_trim(side='both')%>%tolower())
    
    # Fetch the publication page in flat layout to reach the abstract
    r <- html_node(read_html(curl(paste('https://dl.acm.org/', href, '&preflayout=flat', sep=''), handle = new_handle("useragent" = "Mozilla/5.0"))), 'div.flatbody')
    
    paragraphs <- html_nodes(r, 'p')
    
    # Clean every paragraph, then join them into one abstract string
    pText <- paragraphs %>% html_text()%>% str_replace_all('\n', ' ')%>%str_replace_all('\t', ' ')%>%str_replace_all('\r', '')%>% str_replace_all('\"', '')%>%str_trim(side='both')%>%tolower()
    pTexts <- paste(pText, collapse=" ")
    
    kradAbstracts <- c(kradAbstracts, pTexts)
    
    # Pause between requests to avoid overloading the server
    Sys.sleep(10)
    
  }
  
}

tt <- vector()
aa <- vector()
yy <- vector()

for(i in seq_along(kradAbstracts))
{
  # Normalise whitespace and case, then collect every tag that occurs in the abstract
  k <- kradAbstracts[i] %>% str_replace_all('\n', ' ')%>%str_replace_all('\t', ' ')%>%str_replace_all('\r', '')%>%str_trim(side='both')%>%tolower()%>% str_extract_all(tag_ex)%>%unlist()
  
  # seq_along() skips abstracts with no matching tag
  for(j in seq_along(k))
  {
    tt <- c(tt, kradTitles[i])
    yy <- c(yy, as.numeric(str_extract(kradYears[i], "\\d+")))
    aa <- c(aa, k[j])
  }
}

dfk<- data.frame(title=tt, year=yy, keyword=aa)

write to csv

write.csv(dfk, "kira_radinsky.csv", row.names=FALSE)

Write all keywords to .csv

write.csv(aa, "kr_keywords.csv", row.names=FALSE)

read from .csv

#read from csv
kira_df <- read.csv('https://raw.githubusercontent.com/henryvalentine/MSDS2019/master/Classes/DATA%20607/Projects/Project3/Data_Science_Thought_Leaders/kira_radinsky.csv', header = TRUE)

#read from csv
allkw <- read.csv('https://raw.githubusercontent.com/henryvalentine/MSDS2019/master/Classes/DATA%20607/Projects/Project3/Data_Science_Thought_Leaders/kr_keywords.csv', header = TRUE)
allkw <- as.character(allkw$x)

datatable(head(kira_df), colnames= c("Title", "Year", "Keyword"), class = 'cell-border stripe', options = list(
  initComplete = JS(
    "function(settings, json) {",
    "$(this.api().table().header()).css({'background-color': '#1f77b4', 'color': '#fff', 'text-align': 'center !important'});",
    "$(this.api().table().body()).css({'color': '#000', 'text-align': 'center !important'});",
    "}")
))

kkeywords <-kira_df%>% select(year, keyword)%>% group_by(keyword, year) %>% mutate(frequency = n())%>%unique()

sort keywords in descending order

kw <- kkeywords[order(-kkeywords$frequency),, drop=FALSE]

datatable(head(kw), colnames= c("Year", "Keyword", "Frequency"), class = 'cell-border stripe', options = list(
  initComplete = JS(
    "function(settings, json) {",
    "$(this.api().table().header()).css({'background-color': '#1f77b4', 'color': '#fff', 'text-align': 'center !important'});",
    "$(this.api().table().body()).css({'color': '#000', 'text-align': 'center !important'});",
    "}")
))

Top 20 keywords in Kira Radinsky’s publications over the years

kw1 <- subset(kw, year != '2014' & year != '2015' & year != '2016')

dplt <- ggplot(data=head(kw1, 20), aes(x = year, y=frequency, fill = keyword)) +
  geom_bar(position="fill", stat = "identity") + 
  ggtitle("top Keywords in Kira Radinsky's publications over the years") +
 xlab("Year")+
  theme(plot.title = element_text(lineheight = .8, face = "bold"))
 dplt + theme(legend.position="right")

Top 20 keywords in Kira Radinsky’s publications without considering the years

kTopics <-as.data.frame(table(allkw), stringsAsFactors = FALSE)
names(kTopics) <- c('keyword', 'frequency')
# Sort by frequency so that head() returns the 20 most frequent keywords
kTopics <- kTopics[order(-kTopics$frequency), ]

# A constant fill colour belongs outside aes(); inside aes() it is treated as a data value
dplt <- ggplot(data=head(kTopics, 20), aes(x = reorder(keyword, frequency), y=frequency)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  xlab("Keyword")+
  ylab("Frequency")+
  ggtitle("Kira Radinsky's top Keywords without the years") +
  theme(plot.title = element_text(lineheight = .8, face = "bold")) +
  theme(axis.text.x = element_text(angle = 90, vjust = .5, size = 9))+ coord_flip()
dplt + theme(legend.position="none")

Summary and Conclusion

From the above representations, it appears that both thought leaders focused on AI early on and later shifted their focus to more specialised topics such as deep learning, predictive analytics, and speech recognition. Overall, they have discussed AI more than any other topic because the early stages of their careers were focused mostly on AI; over time they began pursuing more than one area of interest simultaneously.

These two individuals were chosen because we observed that all the leaders followed a similar trend: they start with one broad area of interest and then focus on several specialised topics as time goes by.

We believe that a drilled-down look at the interests of the other thought leaders will yield similar results.

The sources for this study were their publications and papers, which do not reflect their interest areas in their entirety. A more robust approach would also mine appropriate keywords from their tweets, blogs, interviews, and keynotes delivered at conferences.

The list we curated is based on evidence of influence and the dedicated effort these people have put into data science; someone else could therefore arrive at a different list.

3. Thought Leadership Through Research

Introduction and Analysis

Measuring Thought Leadership Through Research Paper Counts

The arXiv website is one of the top electronic repositories for academic research papers in multiple fields: computer science, physics, mathematics and statistics. Researchers submit papers electronically, and the papers are catalogued in the arXiv database. By analyzing how actively a researcher submits papers on data science topics to arXiv, we get an objective, quantifiable, and relevant measure of their thought leadership.

In the next section, we will describe the data collection process, its limitation and outputs. After showing how the data and raw files are processed, we wrangle the consolidated data into usable form. Then, we present rankings of the top leaders and descriptive statistics of the papers and conclude with some interpretative remarks.

Data Collection Process

We obtained detailed paper submission metadata through the R package aRxiv, an interface to the arXiv API. This API allows us to query papers based on useful criteria such as:

submission date
authors
subject classification (self-described by authors)
title of the papers

The subject classifications used below are native arXiv categories.
Our 7 subject categories are the same as those chosen by the AI Index 2018 report (AIIndex.org) in its data collection methodology. Because we discuss the arXiv API at length in another section of this project on time trends in research, we give only a brief summary here.

In this section, we identify the specific steps relevant to author paper counts.

To obtain authors and titles of papers, we require downloading the full record of each paper submission. This raw data required 1 hour to download through a series of trial and error batch scripts because of two issues:

the server limits the number of records returned if the count is too high (over 15000 per pull)
the server disables the requester's API access if numerous requests are submitted in parallel or within a short time.

By defining granular queries, we limited most API requests to under 5000 records and successfully gathered all paper records.
This yielded 70 raw files by year and category. We combined them into a single big file in two steps: we aggregated all years into one category file, and all category files into a single big file.
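
The granular queries can be sketched as one query string and one output file per category and year. The block below only builds the strings and does not contact the arXiv server; the data frame query_grid is an illustrative name, while the query syntax and file naming follow the download code in this report.

```r
ds_categories <- c("stat.ML", "cs.AI", "cs.CV", "cs.LG", "cs.RO", "cs.CL", "cs.NE")
years_list <- 2009:2018

# One row per (year, category) pair: 10 years x 7 categories
query_grid <- expand.grid(year = years_list,
                          category = ds_categories,
                          stringsAsFactors = FALSE)

# Query string restricted to a single category and a single year window
query_grid$query <- paste0("cat:", query_grid$category,
                           " AND submittedDate:[", query_grid$year,
                           " TO ", query_grid$year + 1, "]")

# Output file that would hold the records for that query
query_grid$file <- paste0(query_grid$category, "_", query_grid$year, "_results.csv")

print(head(query_grid, 3))
```

Splitting by both category and year keeps each request small, which is what allows the 70 files to be downloaded one at a time with a pause between requests.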

The big file had 7 columns:

ID (unique identifier of the article)
submitted (date/time of submission)
updated (date/time of last revision submitted)
title (name of the paper)
authors (a pipe-delimited list of co-authors of the paper)
primary_category (used for the query)
categories (pipe-delimited list of alternate categories)

The most important step in producing a single consolidated records file was eliminating an unnecessary field: the abstract. Each paper’s record includes its abstract, which for most papers represents 90 percent of the record size. Given the large file size, this simplification was needed to allow all records to fit into memory on our PC.

The final result is a flat file with 57193 records called output_all_subjects.csv.

Code to download and merge data files

The code below downloads the required data as described in the previous section. Because the arXiv server API may return variable results or throttle access, we show the code block but do not run it. This is controlled by setting eval=FALSE in the relevant code chunks.

#####  A list of categories and years
#####  -----------------------------------------------------------------------------
ds_categories = c("stat.ML", "cs.AI" , "cs.CV", "cs.LG", "cs.RO", "cs.CL", "cs.NE")
ds_descriptions = c("Stat Machine Learning", "Artificial Intelligence" , "Computer Vision",
                    "Machine Learning", "Robotics", "Computation and Language" ,
                      "Neural and Evolutionary Computing")
subject_names = data.frame( categories = ds_categories ,
                            desc = ds_descriptions, stringsAsFactors = FALSE)
years_list = c( 2009:2018 )

#####  Set up an empty dataframe of years range for row and data science
#####  topics for columns.   Values will store paper counts in arXiv by year and topic.
##### ----------------------------------------------------------------------------------
info = data.frame (matrix( data = 0, 
                           nrow = length(years_list), 
                           ncol = length(ds_categories) ,
                           dimnames = list( as.character( years_list ), ds_categories ) ) )

#####   Query the arXiv server for paper counts:  
#####   Outer loop is on subjects
#####  Inner loop is on years.
#####  -------------------------------------------------------------------------
for(subject in subject_names$categories )
{
  for( y in years_list )
  {
    
    
    qry = paste0("cat:", subject, " AND submittedDate:[", 
                 y, 
                 " TO ", 
                 y+1, "]")
    
    qry_count = arxiv_count(qry)
    qry_details = arxiv_search(qry, batchsize = 100, limit = 11000, start = 0 )
    
    info[as.character(y), subject] = qry_count
    print(paste(qry, " ", qry_count, "\n"))
    
    output_filename = paste0(subject, "_", y, "_", "results.csv")
    
    write.csv(qry_details, file = output_filename)
    
    print(paste0("Wrote file: ", output_filename, Sys.time() ) )
    
    #####  Sleeping is essential to throttle API load on the arXiv server.
    #####  ------------------------------------------------------------------
    Sys.sleep(5)
  }
}
print("Retrieval of arXiv query records is now completed.")
for(j in seq_along(ds_categories ) )
{
    subject = ds_categories[j]  
    outputdf = list( )   
    
    my_files = paste0(subject, "_", years_list, "_", "results.csv")
    
    for( i in  seq_along(my_files) )
    {
       fulldata <- read_csv(file = my_files[i])
       print(paste0( "Loaded ", i, " ", my_files[i] ) )
       
       #####   Strip out the abstract which takes up most file space.
       #####  ------------------------------------------------------------------------
       fulldata %>% select( id, submitted, updated, title, authors, primary_category, categories) -> tempdata
       
       outputdf[[i]] = tempdata    
       
       Sys.sleep(1)
    }
    
    #####   Write all the year files for one subject to one tibble and then 
    #####   dump to one subject specific file
    #####  -----------------------------------------
    big_data = bind_rows(outputdf)
    
    output_big_subject = paste0("bigdata_", subject, ".csv")
  
    write_csv(big_data, output_big_subject )
    
    print(paste0( "Wrote file ", output_big_subject, " to disk ", Sys.time() ) )
}
my_files = paste0("bigdata_", ds_categories, ".csv")
  
outputdf = list()
for(j in seq_along(my_files ) )
{
    fulldata <- read_csv(file = my_files[j])
    print(paste0( "Loaded ", j, " ", my_files[j] ) )
    outputdf[[j]] = fulldata    
    Sys.sleep(1)
}
  
#####  We row-bind the list of dataframes into one big one using
#####  a nice one-liner in dplyr.   The result is one big tibble.
##### ---------------------------------------------------------------------
big_data = bind_rows(outputdf)
    
output_all_subjects = "output_all_subjects.csv"
    
write_csv(big_data, output_all_subjects )
    
print(paste0( "Wrote file ", output_all_subjects, " to disk ", Sys.time() ) )

Wrangling the data

The entire analysis in this section depends only on loading the raw files in the next code chunk. We illustrate the content with a few records below.

big_paper_set = read_csv("https://raw.githubusercontent.com/completegraph/DATA607PROJ3/master/Code/output_all_subjects.csv")

## Parsed with column specification:
## cols(
##   id = col_character(),
##   submitted = col_datetime(format = ""),
##   updated = col_datetime(format = ""),
##   title = col_character(),
##   authors = col_character(),
##   primary_category = col_character(),
##   categories = col_character()
## )

knitr::kable(head(big_paper_set, 4) ,  
             caption = "Representative Records from the Paper Records" ) %>%
       kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Representative Records from the Paper Records
id	submitted	updated	title	authors	primary_category	categories
0901.0356v1	2009-01-05 06:37:01	2009-01-05 06:37:01	Information, Divergence and Risk for Binary Experiments	Mark D. Reid|Robert C. Williamson	stat.ML	stat.ML|math.ST|stat.TH
0901.1365v1	2009-01-10 12:12:31	2009-01-10 12:12:31	Differential Privacy with Compression	Shuheng Zhou|Katrina Ligett|Larry Wasserman	stat.ML	stat.ML|math.ST|stat.TH
0901.1504v2	2009-01-12 05:02:18	2009-10-13 03:01:23	A D.C. Programming Approach to the Sparse Generalized Eigenvalue Problem	Bharath Sriperumbudur|David Torres|Gert Lanckriet	stat.ML	stat.ML|stat.ME
0901.2044v2	2009-01-14 15:34:13	2010-10-21 13:30:14	SPADES and mixture models	Florentina Bunea|Alexandre B. Tsybakov|Marten H. Wegkamp|Adrian Barbu	math.ST	math.ST|stat.ML|stat.TH

Next, we remove duplicate records in the raw data set. Duplicate records arise because a paper may be classified as matching two or more computer science categories. For example, a paper may fall into Statistical Machine learning (stat.ML) and Computer Vision (cs.CV). This removes roughly 12000 duplicate records.

#####  Remove duplicate records and keep all columns.
#####  -----------------------------------------------------------------------------------
big_paper_clean <- ( big_paper_set %>% distinct( id, authors, .keep_all = TRUE))
nrow(big_paper_clean)

## [1] 45150
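
To make the de-duplication step concrete, here is a small sketch on toy records. It uses base R duplicated() as a stand-in for dplyr's distinct(); the data frame papers_toy and its contents are hypothetical.

```r
# Hypothetical records: paper 0901.0001v1 was returned by two category queries
papers_toy <- data.frame(
  id      = c("0901.0001v1", "0901.0001v1", "0901.0002v1"),
  authors = c("Ann Lee|Bo Chen", "Ann Lee|Bo Chen", "Cy Diaz"),
  categories = c("stat.ML|cs.CV", "stat.ML|cs.CV", "cs.AI"),
  stringsAsFactors = FALSE
)

# Keep the first row for each (id, authors) pair and retain all columns
is_repeat <- duplicated(papers_toy[, c("id", "authors")])
papers_clean <- papers_toy[!is_repeat, ]

print(nrow(papers_toy) - nrow(papers_clean))  # number of duplicates dropped
```

As with distinct(..., .keep_all = TRUE), the first occurrence of each pair is kept and the other columns are carried along unchanged.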

paper_authors = big_paper_clean$authors
author_names  = str_split(paper_authors, "\\|")  # separates all the authors
#####  The co-authors of each paper are listed consecutively, preceded by all authors
#####  of earlier papers.
#####  ---------------------------------------------------------------------
authors_unlisted = unlist(author_names)
num_author_paper_tuple = length(authors_unlisted)
#####  Index j corresponds to the j-th paper in big_paper_clean
#####  Value at index j corresponds to the number of co-authors in paper j
#####  ----------------------------------------------------------------------
vec_coauthor_counts = unlist( lapply(author_names, length ) )
paper_author_map = tibble( id  = character(num_author_paper_tuple), author = character(num_author_paper_tuple) )
idx_unlisted = 0

The following code chunk maps papers to authors in a one-to-many relationship. Because generating the mapping is slow (over 10 minutes), we save the results to a flat file and set eval=FALSE. In the next step, the data is reloaded from that file into a dataframe for analysis.

for( id_idx in  1:length(big_paper_clean$id)  )
{
     num_coauthors = vec_coauthor_counts[id_idx]
  
     for(s in 1:num_coauthors)
     {
          paper_author_map$id[ idx_unlisted + s ] = big_paper_clean$id[ id_idx]  
          paper_author_map$author[ idx_unlisted + s ] = authors_unlisted [ idx_unlisted + s]
      
     }
     idx_unlisted = idx_unlisted + num_coauthors
     if( id_idx %% 100 == 0 )
     {
         print(paste0(" idx = ", id_idx))
     }
}
write_csv(paper_author_map, "paper_author_map.csv")
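
The same one-to-many mapping can be sketched without an explicit loop by repeating each paper id once per co-author. This is an alternative under stated assumptions, not the code used to produce paper_author_map.csv; the toy data frame papers_small is hypothetical and only base R is used.

```r
# Hypothetical papers with pipe-delimited author lists
papers_small <- data.frame(
  id      = c("p1", "p2", "p3"),
  authors = c("Ann Lee|Bo Chen", "Bo Chen", "Ann Lee|Cy Diaz|Bo Chen"),
  stringsAsFactors = FALSE
)

# Split each author string on the literal pipe character
author_list <- strsplit(papers_small$authors, "|", fixed = TRUE)

# Repeat each id once per co-author and flatten the author names
map_small <- data.frame(
  id     = rep(papers_small$id, times = lengths(author_list)),
  author = unlist(author_list),
  stringsAsFactors = FALSE
)

# Papers per author, most prolific first
paper_counts <- sort(table(map_small$author), decreasing = TRUE)
print(paper_counts)
```

Because rep() and unlist() are vectorised, this form avoids element-by-element assignment into a tibble, which is the slow part of the loop.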

paper_author_map = read_csv("https://raw.githubusercontent.com/completegraph/DATA607PROJ3/master/Code/paper_author_map.csv")

## Parsed with column specification:
## cols(
##   id = col_character(),
##   author = col_character()
## )

Summary and Conclusions

by_author <- group_by( paper_author_map , author )
rankings <- summarize( by_author, numPapers = n() ) %>% arrange( desc( numPapers))
knitr::kable(head(rankings, 30) , caption = "Top 30 Authors by Data Science Paper Counts (2009-2018)")

Top 30 Authors by Data Science Paper Counts (2009-2018)
author	numPapers
Yoshua Bengio	174
Sergey Levine	105
Pieter Abbeel	100
Michael I. Jordan	98
Uwe Aickelin	97
Chunhua Shen	89
Francis Bach	80
Kyunghyun Cho	78
Toby Walsh	78
Zoubin Ghahramani	74
Masashi Sugiyama	73
Eric P. Xing	72
Shie Mannor	72
Damien Chablat	69
Max Welling	69
Ruslan Salakhutdinov	69
Aaron Courville	65
Andreas Krause	62
Trevor Darrell	62
Lawrence Carin	61
Chris Dyer	59
Mita Nasipuri	59
Nathan Srebro	59
Roland Siegwart	58
Tong Zhang	56
Yann LeCun	56
Nando de Freitas	54
Bernhard Schölkopf	53
Marcus Hutter	52
Martin J. Wainwright	52

summary( rankings)

##     author            numPapers    
##  Length:67625       Min.   :  1.0  
##  Class :character   1st Qu.:  1.0  
##  Mode  :character   Median :  1.0  
##                     Mean   :  2.2  
##                     3rd Qu.:  2.0  
##                     Max.   :174.0

We conclude that the most prolific data scientist by paper count is Yoshua Bengio with 174 papers. He is noted for his expertise in deep learning along with Geoffrey Hinton and Yann LeCun. By comparison, other thought leaders mentioned earlier, such as Kira Radinsky, have written only 4 papers. We also see that the mean number of papers per author is 2.2, with a median of 1 paper. Thus, the distribution of papers per author is highly skewed to the right.
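
A small worked example may help show why a mean well above the median indicates right skew. The counts below are hypothetical and much smaller than the real data set; they only mimic its shape, with many one-paper authors and one very prolific author.

```r
# Hypothetical papers-per-author counts
papers_per_author <- c(1, 1, 1, 1, 1, 2, 2, 3, 6, 40)

mean_papers   <- mean(papers_per_author)      # pulled upward by the single large value
median_papers <- median(papers_per_author)    # unaffected by how large the maximum is
share_single  <- mean(papers_per_author == 1) # fraction of authors with exactly one paper

print(c(mean = mean_papers, median = median_papers, share_single = share_single))
```

Replacing the largest value with an even larger one would raise the mean further and leave the median unchanged, which is the pattern seen in the author rankings.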

We conclude that thought leadership within the field of academic research does not equate to business thought leadership. However, without conceptual innovations made possible by academia, the application of these ideas to business is impossible.

4. Geographic Trends

Introduction and Analysis

Data science covers a wide range of topics, including deep learning, IoT, AI and others. It combines data inference, analysis, algorithm development and technology to address multifaceted business problems. With the growing popularity of data science and new technological developments, its applications and uses are expanding significantly over time. The following trends in this field are expected to continue in the coming year as well.

Patents

Overview

Universities contribute significantly to AI research in specific fields, with Chinese universities leading. Many AI patents include inventions that can be applied across industries: telecommunications, transportation, medical sciences, personal devices, computing and human-computer interaction (HCI) feature prominently among them.

Data Import & Cleansing

# Load the raw patent counts (one row per year, one column per country)
patentsURL <- "https://raw.githubusercontent.com/DataScienceAR/Cuny-Assignments/master/Data-607/Data-Sets/Patents.csv"
patents_Raw <- fread(patentsURL, fill = TRUE, header = TRUE, stringsAsFactors = FALSE)
names(patents_Raw)[1] <- "Year"
patents_Raw$Year <- as.character(patents_Raw$Year)
# Reshape from wide (one column per country) to long (Year, Countries, PatentCount)
patents_Data <- patents_Raw %>% gather(Countries, PatentCount, -Year)
# Treat missing counts as zero
patents_Data$PatentCount[is.na(patents_Data$PatentCount)] <- 0
# Parse the year string as a date, then keep only the numeric year
patents_Data$Year <- as.Date(patents_Data$Year, format = "%Y")
patents_Data$Year <- year(patents_Data$Year)
# Total patents per country, largest first
patents_Data_byCountry <- patents_Data %>% group_by(Countries) %>% summarise(TotalPatents = sum(PatentCount)) %>% arrange(desc(TotalPatents))
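The wide-to-long reshape and per-country totals above can be sketched on toy data. This is a minimal illustration, not the report's pipeline: the counts are made up, and it uses `pivot_longer()`, which tidyr offers as the newer alternative to `gather()`.

```r
library(tidyr)
library(dplyr)

# Hypothetical counts, for illustration only (not from Patents.csv)
wide <- data.frame(Year = c("2004", "2005"),
                   US = c(10, 12),
                   Japan = c(7, NA),
                   stringsAsFactors = FALSE)

# One row per (Year, Countries) pair; missing counts become zero
long <- wide %>%
  pivot_longer(cols = -Year, names_to = "Countries", values_to = "PatentCount") %>%
  mutate(PatentCount = replace_na(PatentCount, 0))

# Total patents per country, largest first
totals <- long %>%
  group_by(Countries) %>%
  summarise(TotalPatents = sum(PatentCount)) %>%
  arrange(desc(TotalPatents))

print(totals)
```

The long format is what makes `group_by(Countries)` possible, since the country names become values in a column rather than column headers.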

Analysis & Observation

# Bar chart showing the Count of patents by Country
ggplot(patents_Data_byCountry,aes(x=reorder(Countries,TotalPatents),y=TotalPatents))+
  geom_col(fill="tomato2",color="black")+
  xlab("Countries")+
  ylab("Total Patents")+
  labs(title = "Total AI Patents by Country, 2004-2014")+
  theme(axis.text.x = element_text(angle=65, vjust=0.6))

"The US is the leading country in total AI patents, followed by Japan and China."

Robotics

Summary and Analysis

Robotics and Data Science

With the advance in data science, the field of robotics has improved to a great extent. Data science, AI, and robotics have a largely symbiotic relationship. Each enhances the other to power innovative machines and technologies that are making our lives more convenient than ever. The collaboration between data science, AI, and ML has given us things like self-driving cars, smart assistants, robo-surgeons and nurses, and much more.

Data Import & Cleansing

# Load the robotic installation counts
RoboticsURL <- "https://raw.githubusercontent.com/DataScienceAR/Cuny-Assignments/master/Data-607/Data-Sets/Robotic%20Installations.csv"
Robotics_Raw <- fread(RoboticsURL, fill = TRUE, header = TRUE, stringsAsFactors = FALSE)
# Keep only rows 29 through the end of the file
Robotics_DS <- Robotics_Raw[29:nrow(Robotics_Raw),]
# Reshape from wide (one column per year) to long (Countries, Year, count)
Robotics_Data <- Robotics_DS %>% gather(Year, No_of_Robotic_Installations, -Countries)

Analysis & Observation

ggplot(Robotics_Data, aes(x=Year, y=No_of_Robotic_Installations, group=Countries, color=Countries)) +
  geom_line(size=2) + geom_point()+
  scale_color_brewer(palette="Paired")+
  xlab("Years")+
  ylab("# of Robotic Installations")+
  labs(title = "Robotic Installations Regionally by Year")+
  theme_minimal()

"From the line chart, it is evident that China leads in robotics installations, followed by Europe, Japan, and North America. Installations are increasing every year."

AI Publications

Overview

Artificial Intelligence and Data Science

According to the Harvard Business Review, “companies with strong basic analytics - such as sales data and market trends - make breakthroughs in complex and critical areas after layering in artificial intelligence.” AI innovations like those aren’t possible without the right data and specialized data science staff who know how to use it.

In 2014, about 30% of AI patents originated in the U.S., followed by South Korea and Japan, which each hold 16% of AI patents. Of the top inventor regions, South Korea and Taiwan have experienced the most growth, with the number of AI patents in 2014 nearly 5x that in 2004.

The Relative Activity Index (RAI) is defined as the share of a country’s publication output in AI relative to the global share of publications in AI. A value of 1.0 indicates that a country’s research activity in AI corresponds exactly with the global activity in AI. A value higher than 1.0 implies a greater emphasis, while a value lower than 1.0 suggests a lesser focus.
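The RAI definition can be written as a small function. This is a sketch under stated assumptions: the function name `relative_activity_index` and all counts are hypothetical and chosen only to illustrate the ratio; they are not taken from the report's data.

```r
# RAI: a region's AI share of its own publication output,
# divided by the world's AI share of all publication output.
relative_activity_index <- function(region_ai, region_total, world_ai, world_total) {
  (region_ai / region_total) / (world_ai / world_total)
}

# Hypothetical counts, for illustration only:
# 3% of the region's papers are in AI, versus 2% worldwide.
rai <- relative_activity_index(region_ai = 300, region_total = 10000,
                               world_ai = 2000, world_total = 100000)
print(rai)
```

With these made-up numbers the index is above 1.0, which under the definition means the region puts more emphasis on AI than the world as a whole.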

Data Import & Cleansing

RABR<-read.csv("https://raw.githubusercontent.com/DataScienceAR/Cuny-Assignments/master/Data-607/Data-Sets/Research_activity_by_region.csv",stringsAsFactors = FALSE)
names(RABR)<-str_replace_all(names(RABR),"X","")
RABR

##          Region 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008
## 1         China   8%   6%   9%  10%  12%  14%  16%  18%  20%  19%  27%
## 2 United States  28%  27%  25%  23%  24%  23%  22%  22%  19%  18%  15%
## 3        Europe  35%  35%  35%  36%  34%  35%  33%  32%  33%  33%  30%
##   2009 2010 2011 2012 2013 2014 2015 2016 2017
## 1  29%  27%  25%  24%  24%  24%  23%  24%  25%
## 2  14%  14%  15%  15%  15%  15%  18%  17%  17%
## 3  31%  31%  31%  31%  32%  31%  32%  30%  28%

RABR<-gather(RABR,Year,Percent_of_Publications,-Region)
RABR$Percent_of_Publications<- percent(RABR$Percent_of_Publications,digits = 0)
RABR_2017<-RABR %>% filter(Year %in% c(2017,2016,2015))

Analysis & Observation

ggplot(RABR_2017,aes(x=reorder(Region,Percent_of_Publications),y=Percent_of_Publications,fill=Year))+
  geom_col(position="dodge",color="black")+
  xlab("Region")+
  ylab("% of AI publications")+
  scale_fill_brewer(palette = "Dark2")+
  labs(title = "Percent of AI papers by Region: 2015 to 2017")+
  theme(axis.text.x = element_text(angle=65, vjust=0.6))

"Europe leads in share of AI publications in every year from 2015 to 2017, followed by China and the US. China’s share is trending upward year over year, although it remains in second place."

By Research Area

Overview

The graphs below show the number of Scopus papers affiliated with government, corporate, and medical organizations. In 2017, the Chinese government produced nearly 4x the number of AI papers produced by Chinese corporations. China has also experienced a 400% increase in government-affiliated AI papers since 2007, while corporate AI papers increased by only 73% in the same period. In the U.S., a relatively large proportion of total AI papers are corporate. In 2017, the proportion of corporate AI papers in the U.S. was 6.6x the proportion of corporate AI papers in China, and 4.1x that of Europe.

Data Import & Cleansing

RSector<-read.csv("https://raw.githubusercontent.com/DataScienceAR/Cuny-Assignments/master/Data-607/Data-Sets/Research_focus_by_Region_in_AI.csv",stringsAsFactors = FALSE)
RSector_Stg1<-gather(RSector,Country,Relative_Activity_Index,-Research.Sector)
RSector_Stg1$Country<-factor(RSector_Stg1$Country)
RSector_Stg1$Research.Sector<-factor(RSector_Stg1$Research.Sector)
RSector_Stg1<-RSector_Stg1 %>% group_by(Country) %>% mutate(label_y=cumsum(Relative_Activity_Index)-0.2*Relative_Activity_Index)

Analysis & Observation

library(treemap)
treemap(RSector_Stg1,index = c("Country","Research.Sector"),
        vSize = "Relative_Activity_Index",
        algorithm = "pivotSize",
        #lowerbound.cex.labels=1.6,
        title="Research Focus by Region in AI",
        fontsize.labels = c(15,10),
        align.labels = list(c("centre","centre"),c("left","top")))

ggplot(RSector_Stg1, aes(x = Country, y = Relative_Activity_Index,
                         fill = Research.Sector)) +
  geom_col() +
    xlab("Region")+
  ylab("Relative Activity Index")+
  labs(title = "Research Focus by Region in AI")+
  theme(axis.text.x = element_text(angle=65, vjust=0.6))

" 
1. The United States leads overall in AI research but lags behind Europe and China in the agricultural sector.
2. The United States leads China and Europe in the humanities and in the medical and health sector of AI.
3. China holds the top position in engineering and technology.

"

Summary and Conclusion

The United States is the leading country in total AI patents, followed by Japan and China.
China leads in robotics installations, followed by Europe, Japan, and North America. Installations are increasing every year.
Europe leads in share of AI publications in every year from 2015 to 2017, followed by China and the US. China’s share is trending upward year over year, although it remains in second place.
The United States leads overall in AI research but lags behind Europe and China in the agricultural AI sector.
The United States leads China and Europe in the humanities and in the medical and health sector of AI. China holds the top position in engineering and technology.

5. Trends in Topics of interest

Overview

We analyze the trends in data science topics over the last decade using data from the arXiv paper repository for 2009-2018. The data suggest that all data science topics have experienced significant, nearly exponential growth. However, if we examine the topics more closely, some areas have become hotter while others have diminished on a relative basis.

Data Import and Cleansing

Using the arXiv research website for Data

To explore these trends, we gathered data from the arXiv research website. The arXiv website hosts a popular and longstanding forum for academic research in mathematics, physics, statistics and computer science. Researchers submit papers electronically, and the papers are catalogued in the arXiv database.

We obtain detailed paper submission metadata through the R package aRxiv, which provides an interface to the arXiv API. The API allows us to query papers based on useful criteria such as:

submission date
authors
subject classification (self-described by authors)
title of the papers

The range of dates and subject classifications below are native arXiv categories. Our tags are the same as those chosen by the AIIndex.org 2018 paper in its data collection methodology (AIIndex 2018).

# A list of categories and years
# -----------------------------------------------------------------------------
ds_categories = c("stat.ML", "cs.AI", "cs.CV", "cs.LG", "cs.RO", "cs.CL", "cs.NE")
ds_descriptions = c("Stat Machine Learning", "Artificial Intelligence", "Computer Vision",
                    "Computer Learning", "Robotics", "Computation and Language" ,
                      "Neural and Evolutionary Computing")
subject_names = data.frame( categories = ds_categories ,
                            desc = ds_descriptions, stringsAsFactors = FALSE)
years_list = c( 2009:2018 )

# Set up an empty dataframe of years range for row and data science
# topics for columns.   Values will store paper counts in arXiv by year and topic.
# ----------------------------------------------------------------------------------
info = data.frame (matrix( data = 0, 
                                   nrow = length(years_list), 
                                   ncol = length(ds_categories) ,
                                   dimnames = list( as.character( years_list ), ds_categories ) ) )

Collecting the paper counts

The following section downloads the paper counts by topic and year. Note that this step is time-consuming, and the arXiv server may restrict access if it judges that the queries make excessive or abusive use of its resources.

As a result, we save the results to a local disk file. To confirm that the code works and the download succeeds, set eval to TRUE in the following code chunk. Otherwise, for visualization graphics (or final project assembly), this step should be skipped: the downloaded data can be read from a file in the next code chunk. The data file is posted online.

#
#  Query the arXiv server for paper counts:  
#  Outer loop is on subjects
#  Inner loop is on years.
# -------------------------------------------------------------------------
for(subject in ds_categories )
{
  for( y in years_list )
  {
  
    
      qry = paste0("cat:", subject, " AND submittedDate:[", 
                  y, 
                  " TO ", 
                  y+1, "]")
      
      qry_count = arxiv_count(qry)
      info[as.character(y), subject] = qry_count
      cat(qry, " ", qry_count, "\n")
      
      # Sleeping is essential to throttle API load on the arXiv server.
      # ------------------------------------------------------------------
      Sys.sleep(3)
  }
}
#  Write the contents to files to avoid re-running the above code during
#  final project assembly
# ------------------------------------------------------------------------
write.csv(info, file="Arxiv_topic_counts.csv", row.names = TRUE)
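The download-once, read-from-file pattern described above can be sketched as a small helper. This is an illustration under stated assumptions: `load_or_fetch` and `fetch_counts` are hypothetical names, and `fetch_counts` is a stand-in that returns made-up numbers so the sketch runs without contacting the arXiv server. In the real workflow the fetch step would be the throttled query loop.

```r
# Hypothetical stand-in for the throttled arXiv query loop
fetch_counts <- function() {
  data.frame(Period = 2009:2010, stat.ML = c(1, 2))
}

# Read the cached file if it exists; otherwise fetch and save the result
load_or_fetch <- function(path, fetch_fn) {
  if (file.exists(path)) {
    return(read.csv(path))
  }
  counts <- fetch_fn()
  write.csv(counts, file = path, row.names = FALSE)
  counts
}

cache_file <- file.path(tempdir(), "arxiv_counts_demo.csv")
first  <- load_or_fetch(cache_file, fetch_counts)  # fetches unless the file already exists
second <- load_or_fetch(cache_file, fetch_counts)  # reads the cached file
print(second)
```

Keeping the slow step behind a file check means the report can be re-knit repeatedly without sending new queries.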

And reload the paper counts here.

subject_year_counts = as_tibble( read.csv("https://raw.githubusercontent.com/completegraph/DATA607PROJ3/master/Code/Arxiv_topic_counts.csv") )

Data Wrangling

Some minor data wrangling is needed to extract the marginal sums and fractions of annual production by topic. This is illustrated in the next code chunk.

Note that for each year in the Period column, the paper count runs from Jan 1 of that year to Dec 31 of the same year.

# Set the names to make algebraic notation less cumbersome
# --------------------------------------------------------------------------------
names(subject_year_counts) = c("Period", as.character(subject_names$categories) )
# Calculate and store row sums
# -----------------------------------------------------------------
subject_year_counts %>% 
  group_by(Period) %>% 
  mutate( sum = stat.ML + cs.AI + cs.CV + cs.LG + cs.RO + cs.CL + cs.NE) %>% 
  mutate( stat.ML.pct  = stat.ML / sum , 
          cs.AI.pct    = cs.AI   / sum ,
          cs.CV.pct    = cs.CV   / sum ,  
          cs.LG.pct    = cs.LG   / sum ,
          cs.RO.pct    = cs.RO   / sum ,
          cs.CL.pct    = cs.CL   / sum ,
          cs.NE.pct    = cs.NE   / sum 
          ) -> subject_year_counts
#
#  Plot the change in percent importance of different topics over the last 10 years
# ----------------------------------------------------------------------------------
subject_year_counts %>% select( Period, stat.ML.pct:cs.NE.pct) %>%
       gather(key="Subject", value = "fraction", stat.ML.pct:cs.NE.pct) -> pct_data
ggplot( pct_data , aes(x=Period, y = fraction, fill= Subject ) ) + 
  geom_bar(stat="identity", position="fill") +
  scale_fill_brewer(palette="Set2") +
  scale_x_continuous(breaks = 2009:2018) +
  ggtitle("Percent of Data Science Papers by Topic on arXiv from 2009-2018")

# Show only 2009 and 2018 statistics and merge with longer descriptions
# Then display data by year-as-column to focus on changes
# ---------------------------------------------------------------------------------
pct_data %>% filter( Period == 2009 | Period == 2018 ) %>%
    mutate( topicCode = str_sub( Subject, start= 1, end = -5), fraction = 100 * fraction) %>%
    inner_join( subject_names, by = c("topicCode" = "categories") ) %>%
    select( Period, fraction, desc ) %>% 
    spread( key = Period, value = fraction ) -> table_to_show
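The per-topic fractions computed above can also be obtained without naming each column. The sketch below uses base R and made-up counts for three topics; it illustrates the row-share calculation and is not a replacement for the chunk above.

```r
# Hypothetical counts, for illustration only
counts <- data.frame(Period = c(2009, 2018),
                     cs.AI = c(30, 10),
                     cs.CV = c(10, 40),
                     cs.LG = c(60, 50))

# Every column except Period is a topic
topic_cols <- setdiff(names(counts), "Period")

# Divide each topic column by the row total to get that year's share
shares <- counts[topic_cols] / rowSums(counts[topic_cols])
names(shares) <- paste0(topic_cols, ".pct")

result <- cbind(counts["Period"], shares)
print(result)
```

Each row of shares sums to 1, which is a quick sanity check on the marginal sums.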

The table below clearly shows significant changes in relative interest over a decade.

Pure AI has decreased in relative importance from 31.9 to 11.1 percent.
Computer Learning, Machine Learning, and Computer Vision together have grown from 45.1 to 71.3 percent of papers.
Robotics has remained static at about 5 percent, while Neural and Evolutionary Computing has declined from 9.9 to 3.1 percent. Both are relatively minor topics.

knitr::kable(table_to_show, digits = 1 , 
             caption = "Percent Share of Articles by Topic" ) %>%
       kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Percent Share of Articles by Topic
desc	2009	2018
Artificial Intelligence	31.9	11.1
Computation and Language	7.6	9.7
Computer Learning	19.2	27.2
Computer Vision	12.3	22.3
Neural and Evolutionary Computing	9.9	3.1
Robotics	5.4	4.8
Stat Machine Learning	13.6	21.8

Trends in the Total Volume of Research

The evidence below shows exponential growth in data science research. Statistical machine learning and computer vision are driving the bulk of this work.

#
# Plot the change in absolute papers submitted
# -----------------------------------------------------------------
subject_year_counts %>% select( Period, stat.ML:cs.NE) %>%
  gather(key="Subject", value = "Count", stat.ML:cs.NE) -> abs_data
ggplot( abs_data , aes(x=Period, y = Count, fill= Subject ) ) + 
  geom_area() +
  scale_fill_brewer(palette="Set2") +
  scale_x_continuous(breaks = 2009:2018) +
  ggtitle("Count of Data Science Papers by Topic on arXiv from 2009-2018") +
  theme(legend.position= c(.1, .9 ),
        legend.justification = c("left", "top"))

Explosive Growth of Research

knitr::kable(subject_year_counts %>% select( Period, sum ) %>% spread( key = Period, value = sum ) , 
             caption = "Total Data Science Articles on Arxiv by Year" ) %>%
       kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Total Data Science Articles on Arxiv by Year
2009	2010	2011	2012	2013	2014	2015	2016	2017	2018
1197	1679	2460	4819	6032	6815	9528	15124	22492	38433

A simple calculation from the above table allows us to conclude that the volume of research (by article count) has grown 47 percent annually. This explosive growth has resulted in a 32-fold increase in research over the most recent decade. Whether the quality matches the quantity is another issue. But this is compelling supporting evidence that artificial intelligence is revolutionizing academic research and thought leadership.
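The calculation behind these figures can be sketched as follows, assuming compound annual growth over the nine year-to-year steps from 2009 to 2018. The counts are copied from the table above; the variable names are illustrative.

```r
# Total article counts by year, copied from the table above
totals <- c("2009" = 1197, "2010" = 1679, "2011" = 2460, "2012" = 4819,
            "2013" = 6032, "2014" = 6815, "2015" = 9528, "2016" = 15124,
            "2017" = 22492, "2018" = 38433)

# Number of year-to-year steps between the first and last year
n_steps <- length(totals) - 1

# Overall fold increase and the implied compound annual growth rate
fold_increase <- totals[["2018"]] / totals[["2009"]]
annual_growth <- fold_increase^(1 / n_steps) - 1

cat(sprintf("Fold increase: %.1f\n", fold_increase))
cat(sprintf("Compound annual growth: %.0f%%\n", 100 * annual_growth))
```

This treats growth as a constant rate between the endpoints; the year-by-year rates in the table vary around that figure.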

Summary and Conclusion

Data science research has grown exponentially. Statistical machine learning and computer vision are driving the bulk of this work.
Pure AI has decreased in relative importance from 31.9 to 11.1 percent.
Computer Learning, Machine Learning, and Computer Vision together have grown from 45.1 to 71.3 percent of papers.
Robotics has remained static at about 5 percent, while Neural and Evolutionary Computing has declined from 9.9 to 3.1 percent. Both are relatively minor topics.
The volume of research (by article count) has grown 47 percent annually. This explosive growth has resulted in a 32-fold increase in research over the most recent decade. Whether the quality matches the quantity is another issue. But this is compelling supporting evidence that artificial intelligence is revolutionizing academic research and thought leadership.

6. Academia: Top Ranked Universities in AI, ML or DS

Introduction and Analysis

The section below loads the required packages and input data files containing the following details of research papers in Artificial Intelligence (AI), Data Science (DS), Machine Learning (ML), and Visualization (VI): names of the faculty members who authored these papers, the universities they’re affiliated with, the conferences they presented at, and the year of publication. The conferences selected were restricted to those in the above fields only.

fileurl<-"https://raw.githubusercontent.com/completegraph/DATA607PROJ3/master/Code/author-info-ds.csv"
ds.authors<-read.csv(file=fileurl, header=TRUE, na.strings = "NA", stringsAsFactors = TRUE)
#ds.authors<-fread(fileurl, header=TRUE, na.strings = "NA", stringsAsFactors = FALSE)
#ds.authors

The section below filters the data for the 2 main columns of interest: University and Adjusted Count. The adjusted count is a score that measures contribution by authors based on joint ownership with other authors. The detailed methodology is available at http://csrankings.org/#/index?all
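To make the idea of an adjusted count concrete, the sketch below assumes that each paper carries one unit of credit, split equally among its authors, and that a university's score is the sum of its authors' shares. The equal split is an assumption made for illustration (the linked methodology is the authoritative description), and the papers, authors and universities are hypothetical.

```r
# Hypothetical data: one row per (paper, author) pair
papers <- data.frame(
  paper_id   = c("p1", "p1", "p2", "p3", "p3", "p3"),
  author     = c("A", "B", "A", "B", "C", "D"),
  university = c("U1", "U2", "U1", "U2", "U2", "U3"),
  stringsAsFactors = FALSE
)

# Assumption: each paper is worth 1, shared equally by its authors
authors_per_paper <- ave(seq_along(papers$paper_id), papers$paper_id, FUN = length)
papers$adjustedcount <- 1 / authors_per_paper

# Sum the shares by university and sort from highest to lowest
by_univ <- aggregate(adjustedcount ~ university, data = papers, FUN = sum)
by_univ <- by_univ[order(-by_univ$adjustedcount), ]
print(by_univ)
```

Under this assumption the scores add up to the number of papers, so no paper is counted more than once regardless of how many co-authors it has.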

The analysis below aims to determine which university across the globe can be deemed a thought leader in the Data Science, Artificial Intelligence, Machine Learning and Visualization areas, based on the research papers contributed by its faculty members.

The following section selects the required columns and calculates an aggregate of the adjusted count by university. Then it renames the columns and derives the top 10 universities based on this adjusted count metric.

paper.count<-ds.authors%>%select(university,adjustedcount)
summary.paper.count<-aggregate(. ~ university, data = paper.count, sum)%>%setorder(-adjustedcount)
summary.paper.count$adjustedcount<-round(summary.paper.count$adjustedcount,2)
colnames(summary.paper.count)<-c("University", "Research_Adj_Count")
#summary.paper.count
top10<-summary.paper.count[1:10,]
top10

##                                University Research_Adj_Count
## 26             Carnegie Mellon University             515.84
## 254   University of California - Berkeley             250.29
## 352                   University of Tokyo             239.10
## 65        Georgia Institute of Technology             204.46
## 122 Massachusetts Institute of Technology             200.75
## 338     University of Southern California             198.27
## 328            University of Pennsylvania             194.29
## 242                 University of Alberta             173.05
## 212                             TU Munich             172.86
## 38                     Cornell University             167.04

The following section generates a bar plot showing the top 10 universities by adjusted count of research papers. It shows that Carnegie Mellon sits at the top of the stack by a big margin, so it can be considered the prime thought leader from an institutional perspective.

As a topic for further research, it is notable that a big name like Stanford University is missing from the top 10 universities. We suspect that this could be on account of factors like departmental affiliations of faculty members and their choices on whether to present their research at pure Data Science type conferences vis-a-vis other conferences geared towards other domains such as Statistics or Economics. Also, it’s likely that if the focus is extended to all of Computer Science instead of a narrower selection of AI, DS, ML etc, then universities like Stanford may show a more significant presence while perhaps the universities in the top 10 are more exclusively focusing on AI, ML and DS research.

library(ggplot2)
library(RColorBrewer)
ggplot(top10, aes(x=reorder(University,Research_Adj_Count), y=Research_Adj_Count, fill=University))+ geom_bar(stat="identity",color="black") + coord_flip() + theme(legend.position='none') + ylab("Adjusted Research Paper Count") + xlab("Universities as Thought Leaders")

Trends in Data Science sub-topics as evidenced by change in research paper counts over the years

From the data and graph below, it can be seen that Machine Learning, Neural Networks and Computer Vision show very rapid growth in research publications.

fileurl3<-"https://raw.githubusercontent.com/completegraph/DATA607PROJ3/master/Code/AIPapersByTopic.csv"
ai.subtopics<-read.csv(fileurl3)
colnames(ai.subtopics)<-c("Year", "Machine Learning", "Neural Networks", "Computer Vision", "Search optimization", "NLP", "Fuzzy Systems", "Decision Making", "Total")
ai.subtopics

##    Year Machine Learning Neural Networks Computer Vision
## 1  1998             4319            4680            3460
## 2  1999             4826            5569            3868
## 3  2000             4636            5383            4087
## 4  2001             5191            5659            4430
## 5  2002             6189            6320            5366
## 6  2003             7325            6819            6186
## 7  2004            10256            8590            8132
## 8  2005            11743            9520            9274
## 9  2006            13068           10487           10670
## 10 2007            15453           11594           12849
## 11 2008            19786           14454           16942
## 12 2009            16699           14270           14041
## 13 2010            17122           14302           14298
## 14 2011            16962           13853           14193
## 15 2012            17657           13601           13451
## 16 2013            18243           14325           13629
## 17 2014            21263           15851           15877
## 18 2015            25289           17683           18854
## 19 2016            28785           22301           21330
## 20 2017            34461           29584           25006
##    Search optimization  NLP Fuzzy Systems Decision Making  Total
## 1                 1396 2087          2362             771  19075
## 2                 1597 2142          2605             872  21479
## 3                 1765 2424          2471            1096  21862
## 4                 2099 2597          2923            1492  24391
## 5                 2427 3018          2894            1833  28047
## 6                 2885 3780          3344            2531  32870
## 7                 3711 4976          4189            3239  43093
## 8                 4094 5715          4161            3911  48418
## 9                 4847 6361          4587            3901  53921
## 10                6039 6717          5144            4162  61958
## 11                7598 8381          6717            4436  78314
## 12                7193 7776          7160            4186  71325
## 13                7595 8117          7002            3787  72223
## 14                7379 7124          6162            4454  70127
## 15                7026 7302          5310            3293  67640
## 16                7255 6635          5330            3388  68805
## 17                7423 7816          5210            3278  76718
## 18                8195 9339          5597            3825  88782
## 19                8816 9810          6390            3881 101313
## 20                9333 9099          6290            3892 117665

ai.subtopics.long<-gather(ai.subtopics, key="Sub_Topic", value=Paper_Count, 2:8)
ggplot(ai.subtopics.long, aes(x=Year, y=Paper_Count, group=Sub_Topic,colour=Sub_Topic)) + geom_line()+xlab("Years")+ylab("Paper Count")
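As a rough check on the growth visible in the chart, the sketch below computes how many times larger the 2017 paper count is than the 1998 count for the three fastest-growing sub-topics. The counts are copied from the first and last rows of the table above; the data frame and column names are illustrative.

```r
# 1998 and 2017 paper counts, copied from the table above
growth <- data.frame(
  Sub_Topic  = c("Machine Learning", "Neural Networks", "Computer Vision"),
  count_1998 = c(4319, 4680, 3460),
  count_2017 = c(34461, 29584, 25006),
  stringsAsFactors = FALSE
)

# Ratio of the 2017 count to the 1998 count, rounded to one decimal
growth$multiple <- round(growth$count_2017 / growth$count_1998, 1)
print(growth)
```

All three sub-topics grew roughly six- to eight-fold over the period, with Machine Learning showing the largest multiple.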

Percentage of AI and ML course enrollments in US Universities at the Undergraduate Level

fileurl2<-"https://raw.githubusercontent.com/completegraph/DATA607PROJ3/master/Code/USAI-MLUndergradEnrolmentPercentage.csv"
undergrad<-read.csv(fileurl2)
colnames(undergrad)<-c("University", "Domain", "2010", "2011", "2012", "2013", "2014", "2015", "2016", "2017")
#undergrad
undergrad.long<-gather(undergrad, key="Year", value=Percent_Share, 3:10)
undergrad.long$UnivDomain=paste(undergrad.long$University, undergrad.long$Domain, sep="-")
undergrad.long$Percent_Share=round(undergrad.long$Percent_Share*100,2)
undergrad.long

##    University Domain Year Percent_Share  UnivDomain
## 1    Berkeley     AI 2010             2 Berkeley-AI
## 2    Stanford     AI 2010             3 Stanford-AI
## 3        UIUC     AI 2010             1     UIUC-AI
## 4          UW     AI 2010             0       UW-AI
## 5    Berkeley     ML 2010             0 Berkeley-ML
## 6    Stanford     ML 2010             5 Stanford-ML
## 7        UIUC     ML 2010             0     UIUC-ML
## 8          UW     ML 2010             0       UW-ML
## 9    Berkeley     AI 2011             2 Berkeley-AI
## 10   Stanford     AI 2011             3 Stanford-AI
## 11       UIUC     AI 2011             0     UIUC-AI
## 12         UW     AI 2011             0       UW-AI
## 13   Berkeley     ML 2011             0 Berkeley-ML
## 14   Stanford     ML 2011             5 Stanford-ML
## 15       UIUC     ML 2011             0     UIUC-ML
## 16         UW     ML 2011             0       UW-ML
## 17   Berkeley     AI 2012             3 Berkeley-AI
## 18   Stanford     AI 2012             3 Stanford-AI
## 19       UIUC     AI 2012             1     UIUC-AI
## 20         UW     AI 2012             0       UW-AI
## 21   Berkeley     ML 2012             0 Berkeley-ML
## 22   Stanford     ML 2012             8 Stanford-ML
## 23       UIUC     ML 2012             0     UIUC-ML
## 24         UW     ML 2012             0       UW-ML
## 25   Berkeley     AI 2013             3 Berkeley-AI
## 26   Stanford     AI 2013             3 Stanford-AI
## 27       UIUC     AI 2013             1     UIUC-AI
## 28         UW     AI 2013             1       UW-AI
## 29   Berkeley     ML 2013             1 Berkeley-ML
## 30   Stanford     ML 2013            10 Stanford-ML
## 31       UIUC     ML 2013             0     UIUC-ML
## 32         UW     ML 2013             0       UW-ML
## 33   Berkeley     AI 2014             4 Berkeley-AI
## 34   Stanford     AI 2014             5 Stanford-AI
## 35       UIUC     AI 2014             1     UIUC-AI
## 36         UW     AI 2014             1       UW-AI
## 37   Berkeley     ML 2014             1 Berkeley-ML
## 38   Stanford     ML 2014            11 Stanford-ML
## 39       UIUC     ML 2014             1     UIUC-ML
## 40         UW     ML 2014             1       UW-ML
## 41   Berkeley     AI 2015             4 Berkeley-AI
## 42   Stanford     AI 2015             8 Stanford-AI
## 43       UIUC     AI 2015             1     UIUC-AI
## 44         UW     AI 2015             1       UW-AI
## 45   Berkeley     ML 2015             3 Berkeley-ML
## 46   Stanford     ML 2015            13 Stanford-ML
## 47       UIUC     ML 2015             0     UIUC-ML
## 48         UW     ML 2015             1       UW-ML
## 49   Berkeley     AI 2016             3 Berkeley-AI
## 50   Stanford     AI 2016             9 Stanford-AI
## 51       UIUC     AI 2016             1     UIUC-AI
## 52         UW     AI 2016             1       UW-AI
## 53   Berkeley     ML 2016             3 Berkeley-ML
## 54   Stanford     ML 2016             9 Stanford-ML
## 55       UIUC     ML 2016             1     UIUC-ML
## 56         UW     ML 2016             1       UW-ML
## 57   Berkeley     AI 2017             4 Berkeley-AI
## 58   Stanford     AI 2017            12 Stanford-AI
## 59       UIUC     AI 2017             3     UIUC-AI
## 60         UW     AI 2017             2       UW-AI
## 61   Berkeley     ML 2017             2 Berkeley-ML
## 62   Stanford     ML 2017            13 Stanford-ML
## 63       UIUC     ML 2017             2     UIUC-ML
## 64         UW     ML 2017             1       UW-ML

The following graph shows undergraduate enrolment in AI and ML courses at selected US universities over the 2010-2017 period. Enrolment has been trending up over the past few years at these universities, which may be indicative of the trend across US universities.

library(ggplot2)
ggplot(undergrad.long, aes(x=Year, y=Percent_Share, group=UnivDomain,colour=UnivDomain)) + geom_line()+xlab("Years")+ylab("Percent of Total")

Regions and Countries as Thought Leaders

The following section collects the inputs: regional percentage share of Artificial Intelligence publications over the 1998-2017 period. The data is loaded and tidied from a wide format to a long format, setting it up for further analysis.

fileurl<-"https://raw.githubusercontent.com/completegraph/DATA607PROJ3/master/Code/RegionalShareofAIPublications.csv"
regional.ai<-read.csv(fileurl)
colnames(regional.ai)<-c("Region","1998","1999","2000","2001","2002","2003","2004","2005","2006","2007","2008","2009","2010","2011", "2012","2013","2014","2015","2016","2017")
#regional.ai
regional.ai.long<-gather(regional.ai, key="Year", value=Percentage_Share, 2:21)
regional.ai.long

##           Region Year Percentage_Share
## 1          China 1998             0.08
## 2  United States 1998             0.28
## 3         Europe 1998             0.35
## 4  Rest of World 1998             0.28
## 5          China 1999             0.06
## 6  United States 1999             0.27
## 7         Europe 1999             0.35
## 8  Rest of World 1999             0.32
## 9          China 2000             0.09
## 10 United States 2000             0.25
## 11        Europe 2000             0.35
## 12 Rest of World 2000             0.31
## 13         China 2001             0.10
## 14 United States 2001             0.23
## 15        Europe 2001             0.36
## 16 Rest of World 2001             0.31
## 17         China 2002             0.12
## 18 United States 2002             0.24
## 19        Europe 2002             0.34
## 20 Rest of World 2002             0.30
## 21         China 2003             0.14
## 22 United States 2003             0.23
## 23        Europe 2003             0.35
## 24 Rest of World 2003             0.28
## 25         China 2004             0.16
## 26 United States 2004             0.22
## 27        Europe 2004             0.33
## 28 Rest of World 2004             0.28
## 29         China 2005             0.18
## 30 United States 2005             0.22
## 31        Europe 2005             0.32
## 32 Rest of World 2005             0.27
## 33         China 2006             0.20
## 34 United States 2006             0.19
## 35        Europe 2006             0.33
## 36 Rest of World 2006             0.28
## 37         China 2007             0.19
## 38 United States 2007             0.18
## 39        Europe 2007             0.33
## 40 Rest of World 2007             0.30
## 41         China 2008             0.27
## 42 United States 2008             0.15
## 43        Europe 2008             0.30
## 44 Rest of World 2008             0.28
## 45         China 2009             0.29
## 46 United States 2009             0.14
## 47        Europe 2009             0.31
## 48 Rest of World 2009             0.26
## 49         China 2010             0.27
## 50 United States 2010             0.14
## 51        Europe 2010             0.31
## 52 Rest of World 2010             0.28
## 53         China 2011             0.25
## 54 United States 2011             0.15
## 55        Europe 2011             0.31
## 56 Rest of World 2011             0.29
## 57         China 2012             0.24
## 58 United States 2012             0.15
## 59        Europe 2012             0.31
## 60 Rest of World 2012             0.30
## 61         China 2013             0.24
## 62 United States 2013             0.15
## 63        Europe 2013             0.32
## 64 Rest of World 2013             0.29
## 65         China 2014             0.24
## 66 United States 2014             0.15
## 67        Europe 2014             0.31
## 68 Rest of World 2014             0.29
## 69         China 2015             0.23
## 70 United States 2015             0.18
## 71        Europe 2015             0.32
## 72 Rest of World 2015             0.28
## 73         China 2016             0.24
## 74 United States 2016             0.17
## 75        Europe 2016             0.30
## 76 Rest of World 2016             0.30
## 77         China 2017             0.25
## 78 United States 2017             0.17
## 79        Europe 2017             0.28
## 80 Rest of World 2017             0.30

Countries as Thought Leaders

This section shows a regional breakdown of AI papers indexed on Scopus for the years 1998-2017. The source of this data is Elsevier. The broad regional categories are the USA, Europe, China and Rest of World (RoW), and the shares are expressed as fractions of the yearly total. Europe held the largest share of publications in every year through 2015 (0.35 in 1998, 0.32 in 2015), followed closely by RoW, which drew level with Europe in 2016 and moved ahead in 2017 (0.30 versus 0.28). China increased its share from 0.08 in 1998 to 0.25 in 2017, peaking at 0.29 in 2009, while the US share fell from 0.28 to 0.17. Based on this metric, Europe can be considered the Thought Leader from a regional perspective over the period as a whole.

# Stacked bar chart of each region's share of AI publications by year
ggplot(regional.ai.long, aes(x = Year, y = Percentage_Share, fill = Region)) +
  geom_bar(stat = "identity", colour = "black") +
  guides(fill = guide_legend(reverse = TRUE)) +
  scale_fill_brewer(palette = "Pastel1") +
  theme(text = element_text(size = 11),
        axis.text.x = element_text(angle = 90, hjust = 1))
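
The regional ranking described in this section can also be checked numerically rather than read off the chart. The following is a minimal sketch in base R, not part of the original analysis. It assumes a hand-copied subset of the printed shares (1998 and 2015-2017), stored in a hypothetical data frame called `shares` that has the same columns as `regional.ai.long`. It finds the leading region in each year, reporting ties, and the change in each region's share between the first and last year.

```r
# Hypothetical subset of the shares printed in the output of this section;
# the column names match regional.ai.long.
shares <- data.frame(
  Region = rep(c("China", "United States", "Europe", "Rest of World"), times = 4),
  Year = rep(c("1998", "2015", "2016", "2017"), each = 4),
  Percentage_Share = c(0.08, 0.28, 0.35, 0.28,   # 1998
                       0.23, 0.18, 0.32, 0.28,   # 2015
                       0.24, 0.17, 0.30, 0.30,   # 2016
                       0.25, 0.17, 0.28, 0.30),  # 2017
  stringsAsFactors = FALSE
)

# Leading region(s) per year; tied regions are joined with " / "
leaders <- sapply(split(shares, shares$Year), function(d) {
  top <- d$Region[d$Percentage_Share == max(d$Percentage_Share)]
  paste(top, collapse = " / ")
})

# Change in share from 1998 to 2017 for each region
# (rows are in the same region order within each year)
share_1998 <- shares[shares$Year == "1998", ]
share_2017 <- shares[shares$Year == "2017", ]
change <- setNames(
  round(share_2017$Percentage_Share - share_1998$Percentage_Share, 2),
  share_1998$Region
)

print(leaders)
print(change)
```

Running the same two steps on the full `regional.ai.long` table instead of `shares` would cover every year from 1998 to 2017.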

Summary and Conclusion

Machine Learning, Neural Networks and Computer Vision are showing very rapid growth in research publications.
Undergraduate enrolment in AI and ML courses at selected US universities trended up over the 2010-2017 period. These universities can be seen as representative of US universities more broadly.
Europe was the leading contributor to papers and publications in this domain for most of the 1998-2017 period, followed closely by RoW, which moved ahead in 2017. China has been steadily increasing its share of research publications in this area. Based on this metric, Europe can be considered the Thought Leader from a regional perspective.
Carnegie Mellon sits at the top of the stack by a big margin, so it can be considered the prime thought leader from an institutional perspective. A big name like Stanford University is missing from the top 10 universities. We suspect that this could be on account of factors like the departmental affiliations of faculty members and their choices on whether to present their research at pure Data Science conferences rather than at conferences geared towards other domains such as Statistics or Economics. It is also likely that if the focus were extended to all of Computer Science instead of a narrower selection of AI, DS, ML etc., universities like Stanford would show a more significant presence, while the universities in the top 10 may be focusing more exclusively on AI, ML and DS research.

7. Conclusion

Based on popular narrative, Andrew Ng and Kira Radinsky can be considered two of the primary thought leaders in Data Science. Over the past few years, they have shifted their focus from AI in general to more specialised topics such as deep learning, predictive analytics, and speech recognition.

We conclude that the most influential data scientist by paper count is Yoshua Bengio, with 174 papers. He is noted for his expertise in deep learning, along with Geoffrey Hinton and Yann LeCun. We also conclude that thought leadership within academic research does not equate to business thought leadership. However, without the conceptual innovations made possible by academia, the application of these ideas to business would be impossible.

Regarding sub-topics being researched, Computer / Machine Learning and Computer Vision have grown in popularity at the cost of pure AI, while Neural Computing and Robotics have remained largely stable.

When it comes to countries and regions, the USA leads in terms of patents in DS/AI/ML, while Europe leads in terms of generating research publications. China is catching up rapidly: for example, Tsinghua University led the way in terms of research papers in this domain during 2018. Universities in the USA are showing a steady increase in enrolment in AI/ML courses, and Carnegie Mellon continues to lead the way in research in this domain.

We relied on a variety of sources for the data used in our analysis, such as the AI Index Report for 2018, the CS Rankings, the arXiv and ACM websites, as well as personal websites of individuals such as Andrew Ng. In terms of future research, we think a good area to focus on would be understanding what is driving the changes in topics of interest among popular data science professionals as well as in academia.

Project -3 Data 607

Tyrannosaurus (Alexander Ng, Arun Reddy, Henry Otuadinma, Jagdish Chhabria)

March 27, 2019
