This assignment uses the following packages for data analysis and visualization.
library("dplyr")
library("RCurl")
library("XML")
library("xml2")
library("jsonlite")
library("ggplot2")
library("DT")
library("kableExtra")
library("data.table")
library("tidyr")
library("lubridate")
#library("XLConnectJars")
#library("XLConnect")
library("stringr")
library("formattable")
library("aRxiv")
library("tidyverse")
library("rvest")
library("textrank")
library("lattice")
library("igraph")
library("ggraph")
library("wordcloud")
library("curl")
library("treemap")
options(scipen = 999)
Alexander Ng, Arun Reddy, Henry Otuadinma, Jagdish Chhabria
Our team consists of four members. We collaborated with each other using Webex, Slack, and GitHub.
Readers should review both tabs of each section; otherwise, the paper may be confusing.
DATA SCIENCE THOUGHT LEADERSHIP
Project Objective: The objective of this project is to answer the following questions: 1) Who are today’s “thought leaders” in data science? 2) What are the topics that data scientists care most about? 3) How do these change over time, and across geographical location?
Let’s start by defining the terms: Thought Leader and Data Science.
A thought leader is an individual, organization or nation state that is recognized as an authority in a specialized field and whose expertise is sought and often rewarded. Thought leaders are trusted sources who move and inspire people with innovative ideas, turn ideas into reality, and know and show how to replicate their success. They are commonly asked to speak at public events, conferences or webinars to share their insight with a relevant audience. The Oxford English Dictionary gives as its first citation for the phrase an 1887 description of Henry Ward Beecher as “one of the great thought-leaders in America.”
Data science (DS) is a multi-disciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from structured and unstructured data. Data science encompasses the fields of data mining and big data. It is a “concept to unify statistics, data analysis, machine learning and their related methods” in order to “understand and analyze actual phenomena” with data, employing techniques and theories drawn from many fields within mathematics, statistics, information science, and computer science. Data Science is closely related to fields such as Machine Learning (ML), Artificial Intelligence (AI) and Computer Visualization.
There are some inherent contradictions associated with thought leadership. Given that thinking is supposed to be an individual activity, in which one relies on one’s own logic and intelligence to form an understanding, hold opinions and make choices, it is paradoxical to anoint external thoughts as the “leading” or best way to think about a subject. That said, the real difficulty arises when trying to decide what makes someone a thought leader in any given field. This issue is further compounded when the field in question is a rapidly changing, complex domain such as Data Science. There is no readily available metric (for example, a Nobel Prize in Data Science) that can be referred to when determining thought leadership.
Given this, we’ve adopted the following approach and structure for this project:
This report is organized by the type of entity considered as a thought leader. We begin by analyzing the role of individuals: one section considers individuals based on popular acclaim, while the following section considers tangible research metrics such as publication statistics. Next, we examine universities, possibly the preeminent type of organization for this kind of study. This is followed by our evaluation of nations and geographical regions. Then we focus on trends in research topics, looking at which subtopics of data science have waxed and waned. The final section concludes.
We also attempt to show how one could go about determining the sub-topics that the individual thought leaders under 1.a) are most interested in, by taking a deeper dive into scraping and parsing material available on the web.
Our methods & research relied on the gathering the information from Research publications, trends in the topic of interests, Geographic trends, academic metrics, popular narratives. Most of the methods and analysis are based solely on either analysis based on Social media or University research which might not give the complete 360 degrees. For instance, social media analysis can tell who’s the famous person and who has the most following but it doesn’t say that the person or firm in question is thought leader. Nicola Tesla, who was a pioneer in his field has no social media access but his work touched human existence.
We looked up top influencers/thought leaders in data science and this yielded a lot of people but we narrowed them down to the top 10 we think have huge influence in different areas of research and interests in data science.
For deeper insight, we chose to focus on two of them: Andrew Ng and Kira Radinsky. We could retrieve Andrew Ng’s publications from arXiv, while web-scraping was carried out on Association for Computing Machinery website for Kira Radinsky’s publications, which yielded useful information for studies. We extracted the abstracts from their publications to see what topics and areas interest them.**
We curated a list of the top 10 thought leaders and wrote the list to a .csv file.
thoughtleaders <- read.csv('https://raw.githubusercontent.com/henryvalentine/MSDS2019/master/Classes/DATA%20607/Projects/Project3/Data_Science_Thought_Leaders/thought_leaders.csv', header = TRUE)
datatable(thoughtleaders, colnames= c("Name", "Occupation", "Link"), class = 'cell-border stripe', options = list(
initComplete = JS(
"function(settings, json) {",
"$(this.api().table().header()).css({'background-color': '#1f77b4', 'color': '#fff', 'text-align': 'center !important'});",
"$(this.api().table().body()).css({'color': '#000', 'text-align': 'center !important'});",
"}")
))
We generated tags that we feel feature often in the topics that interest them and saved these in a .csv file. These will help in filtering appropriate keywords from their publications.
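For illustration, the tag list could have been created and saved along these lines; this is a sketch only, and the terms shown are examples (the full curated list lives in keyword_tags.csv).
# Illustrative sketch only -- the actual curated tags are in keyword_tags.csv
tags <- c("artificial intelligence", "machine learning", "deep learning",
          "predictive analytics", "speech recognition", "computer vision")
write.csv(tags, "keyword_tags.csv", row.names = FALSE)  # the column reads back as "x"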
#read the keyword tags from csv
tag_ex <- read.csv('https://raw.githubusercontent.com/henryvalentine/MSDS2019/master/Classes/DATA%20607/Projects/Project3/Data_Science_Thought_Leaders/keyword_tags.csv', header = TRUE)
tag_ex <- as.character(tag_ex$x)
I. Andrew Y. Ng
His publications were sourced from the arXiv API.
#These queries returned different results
aNgArxiv = arxiv_search('au: "Andrew Ng"')
aNgArxiv1 = arxiv_search('au: "Andrew Y. Ng"')
Combine Andrew Ng’s data:
aNgDf <- rbind(aNgArxiv, aNgArxiv1)
# Removing the first row because the paper was later withdrawn for corrections
aNgDf = aNgDf[-1,]
row.names(aNgDf) <- NULL
submitted = str_extract(aNgDf$submitted, '\\d+')
anNg <- aNgDf %>% select(title, authors)
anNg['submitted'] <- submitted
datatable(head(anNg), colnames= c("Title", "Author(s)", "Date"), class = 'cell-border stripe', options = list(
initComplete = JS(
"function(settings, json) {",
"$(this.api().table().header()).css({'background-color': '#1f77b4', 'color': '#fff', 'text-align': 'center !important'});",
"$(this.api().table().body()).css({'color': '#000', 'text-align': 'center !important'});",
"}")
))
ta <- vector()
abT <- vector()
ayn <- vector()
for(i in 1: nrow(aNgDf))
{
row <- aNgDf[i,]
  # Clean whitespace, lower-case the abstract, then extract any matching keyword tags
  k <- row$abstract %>% str_replace_all('\n', ' ')%>%str_replace_all('\t', ' ')%>%str_replace_all('\r', '')%>%str_trim(side='both')%>%tolower()%>% str_extract_all(tag_ex)%>%unlist()
for(j in 1: length(k))
{
ta <- c(ta, row$title)
ayn <- c(ayn, as.numeric(str_extract(row$submitted, "\\d+")))
abT <- c(abT, k[j])
}
}
df <- data.frame(title=ta, year=ayn, keyword=abT)
Write to csv:
write.csv(df, "an_Ng.csv", row.names=FALSE)
#read from csv
an_df <- read.csv('https://raw.githubusercontent.com/henryvalentine/MSDS2019/master/Classes/DATA%20607/Projects/Project3/Data_Science_Thought_Leaders/an_Ng.csv', header = TRUE)
aNkeywords <- an_df %>% select(year, keyword) %>% group_by(keyword, year) %>% mutate(frequency = n()) %>% unique()
Sort keywords in descending order:
aNkw <- aNkeywords[order(-aNkeywords$frequency),, drop=FALSE]
datatable(head(aNkw), colnames= c("Year", "Keyword", "Frequency"), class = 'cell-border stripe', options = list(
initComplete = JS(
"function(settings, json) {",
"$(this.api().table().header()).css({'background-color': '#1f77b4', 'color': '#fff', 'text-align': 'center !important'});",
"$(this.api().table().body()).css({'color': '#000', 'text-align': 'center !important'});",
"}")
))
Top 20 keywords in Andrew Ng’s publications over the years
dplt <- ggplot(data=head(aNkw, 20), aes(x = year, y=frequency, fill = keyword)) +
geom_bar(position="fill", stat = "identity") +
  ggtitle("Top keywords in Andrew Ng's publications over the years") +
xlab("Year")+
theme(plot.title = element_text(lineheight = .8, face = "bold"))
dplt + theme(legend.position="right")
Top 20 keywords in Andrew Ng’s publications without considering the years
topKeyW <-as.data.frame(table(abT))
names(topKeyW)<-c("keyword","frequency")
dplt <- ggplot(data=head(topKeyW, 20), aes(x = reorder(keyword, frequency), y=frequency, fill = "steelblue")) +
geom_bar(stat = "identity") +
xlab("Keyword")+
  ylab("Frequency")+
  ggtitle("Andrew Ng's top keywords without the years") +
theme(plot.title = element_text(lineheight = .8, face = "bold")) +
theme(axis.text.x = element_text(angle = 90, vjust = .5, size = 9))+ coord_flip()
dplt + theme(legend.position="none")
II. Kira Radinsky
Her publications were sourced by searching for her name on the Association for Computing Machinery website.
# Hand-picked links according to relevance
kradlinks <- c('citation.cfm?id=2491802', 'citation.cfm?id=2187918', 'citation.cfm?id=2493181', 'citation.cfm?id=2491802', 'citation.cfm?id=2187918', 'citation.cfm?id=2493181', 'citation.cfm?id=2433500', 'citation.cfm?id=2433448', 'citation.cfm?id=1963455', 'citation.cfm?id=2187958', 'citation.cfm?id=3192292', 'citation.cfm?id=2433431', 'citation.cfm?id=1487070', 'citation.cfm?id=3219882', 'citation.cfm?id=2348364', 'citation.cfm?id=3096469', 'citation.cfm?id=1935850', 'citation.cfm?id=2422275')
kradTitles <- vector()
kradAbstracts <- vector()
kradYears <- vector()
Make a search on http://dl.acm.org and pull links:
khtms <- tryCatch(html_nodes(read_html(curl('https://dl.acm.org/results.cfm?within=owners.owner%3DHOSTED&srt=_score&query=Kira+Radinsky&Go.x=0&Go.y=0', handle = new_handle("useragent" = "Mozilla/5.0"))), 'div.details'),
                   error = function(e){list(result = NA, error = e)})
The above search returned many links, but they need to be filtered to get the relevant ones.
for(i in 1: length(khtms))
{
href <- html_attr(html_nodes(khtms[i], 'div.title a'), 'href')
if(href %in% kradlinks)
{
kradTitles <- c(kradTitles, khtms[i]%>%html_nodes('div.title a')%>% html_text()%>% str_replace_all('\n', '')%>%str_replace_all('\t', '')%>%str_replace_all('\r', '')%>%str_trim(side='both')%>%tolower())
kradYears <- c(kradYears, khtms[i]%>%html_nodes('span.publicationDate')%>% html_text()%>% str_replace_all('\n', '')%>%str_replace_all('\t', '')%>%str_replace_all('\r', '')%>%str_trim(side='both')%>%tolower())
r <- html_node(read_html(curl(paste('https://dl.acm.org/', href, '&preflayout=flat', sep=''), handle = new_handle("useragent" = "Mozilla/5.0"))), 'div.flatbody')
paragraphs <- html_nodes(r, 'p')
pTexts <- NULL
for(j in 1: length(paragraphs))
{
pText <- paragraphs[j]%>% html_text()%>% str_replace_all('\n', ' ')%>%str_replace_all('\t', ' ')%>%str_replace_all('\r', '')%>% str_replace_all('\"', '')%>%str_trim(side='both')%>%tolower()
      # Accumulate the paragraph texts for this paper
      pTexts <- paste(pTexts, pText, collapse=",")
}
    kradAbstracts <- c(kradAbstracts, pTexts)
Sys.sleep(10)
}
}
tt <- vector()
aa <- vector()
yy <- vector()
for(i in 1: length(kradAbstracts))
{
k <- kradAbstracts[i] %>% str_replace_all('\n', ' ')%>%str_replace_all('\t', ' ')%>%str_replace_all('\r', '')%>%str_trim(side='both')%>%tolower()%>% str_extract_all(tag_ex)%>%unlist()
for(j in 1: length(k))
{
tt <- c(tt, kradTitles[i])
yy <- c(yy, as.numeric(str_extract(kradYears[i], "\\d+")))
aa <- c(aa, k[j])
}
}
dfk<- data.frame(title=tt, year=yy, keyword=aa)
Write to csv:
write.csv(dfk, "kira_radinsky.csv", row.names=FALSE)
Write all keywords to .csv:
write.csv(aa, "kr_keywords.csv", row.names=FALSE)
Read from .csv:
#read from csv
kira_df <- read.csv('https://raw.githubusercontent.com/henryvalentine/MSDS2019/master/Classes/DATA%20607/Projects/Project3/Data_Science_Thought_Leaders/kira_radinsky.csv', header = TRUE)
#read from csv
allkw <- read.csv('https://raw.githubusercontent.com/henryvalentine/MSDS2019/master/Classes/DATA%20607/Projects/Project3/Data_Science_Thought_Leaders/kr_keywords.csv', header = TRUE)
allkw <- as.character(allkw$x)
datatable(head(kira_df), colnames= c("Title", "Year", "Keyword"), class = 'cell-border stripe', options = list(
initComplete = JS(
"function(settings, json) {",
"$(this.api().table().header()).css({'background-color': '#1f77b4', 'color': '#fff', 'text-align': 'center !important'});",
"$(this.api().table().body()).css({'color': '#000', 'text-align': 'center !important'});",
"}")
))
kkeywords <- kira_df %>% select(year, keyword) %>% group_by(keyword, year) %>% mutate(frequency = n()) %>% unique()
Sort keywords in descending order:
kw <- kkeywords[order(-kkeywords$frequency),, drop=FALSE]
datatable(head(kw), colnames= c("Year", "Keyword", "Frequency"), class = 'cell-border stripe', options = list(
initComplete = JS(
"function(settings, json) {",
"$(this.api().table().header()).css({'background-color': '#1f77b4', 'color': '#fff', 'text-align': 'center !important'});",
"$(this.api().table().body()).css({'color': '#000', 'text-align': 'center !important'});",
"}")
))
kw1 <- subset(kw, year != '2014' & year != '2015' & year != '2016')
dplt <- ggplot(data=head(kw1, 20), aes(x = year, y=frequency, fill = keyword)) +
geom_bar(position="fill", stat = "identity") +
  ggtitle("Top keywords in Kira Radinsky's publications over the years") +
xlab("Year")+
theme(plot.title = element_text(lineheight = .8, face = "bold"))
dplt + theme(legend.position="right")
Top 20 keywords in Kira Radinsky’s publications without considering the years
kTopics <-as.data.frame(table(allkw))
names(kTopics) <- c('keyword', 'frequency')
dplt <- ggplot(data=head(kTopics, 20), aes(x = reorder(keyword, frequency), y=frequency, fill = "steelblue")) +
geom_bar(stat = "identity") +
xlab("Keyword")+
  ylab("Frequency")+
  ggtitle("Kira Radinsky's top keywords without the years") +
theme(plot.title = element_text(lineheight = .8, face = "bold")) +
theme(axis.text.x = element_text(angle = 90, vjust = .5, size = 9))+ coord_flip()
dplt + theme(legend.position="none")
From the above representations, it is clear that both thought leaders focused on AI early on but later shifted toward more specific topics of specialisation such as deep learning, predictive analytics, and speech recognition. Overall, they have written about AI more than any other topic, because the early stages of their careers were focused on AI alone; over time they began working on several areas of interest simultaneously.
These two individuals were chosen because we observed that all the leaders followed a similar trend: they start with one broad area of interest and then, as time goes by, focus on more than one specialised topic.
We believe that a drilled-down look at the interests of the other thought leaders would yield similar results.
The sources for this study were their publications/papers, which do not reflect their complete interest areas in their entirety. A more robust approach would also involve mining appropriate keywords from their tweets, blogs, interviews, and keynotes delivered at conferences.
The list we curated is based on evidence of the influence and dedicated activity these people have put towards data science; someone else could reasonably arrive at a different list.
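The same tag-matching approach would extend naturally to those other sources. A minimal sketch, assuming a hypothetical character vector other_texts of scraped blog or transcript text:
# Sketch: reuse the tag_ex keyword list against any other text source.
# `other_texts` is hypothetical example data, not scraped content.
other_texts <- c("a keynote on deep learning for speech recognition",
                 "a blog post about predictive analytics in healthcare")
other_keywords <- other_texts %>%
  tolower() %>%
  str_extract_all(paste(tag_ex, collapse = "|")) %>%
  unlist()
table(other_keywords)  # keyword frequencies across the new sources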
The arXiv website is one of the top electronic paper repositories for academic research in multiple fields: computer science, physics, mathematics and statistics. Researchers submit papers electronically, and the papers are catalogued in the arXiv database. By analyzing the level of activity of a researcher in submitting papers on data science topics to arXiv, we get an objective, quantifiable, and relevant measure of their thought leadership.
In the next section, we describe the data collection process, its limitations and outputs. After showing how the data and raw files are processed, we wrangle the consolidated data into usable form. Then we present rankings of the top leaders and descriptive statistics of the papers, and conclude with some interpretative remarks.
We obtain detailed research paper submission data through the R package aRxiv, which retrieves metadata information. The API allows us to query papers by useful criteria such as submission date ranges and subject classifications.
The date ranges and subject classifications used below are native arXiv categories. Our 7 subject classification categories are the same as those chosen by the AIIndex.org 2018 report in its data collection methodology (AIIndex 2018). Because we discuss the arXiv API at length in the section of this project on time trends in research, we give only a brief summary here.
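For instance, a count-only query for a single category and year looks like this (the category and dates here are illustrative):
# Illustrative query: count cs.AI papers submitted during 2017
arxiv_count('cat:cs.AI AND submittedDate:[2017 TO 2018]')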
In this section, we identify the specific steps relevant to author paper counts.
To obtain the authors and titles of papers, we needed to download the full record of each paper submission. This raw data took about an hour to download through a series of trial-and-error batch scripts because of two issues: per-query result limits and server throttling.
By defining granular queries, we limited most API requests to under 5000 records and successfully gathered all paper records.
This yielded 70 raw files by year and category. We combined them into a single big file in two steps: first we aggregated all years into one file per category, then all category files into a single big file.
The big file had 7 columns:
* id (unique identifier of the article)
* submitted (date/time of submission)
* updated (date/time of last revision submitted)
* title (name of the paper)
* authors (a pipe-delimited list of co-authors of the paper)
* primary_category (used for the query)
* categories (pipe-delimited list of alternate categories)
The most important step in producing a single consolidated records file was eliminating the one unnecessary field: the abstract. Each paper’s record includes its abstract, and for most papers the abstract represents 90 percent of the record size. Given the large file size, this simplification was needed so that all records fit into memory on our PC.
The final result is a flat file with 57193 records called output_all_subjects.csv.
The code to download the required data below was described in the previous section. Because the arXiv server API may produce variable results or throttle access, we show but do not run the code block below. This is controlled by setting eval=FALSE in the relevant code chunks.
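In the .Rmd source this is just a chunk option; the header of the download chunk reads roughly as follows (the chunk label is illustrative):
```{r download_arxiv_records, eval=FALSE}
# ... download code shown below ...
```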
##### A list of categories and years
##### -----------------------------------------------------------------------------
ds_categories = c("stat.ML", "cs.AI" , "cs.CV", "cs.LG", "cs.RO", "cs.CL", "cs.NE")
ds_descriptions = c("Stat Machine Learning", "Artificial Intelligence" , "Computer Vision",
"Computer Learning", "Robotics", "Computation and Language" ,
"Neural and Evolutionary Computing")
subject_names = data.frame( categories = ds_categories ,
desc = ds_descriptions, stringsAsFactors = FALSE)
years_list = c( 2009:2018 )
##### Set up an empty dataframe of years range for row and data science
##### topics for columns. Values will store paper counts in arXiv by year and topic.
##### ----------------------------------------------------------------------------------
info = data.frame (matrix( data = 0,
nrow = length(years_list),
ncol = length(ds_categories) ,
dimnames = list( as.character( years_list ), ds_categories ) ) )
##### Query the arXiv server for paper counts:
##### Outer loop is on subjects
##### Inner loop is on years.
##### -------------------------------------------------------------------------
for(subject in subject_names$categories )
{
for( y in years_list )
{
qry = paste0("cat:", subject, " AND submittedDate:[",
y,
" TO ",
y+1, "]")
qry_count = arxiv_count(qry)
qry_details = arxiv_search(qry, batchsize = 100, limit = 11000, start = 0 )
info[as.character(y), subject] = qry_count
print(paste(qry, " ", qry_count, "\n"))
output_filename = paste0(subject, "_", y, "_", "results.csv")
write.csv(qry_details, file = output_filename)
print(paste0("Wrote file: ", output_filename, Sys.time() ) )
##### Sleeping is essential to throttle API load on the arXiv server.
##### ------------------------------------------------------------------
Sys.sleep(5)
}
}
print("Retrieval of arXiv query records is now completed.")
for(j in seq_along(ds_categories ) )
{
subject = ds_categories[j]
outputdf = list( )
my_files = paste0(subject, "_", years_list, "_", "results.csv")
for( i in seq_along(my_files) )
{
fulldata <- read_csv(file = my_files[i])
print(paste0( "Loaded ", i, " ", my_files[i] ) )
##### Strip out the abstract which takes up most file space.
##### ------------------------------------------------------------------------
fulldata %>% select( id, submitted, updated, title, authors, primary_category, categories) -> tempdata
outputdf[[i]] = tempdata
Sys.sleep(1)
}
##### Write all the year files for one subject to one tibble and then
##### dump to one subject specific file
##### -----------------------------------------
big_data = bind_rows(outputdf)
output_big_subject = paste0("bigdata_", subject, ".csv")
write_csv(big_data, output_big_subject )
print(paste0( "Wrote file ", output_big_subject, " to disk ", Sys.time() ) )
}
my_files = paste0("bigdata_", ds_categories, ".csv")
outputdf = list()
for(j in seq_along(my_files ) )
{
fulldata <- read_csv(file = my_files[j])
print(paste0( "Loaded ", j, " ", my_files[j] ) )
outputdf[[j]] = fulldata
Sys.sleep(1)
}
##### We row-bind the list of dataframes into one big one using
##### a nice one-liner in dplyr. The result is one big tibble.
##### ---------------------------------------------------------------------
big_data = bind_rows(outputdf)
output_all_subjects = "output_all_subjects.csv"
write_csv(big_data, output_all_subjects )
print(paste0( "Wrote file ", output_all_subjects, " to disk ", Sys.time() ) )
The entire analysis in this section depends only on loading the raw files in the next code chunk. We illustrate the content with a few records below.
big_paper_set = read_csv("https://raw.githubusercontent.com/completegraph/DATA607PROJ3/master/Code/output_all_subjects.csv")
## Parsed with column specification:
## cols(
## id = col_character(),
## submitted = col_datetime(format = ""),
## updated = col_datetime(format = ""),
## title = col_character(),
## authors = col_character(),
## primary_category = col_character(),
## categories = col_character()
## )
knitr::kable(head(big_paper_set, 4) ,
caption = "Representative Records from the Paper Records" ) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
| id | submitted | updated | title | authors | primary_category | categories |
|---|---|---|---|---|---|---|
| 0901.0356v1 | 2009-01-05 06:37:01 | 2009-01-05 06:37:01 | Information, Divergence and Risk for Binary Experiments | Mark D. Reid|Robert C. Williamson | stat.ML | stat.ML|math.ST|stat.TH |
| 0901.1365v1 | 2009-01-10 12:12:31 | 2009-01-10 12:12:31 | Differential Privacy with Compression | Shuheng Zhou|Katrina Ligett|Larry Wasserman | stat.ML | stat.ML|math.ST|stat.TH |
| 0901.1504v2 | 2009-01-12 05:02:18 | 2009-10-13 03:01:23 | A D.C. Programming Approach to the Sparse Generalized Eigenvalue Problem | Bharath Sriperumbudur|David Torres|Gert Lanckriet | stat.ML | stat.ML|stat.ME |
| 0901.2044v2 | 2009-01-14 15:34:13 | 2010-10-21 13:30:14 | SPADES and mixture models | Florentina Bunea|Alexandre B. Tsybakov|Marten H. Wegkamp|Adrian Barbu | math.ST | math.ST|stat.ML|stat.TH |
Next, we remove duplicate records in the raw data set. Duplicate records arise because a paper may be classified as matching two or more computer science categories. For example, a paper may fall into Statistical Machine learning (stat.ML) and Computer Vision (cs.CV). This removes roughly 12000 duplicate records.
##### Remove duplicate records while keeping all columns.
##### -----------------------------------------------------------------------------------
big_paper_clean <- ( big_paper_set %>% distinct( id, authors, .keep_all = TRUE))
nrow(big_paper_clean)
## [1] 45150
paper_authors = big_paper_clean$authors
author_names = str_split(paper_authors, "\\|") # separates all the authors
##### The coauthors of a paper are listed consecutively, preceded by all authors
##### of earlier papers.
##### ---------------------------------------------------------------------
authors_unlisted = unlist(author_names)
num_author_paper_tuple = length(authors_unlisted)
##### Index j corresponds to the j-th paper in big_paper_clean
##### Value at index j corresponds to the number of co-authors in paper j
##### ----------------------------------------------------------------------
vec_coauthor_counts = unlist( lapply(author_names, length ) )
paper_author_map = tibble( id = character(num_author_paper_tuple), author = character(num_author_paper_tuple) )
idx_unlisted = 0
The following code chunk maps the papers to authors in a 1-to-many relationship. Because the process is slow (over 10 minutes to generate the mapping), we save the results to a flat file and set eval=FALSE. In the next step, the data is reloaded from file into a dataframe for analysis.
for( id_idx in 1:length(big_paper_clean$id) )
{
num_coauthors = vec_coauthor_counts[id_idx]
for(s in 1:num_coauthors)
{
paper_author_map$id[ idx_unlisted + s ] = big_paper_clean$id[ id_idx]
paper_author_map$author[ idx_unlisted + s ] = authors_unlisted [ idx_unlisted + s]
}
idx_unlisted = idx_unlisted + num_coauthors
if( id_idx %% 100 == 0 )
{
print(paste0(" idx = ", id_idx))
}
}
write_csv(paper_author_map, "paper_author_map.csv")
paper_author_map = read_csv("https://raw.githubusercontent.com/completegraph/DATA607PROJ3/master/Code/paper_author_map.csv")
## Parsed with column specification:
## cols(
## id = col_character(),
## author = col_character()
## )
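As an aside, the same mapping could be produced much faster with a vectorized approach. A sketch (not the code we ran), assuming big_paper_clean is in memory:
# Vectorized alternative: split the pipe-delimited author field into
# one row per (paper, author) pair in a single pass.
library(tidyr)
library(dplyr)
paper_author_map_fast <- big_paper_clean %>%
  select(id, authors) %>%
  separate_rows(authors, sep = "\\|") %>%
  rename(author = authors)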
by_author <- group_by( paper_author_map , author )
rankings <- summarize( by_author, numPapers = n() ) %>% arrange( desc( numPapers))
knitr::kable(head(rankings, 30) , caption = "Top 30 Authors by Data Science Paper Counts (2009-2018)")
| author | numPapers |
|---|---|
| Yoshua Bengio | 174 |
| Sergey Levine | 105 |
| Pieter Abbeel | 100 |
| Michael I. Jordan | 98 |
| Uwe Aickelin | 97 |
| Chunhua Shen | 89 |
| Francis Bach | 80 |
| Kyunghyun Cho | 78 |
| Toby Walsh | 78 |
| Zoubin Ghahramani | 74 |
| Masashi Sugiyama | 73 |
| Eric P. Xing | 72 |
| Shie Mannor | 72 |
| Damien Chablat | 69 |
| Max Welling | 69 |
| Ruslan Salakhutdinov | 69 |
| Aaron Courville | 65 |
| Andreas Krause | 62 |
| Trevor Darrell | 62 |
| Lawrence Carin | 61 |
| Chris Dyer | 59 |
| Mita Nasipuri | 59 |
| Nathan Srebro | 59 |
| Roland Siegwart | 58 |
| Tong Zhang | 56 |
| Yann LeCun | 56 |
| Nando de Freitas | 54 |
| Bernhard Schölkopf | 53 |
| Marcus Hutter | 52 |
| Martin J. Wainwright | 52 |
summary( rankings)
##    author            numPapers
## Length:67625 Min. : 1.0
## Class :character 1st Qu.: 1.0
## Mode :character Median : 1.0
## Mean : 2.2
## 3rd Qu.: 2.0
## Max. :174.0
We conclude that the top influential data scientist by paper count is Yoshua Bengio with 174 papers. He is noted for his expertise in deep learning, along with Geoffrey Hinton and Yann LeCun. By comparison, other thought leaders mentioned earlier, like Kira Radinsky, have written only 4 papers. We also see that the average number of papers written is 2.2, with a median of 1 paper. Thus, the distribution of papers per publishing researcher is highly skewed to the right.
We conclude that thought leadership within the field of academic research does not equate to business thought leadership. However, without conceptual innovations made possible by academia, the application of these ideas to business is impossible.
Data science covers a tremendous range of themes under its umbrella, including deep learning, IoT, AI, and many others. It is an amalgamation of data inference, analysis, algorithmic computation and technology used to tackle multifaceted business problems. With the ever-expanding popularity of data science and new technological and digital developments, the applications and uses of data science are growing significantly over time. The following trends in this field are expected to continue in the coming year as well.
Universities contribute substantially to AI research in specific fields, with Chinese universities leading. Many AI patents cover inventions that can be applied across industries: telecommunications, transportation, medical sciences, personal devices, computing and human-computer interaction (HCI) feature heavily in the related patents.
patentsURL<-("https://raw.githubusercontent.com/DataScienceAR/Cuny-Assignments/master/Data-607/Data-Sets/Patents.csv")
patents_Raw<-fread(patentsURL,fill = TRUE,header = TRUE,stringsAsFactors = FALSE)
names(patents_Raw)[1]<-"Year"
patents_Raw$Year<-as.character(patents_Raw$Year)
patents_Data<- patents_Raw %>% gather(Countries,PatentCount,-Year)
patents_Data$PatentCount[is.na(patents_Data$PatentCount)]<-0
patents_Data$Year<-as.Date(patents_Data$Year,format("%Y"))
patents_Data$Year<-year(patents_Data$Year)
patents_Data_byCountry<-patents_Data %>% group_by(Countries) %>% summarise(TotalPatents=sum(PatentCount)) %>% arrange(desc(TotalPatents))
# Bar chart showing the count of patents by country
ggplot(patents_Data_byCountry,aes(x=reorder(Countries,TotalPatents),y=TotalPatents))+
geom_col(fill="tomato2",color="black")+
xlab("Countries")+
ylab("Total Patents")+
labs(title = "Total AI Patents by Country period: 2004-2014")+
  theme(axis.text.x = element_text(angle=65, vjust=0.6))
The US is the leading country in total AI patents, followed by Japan and China.
Robotics and Data Science
With advances in data science, the field of robotics has improved to a great extent. Data science, AI, and robotics have a largely symbiotic relationship: each enhances the others to power innovative machines and technologies that are making our lives more convenient than ever. The collaboration between data science, AI, and ML has given us self-driving cars, smart assistants, robo-surgeons and nurses, and much more.
RoboticsURL<-"https://raw.githubusercontent.com/DataScienceAR/Cuny-Assignments/master/Data-607/Data-Sets/Robotic%20Installations.csv"
Robotics_Raw <-fread(RoboticsURL,fill = TRUE,header = TRUE,stringsAsFactors = FALSE)
Robotics_DS<-Robotics_Raw[29:nrow(Robotics_Raw),]
Robotics_Data<- Robotics_DS %>% gather(Year,No_of_Robotic_Installations,-Countries)
ggplot(Robotics_Data, aes(x=Year, y=No_of_Robotic_Installations, group=Countries, color=Countries)) +
geom_line(size=2) + geom_point()+
scale_color_brewer(palette="Paired")+
xlab("Years")+
ylab("# of Robotic Installations")+
labs(title = "Robotic Installations Regionally by Year")+
  theme_minimal()
From the line chart it is evident that China leads in robotics installations, followed by Europe, Japan, and North America, and the trend is increasing every year. According to the Harvard Business Review, “companies with strong basic analytics - such as sales data and market trends - make breakthroughs in complex and critical areas after layering in artificial intelligence.” AI innovations like these aren’t possible without the right data and specialized data science staff who know how to use it. In 2014, about 30% of AI patents originated in the U.S., followed by South Korea and Japan, which each hold 16% of AI patents. Of the top inventor regions, South Korea and Taiwan have experienced the most growth, with the number of AI patents in 2014 nearly 5x that in 2004. The Relative Activity Index (RAI) is defined as the share of a country’s publication output in AI relative to the global share of publications in AI. A value of 1.0 indicates that a country’s research activity in AI corresponds exactly with global activity in AI; a value higher than 1.0 implies a greater emphasis, while a value lower than 1.0 suggests a lesser focus.
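To make the definition concrete, one plausible formalization of RAI is sketched below; this reflects our reading of the definition above, and all figures are hypothetical:
# RAI = (country's AI share of its own output) / (world's AI share of output)
# All numbers below are hypothetical, for illustration only.
country_ai  <- 5000;  country_all <- 90000
world_ai    <- 60000; world_all   <- 2400000
rai <- (country_ai / country_all) / (world_ai / world_all)
rai  # > 1 implies a greater-than-average emphasis on AI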
RABR<-read.csv("https://raw.githubusercontent.com/DataScienceAR/Cuny-Assignments/master/Data-607/Data-Sets/Research_activity_by_region.csv",stringsAsFactors = FALSE)
names(RABR)<-str_replace_all(names(RABR),"X","")
RABR
## Region 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008
## 1 China 8% 6% 9% 10% 12% 14% 16% 18% 20% 19% 27%
## 2 United States 28% 27% 25% 23% 24% 23% 22% 22% 19% 18% 15%
## 3 Europe 35% 35% 35% 36% 34% 35% 33% 32% 33% 33% 30%
## 2009 2010 2011 2012 2013 2014 2015 2016 2017
## 1 29% 27% 25% 24% 24% 24% 23% 24% 25%
## 2 14% 14% 15% 15% 15% 15% 18% 17% 17%
## 3 31% 31% 31% 31% 32% 31% 32% 30% 28%
RABR<-gather(RABR,Year,Percent_of_Publications,-Region)
RABR$Percent_of_Publications<- percent(RABR$Percent_of_Publications,digits = 0)
RABR_2017<-RABR %>% filter(Year %in% c(2017,2016,2015))
ggplot(RABR_2017,aes(x=reorder(Region,Percent_of_Publications),y=Percent_of_Publications,fill=Year))+
geom_col(position="dodge",color="black")+
xlab("Region")+
ylab("% of AI publications")+
  scale_fill_brewer(palette = "Dark2")+
labs(title = "Percent of AI papers by Countries: 2015 to 2017")+
  theme(axis.text.x = element_text(angle=65, vjust=0.6))
Europe leads overall in every year from 2015 to 2017 in terms of the percentage of AI publications, followed by China and the US. In China the trend is upward year over year, even though it stands in second place.
The graphs below and on the following page show the number of Scopus papers affiliated with government, corporate, and medical organizations. In 2017, the Chinese government produced nearly 4x the number of AI papers produced by Chinese corporations. China has also experienced a 400% increase in government-affiliated AI papers since 2007, while corporate AI papers increased by only 73% over the same period. In the U.S., a relatively large proportion of total AI papers are corporate. In 2017, the proportion of corporate AI papers in the U.S. was 6.6x the proportion in China, and 4.1x that of Europe.
RSector<-read.csv("https://raw.githubusercontent.com/DataScienceAR/Cuny-Assignments/master/Data-607/Data-Sets/Research_focus_by_Region_in_AI.csv",stringsAsFactors = FALSE)
RSector_Stg1<-gather(RSector,Country,Relative_Activity_Index,-Research.Sector)
RSector_Stg1$Country<-factor(RSector_Stg1$Country)
RSector_Stg1$Research.Sector<-factor(RSector_Stg1$Research.Sector)
RSector_Stg1<-RSector_Stg1 %>% group_by(Country) %>% mutate(label_y=cumsum(Relative_Activity_Index)-0.2*Relative_Activity_Index)library(treemap)
treemap(RSector_Stg1,index = c("Country","Research.Sector"),
vSize = "Relative_Activity_Index",
algorithm = "pivotSize",
#lowerbound.cex.labels=1.6,
title="Research Focus by Region in AI",
fontsize.labels = c(15,10),
        align.labels = list(c("centre","centre"),c("left","top")))
ggplot(RSector_Stg1, aes(x = Country, y = Relative_Activity_Index,
fill = Research.Sector)) +
geom_col() +
xlab("Region")+
ylab("Relative Activity Index")+
labs(title = "Research Focus by Region in AI")+
  theme(axis.text.x = element_text(angle=65, vjust=0.6))
1. The United States leads overall in AI research but lags behind Europe and China in agricultural applications.
2. The United States leads China and Europe in the humanities and in the medical and health sectors of AI.
3. China holds the top position in engineering and technology.
We analyze the trends in data science topics over the last decade using data from the arXiv paper repository from 2009-2018. The data suggest that all topics of data science have experienced significant, nearly exponential growth. However, examining the topics more closely, some areas have become hotter while others have diminished on a relative basis.
Using the arXiv research website for data
To explore these trends, we gathered data from the arXiv research website. arXiv hosts a popular and longstanding forum for academic research in mathematics, physics, statistics and computer science. Researchers submit papers electronically, and the papers are catalogued in the arXiv database.
We obtain detailed research paper submission data through the R package aRxiv, which retrieves metadata information. The API allows us to query papers by useful criteria such as submission date ranges and subject classifications.
The date ranges and subject classifications used below are native arXiv categories. Our tags are the same as those chosen by the AIIndex.org 2018 report in its data collection methodology (AIIndex 2018).
# A list of categories and years
# -----------------------------------------------------------------------------
ds_categories = c("stat.ML", "cs.AI", "cs.CV", "cs.LG", "cs.RO", "cs.CL", "cs.NE")
ds_descriptions = c("Stat Machine Learning", "Artificial Intelligence", "Computer Vision",
"Computer Learning", "Robotics", "Computation and Language" ,
"Neural and Evolutionary Computing")
subject_names = data.frame( categories = ds_categories ,
desc = ds_descriptions, stringsAsFactors = FALSE)
years_list = c( 2009:2018 )
# Set up an empty dataframe of years range for row and data science
# topics for columns. Values will store paper counts in arXiv by year and topic.
# ----------------------------------------------------------------------------------
info = data.frame (matrix( data = 0,
nrow = length(years_list),
ncol = length(ds_categories) ,
                    dimnames = list( as.character( years_list ), ds_categories ) ) )
Collecting the paper counts
The following section downloads the paper counts by topic and year. Note that this step is computationally intensive, and the arXiv server will impose resource restrictions if it judges the queries to be excessive or abusive.
As a result, we save the results to a local disk file. To confirm the code works and to allow downloads, set eval=TRUE on the following code chunk. Otherwise, for visualization graphics (or final project assembly), this step should be skipped and the downloaded data read from file in the next code chunk. The data file is posted online.
#
# Query the arXiv server for paper counts:
# Outer loop is on subjects
# Inner loop is on years.
# -------------------------------------------------------------------------
for(subject in ds_categories )
{
for( y in years_list )
{
qry = paste0("cat:", subject, " AND submittedDate:[",
y,
" TO ",
y+1, "]")
qry_count = arxiv_count(qry)
info[as.character(y), subject] = qry_count
print(paste(qry, " ", qry_count, "\n"))
# Sleeping is essential to throttle API load on the arXiv server.
# ------------------------------------------------------------------
Sys.sleep(3)
}
}
# Write the contents to files to avoid re-running the above code during
# final project assembly
# ------------------------------------------------------------------------
write.csv(info, file="Arxiv_topic_counts.csv", row.names = TRUE)
And reload the paper counts here.
subject_year_counts = as_tibble( read.csv("https://raw.githubusercontent.com/completegraph/DATA607PROJ3/master/Code/Arxiv_topic_counts.csv") )
Data Wrangling
Some minor data wrangling is needed to extract the marginal sums and fractions of annual production by topic. This is illustrated in the next code chunk.
Note that for each year in the Period files, the paper count is from Jan 1 of that year until Dec 31 of the same year.
# Set the names to make algebraic notation less cumbersome
# --------------------------------------------------------------------------------
names(subject_year_counts) = c("Period", as.character(subject_names$categories) )
# Calculate and store row sums
# -----------------------------------------------------------------
subject_year_counts %>%
group_by(Period) %>%
mutate( sum = stat.ML + cs.AI + cs.CV + cs.LG + cs.RO + cs.CL + cs.NE) %>%
mutate( stat.ML.pct = stat.ML / sum ,
cs.AI.pct = cs.AI / sum ,
cs.CV.pct = cs.CV / sum ,
cs.LG.pct = cs.LG / sum ,
cs.RO.pct = cs.RO / sum ,
cs.CL.pct = cs.CL / sum ,
cs.NE.pct = cs.NE / sum
) -> subject_year_counts
#
# Plot the change in percent importance of different topics over the last 10 years
# ----------------------------------------------------------------------------------
subject_year_counts %>% select( Period, stat.ML.pct:cs.NE.pct) %>%
gather(key="Subject", value = "fraction", stat.ML.pct:cs.NE.pct) -> pct_data
ggplot( pct_data , aes(x=Period, y = fraction, fill= Subject ) ) +
geom_bar(stat="identity", position="fill") +
scale_fill_brewer(palette="Set2") +
scale_x_discrete(limits=2009:2018) +
  ggtitle("Percent of Data Science Papers by Topic on arXiv from 2009-2018")
# Show only 2009 and 2018 statistics and merge with longer descriptions
# Then display data by year-as-column to focus on changes
# ---------------------------------------------------------------------------------
pct_data %>% filter( Period == 2009 | Period == 2018 ) %>%
mutate( topicCode = str_sub( Subject, start= 1, end = -5), fraction = 100 * fraction) %>%
inner_join( subject_names, by = c("topicCode" = "categories") ) %>%
select( Period, fraction, desc ) %>%
  spread( key = Period, value = fraction ) -> table_to_show
The table below clearly shows significant changes in relative interest over a decade.
knitr::kable(table_to_show, digit = 1 ,
caption = "Percent Share of Articles by Topic" ) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
| desc | 2009 | 2018 |
|---|---|---|
| Artificial Intelligence | 31.9 | 11.1 |
| Computation and Language | 7.6 | 9.7 |
| Computer Learning | 19.2 | 27.2 |
| Computer Vision | 12.3 | 22.3 |
| Neural and Evolutionary Computing | 9.9 | 3.1 |
| Robotics | 5.4 | 4.8 |
| Stat Machine Learning | 13.6 | 21.8 |
Trends in the Total Volume of Research
The evidence below shows exponential growth in data science research. Statistical machine learning and computer vision are driving the bulk of this work.
#
# Plot the change in absolute papers submitted
# -----------------------------------------------------------------
subject_year_counts %>% select( Period, stat.ML:cs.NE) %>%
gather(key="Subject", value = "Count", stat.ML:cs.NE) -> abs_data
ggplot( abs_data , aes(x=Period, y = Count, fill= Subject ) ) +
geom_area() +
scale_fill_brewer(palette="Set2") +
scale_x_discrete(limits=2009:2018) +
ggtitle("Count of Data Science Papers by Topic on arXiv from 2009-2018") +
theme(legend.position= c(.1, .9 ),
        legend.justification = c("left", "top"))
Explosive Growth of Research
knitr::kable(subject_year_counts %>% select( Period, sum ) %>% spread( key = Period, value = sum ) ,
caption = "Total Data Science Articles on Arxiv by Year" ) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
| 2009 | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 |
|---|---|---|---|---|---|---|---|---|---|
| 1197 | 1679 | 2460 | 4819 | 6032 | 6815 | 9528 | 15124 | 22492 | 38433 |
A simple calculation from the above table allows us to conclude that the volume of research (by article count) has grown roughly 47 percent annually. This explosive growth has produced a 32-fold increase in research output over the most recent decade. Whether the quality matches the quantity is another issue, but this is compelling supporting evidence that artificial intelligence is revolutionizing academic research and thought leadership.
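The growth figures quoted above can be checked directly from the yearly totals in the table:
# Growth check from the table: 1197 papers in 2009, 38433 in 2018
growth_factor <- 38433 / 1197        # about a 32-fold increase
cagr <- growth_factor^(1 / 9) - 1    # nine year-over-year steps
round(100 * cagr)                    # roughly 47 percent per year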
The section below loads the required packages and input data files containing the following details of research papers in Artificial Intelligence (AI), Data Science (DS), Machine Learning (ML), and Visualization (VI): the names of faculty members who authored these papers, the universities they are affiliated with, the conferences they presented at, and the year of publication. The conferences selected were restricted to those in the above fields only.
fileurl<-"https://raw.githubusercontent.com/completegraph/DATA607PROJ3/master/Code/author-info-ds.csv"
ds.authors<-read.csv(file=fileurl, header=TRUE, na.strings = "NA", stringsAsFactors = TRUE)
#ds.authors<-fread(fileurl, header=TRUE, na.strings = "NA", stringsAsFactors = FALSE)
#ds.authors
The section below filters the data to the 2 main columns of interest: University and Adjusted Count. The adjusted count is a score that measures contribution by authors, adjusted for joint authorship with other authors. The detailed methodology is available at http://csrankings.org/#/index?all
The analysis below aims to determine which university across the globe can be deemed a thought leader in the Data Science, Artificial Intelligence, Machine Learning and Visualization areas, based on the contributions of their faculty members through research papers.
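To illustrate the adjusted-count idea (this is our reading of the csrankings methodology; the input file already contains the computed scores), a paper with k co-authors contributes 1/k of a paper to each author:
# Sketch of the adjusted-count idea with hypothetical data:
# each paper credits 1/k to each of its k co-authors.
demo <- data.frame(university = c("A", "A", "B"),
                   coauthors  = c(2, 4, 1))
demo$adjustedcount <- 1 / demo$coauthors
aggregate(adjustedcount ~ university, data = demo, sum)
# university A: 0.50 + 0.25 = 0.75; university B: 1.00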
The following section selects the required columns and calculates an aggregate of the adjusted count by university. Then it renames the columns and derives the top 10 universities based on this adjusted count metric.
paper.count<-ds.authors%>%select(university,adjustedcount)
summary.paper.count<-aggregate(. ~ university, data = paper.count, sum)%>%setorder(-adjustedcount)
summary.paper.count$adjustedcount<-round(summary.paper.count$adjustedcount,2)
colnames(summary.paper.count)<-c("University", "Research_Adj_Count")
#summary.paper.count
top10<-summary.paper.count[1:10,]
top10
##                                University Research_Adj_Count
## 26 Carnegie Mellon University 515.84
## 254 University of California - Berkeley 250.29
## 352 University of Tokyo 239.10
## 65 Georgia Institute of Technology 204.46
## 122 Massachusetts Institute of Technology 200.75
## 338 University of Southern California 198.27
## 328 University of Pennsylvania 194.29
## 242 University of Alberta 173.05
## 212 TU Munich 172.86
## 38 Cornell University 167.04
The following section generates a barplot showing the top 10 universities by adjusted count of research papers. It shows that Carnegie Mellon sits at the top of the stack by a big margin, so it can be considered the prime thought leader from an institutional perspective.
As a topic for further research, it is notable that a big name like Stanford University is missing from the top 10 universities. We suspect that this could be on account of factors like departmental affiliations of faculty members and their choices on whether to present their research at pure Data Science type conferences vis-a-vis other conferences geared towards other domains such as Statistics or Economics. Also, it’s likely that if the focus is extended to all of Computer Science instead of a narrower selection of AI, DS, ML etc, then universities like Stanford may show a more significant presence while perhaps the universities in the top 10 are more exclusively focusing on AI, ML and DS research.
library(ggplot2)
library(RColorBrewer)
ggplot(top10, aes(x=reorder(University,Research_Adj_Count), y=Research_Adj_Count, fill=University))+ geom_bar(stat="identity",color="black") + coord_flip() + theme(legend.position='none') + ylab("Adjusted Research Paper Count") + xlab("Universities as Thought Leaders")
From the data and graph below, it can be seen that Machine Learning, Neural Networks and Computer Vision are showing very rapid growth in research publications.
fileurl3<-"https://raw.githubusercontent.com/completegraph/DATA607PROJ3/master/Code/AIPapersByTopic.csv"
ai.subtopics<-read.csv(fileurl3)
colnames(ai.subtopics)<-c("Year", "Machine Learning", "Neural Networks", "Computer Vision", "Search optimization", "NLP", "Fuzzy Systems", "Decision Making", "Total")
ai.subtopics
##    Year Machine Learning Neural Networks Computer Vision
## 1 1998 4319 4680 3460
## 2 1999 4826 5569 3868
## 3 2000 4636 5383 4087
## 4 2001 5191 5659 4430
## 5 2002 6189 6320 5366
## 6 2003 7325 6819 6186
## 7 2004 10256 8590 8132
## 8 2005 11743 9520 9274
## 9 2006 13068 10487 10670
## 10 2007 15453 11594 12849
## 11 2008 19786 14454 16942
## 12 2009 16699 14270 14041
## 13 2010 17122 14302 14298
## 14 2011 16962 13853 14193
## 15 2012 17657 13601 13451
## 16 2013 18243 14325 13629
## 17 2014 21263 15851 15877
## 18 2015 25289 17683 18854
## 19 2016 28785 22301 21330
## 20 2017 34461 29584 25006
## Search optimization NLP Fuzzy Systems Decision Making Total
## 1 1396 2087 2362 771 19075
## 2 1597 2142 2605 872 21479
## 3 1765 2424 2471 1096 21862
## 4 2099 2597 2923 1492 24391
## 5 2427 3018 2894 1833 28047
## 6 2885 3780 3344 2531 32870
## 7 3711 4976 4189 3239 43093
## 8 4094 5715 4161 3911 48418
## 9 4847 6361 4587 3901 53921
## 10 6039 6717 5144 4162 61958
## 11 7598 8381 6717 4436 78314
## 12 7193 7776 7160 4186 71325
## 13 7595 8117 7002 3787 72223
## 14 7379 7124 6162 4454 70127
## 15 7026 7302 5310 3293 67640
## 16 7255 6635 5330 3388 68805
## 17 7423 7816 5210 3278 76718
## 18 8195 9339 5597 3825 88782
## 19 8816 9810 6390 3881 101313
## 20 9333 9099 6290 3892 117665
ai.subtopics.long<-gather(ai.subtopics, key="Sub_Topic", value=Paper_Count, 2:8)
ggplot(ai.subtopics.long, aes(x=Year, y=Paper_Count, group=Sub_Topic,colour=Sub_Topic)) + geom_line()+xlab("Years")+ylab("Paper Count")
fileurl2<-"https://raw.githubusercontent.com/completegraph/DATA607PROJ3/master/Code/USAI-MLUndergradEnrolmentPercentage.csv"
undergrad<-read.csv(fileurl2)
colnames(undergrad)<-c("University", "Domain", "2010", "2011", "2012", "2013", "2014", "2015", "2016", "2017")
#undergrad
undergrad.long<-gather(undergrad, key="Year", value=Percent_Share, 3:10)
undergrad.long$UnivDomain=paste(undergrad.long$University, undergrad.long$Domain, sep="-")
undergrad.long$Percent_Share=round(undergrad.long$Percent_Share*100,2)
undergrad.long
##    University Domain Year Percent_Share  UnivDomain
## 1 Berkeley AI 2010 2 Berkeley-AI
## 2 Stanford AI 2010 3 Stanford-AI
## 3 UIUC AI 2010 1 UIUC-AI
## 4 UW AI 2010 0 UW-AI
## 5 Berkeley ML 2010 0 Berkeley-ML
## 6 Stanford ML 2010 5 Stanford-ML
## 7 UIUC ML 2010 0 UIUC-ML
## 8 UW ML 2010 0 UW-ML
## 9 Berkeley AI 2011 2 Berkeley-AI
## 10 Stanford AI 2011 3 Stanford-AI
## 11 UIUC AI 2011 0 UIUC-AI
## 12 UW AI 2011 0 UW-AI
## 13 Berkeley ML 2011 0 Berkeley-ML
## 14 Stanford ML 2011 5 Stanford-ML
## 15 UIUC ML 2011 0 UIUC-ML
## 16 UW ML 2011 0 UW-ML
## 17 Berkeley AI 2012 3 Berkeley-AI
## 18 Stanford AI 2012 3 Stanford-AI
## 19 UIUC AI 2012 1 UIUC-AI
## 20 UW AI 2012 0 UW-AI
## 21 Berkeley ML 2012 0 Berkeley-ML
## 22 Stanford ML 2012 8 Stanford-ML
## 23 UIUC ML 2012 0 UIUC-ML
## 24 UW ML 2012 0 UW-ML
## 25 Berkeley AI 2013 3 Berkeley-AI
## 26 Stanford AI 2013 3 Stanford-AI
## 27 UIUC AI 2013 1 UIUC-AI
## 28 UW AI 2013 1 UW-AI
## 29 Berkeley ML 2013 1 Berkeley-ML
## 30 Stanford ML 2013 10 Stanford-ML
## 31 UIUC ML 2013 0 UIUC-ML
## 32 UW ML 2013 0 UW-ML
## 33 Berkeley AI 2014 4 Berkeley-AI
## 34 Stanford AI 2014 5 Stanford-AI
## 35 UIUC AI 2014 1 UIUC-AI
## 36 UW AI 2014 1 UW-AI
## 37 Berkeley ML 2014 1 Berkeley-ML
## 38 Stanford ML 2014 11 Stanford-ML
## 39 UIUC ML 2014 1 UIUC-ML
## 40 UW ML 2014 1 UW-ML
## 41 Berkeley AI 2015 4 Berkeley-AI
## 42 Stanford AI 2015 8 Stanford-AI
## 43 UIUC AI 2015 1 UIUC-AI
## 44 UW AI 2015 1 UW-AI
## 45 Berkeley ML 2015 3 Berkeley-ML
## 46 Stanford ML 2015 13 Stanford-ML
## 47 UIUC ML 2015 0 UIUC-ML
## 48 UW ML 2015 1 UW-ML
## 49 Berkeley AI 2016 3 Berkeley-AI
## 50 Stanford AI 2016 9 Stanford-AI
## 51 UIUC AI 2016 1 UIUC-AI
## 52 UW AI 2016 1 UW-AI
## 53 Berkeley ML 2016 3 Berkeley-ML
## 54 Stanford ML 2016 9 Stanford-ML
## 55 UIUC ML 2016 1 UIUC-ML
## 56 UW ML 2016 1 UW-ML
## 57 Berkeley AI 2017 4 Berkeley-AI
## 58 Stanford AI 2017 12 Stanford-AI
## 59 UIUC AI 2017 3 UIUC-AI
## 60 UW AI 2017 2 UW-AI
## 61 Berkeley ML 2017 2 Berkeley-ML
## 62 Stanford ML 2017 13 Stanford-ML
## 63 UIUC ML 2017 2 UIUC-ML
## 64 UW ML 2017 1 UW-ML
The following graph shows undergraduate enrolment in AI and ML courses at selected US universities over the 2010-2017 period. It shows that enrolment has been trending upward over the past few years at these universities, which can be taken as representative of US universities more broadly.
library(ggplot2)
ggplot(undergrad.long, aes(x=Year, y=Percent_Share, group=UnivDomain,colour=UnivDomain)) + geom_line()+xlab("Years")+ylab("Percent of Total")
The following section collects the inputs: the regional percentage share of Artificial Intelligence publications over the 1998-2017 period. The data is loaded and tidied from a wide format to a long format, setting it up for further analysis.
fileurl<-"https://raw.githubusercontent.com/completegraph/DATA607PROJ3/master/Code/RegionalShareofAIPublications.csv"
regional.ai<-read.csv(fileurl)
colnames(regional.ai)<-c("Region","1998","1999","2000","2001","2002","2003","2004","2005","2006","2007","2008","2009","2010","2011", "2012","2013","2014","2015","2016","2017")
#regional.ai
regional.ai.long<-gather(regional.ai, key="Year", value=Percentage_Share, 2:21)
regional.ai.long
##           Region Year Percentage_Share
## 1 China 1998 0.08
## 2 United States 1998 0.28
## 3 Europe 1998 0.35
## 4 Rest of World 1998 0.28
## 5 China 1999 0.06
## 6 United States 1999 0.27
## 7 Europe 1999 0.35
## 8 Rest of World 1999 0.32
## 9 China 2000 0.09
## 10 United States 2000 0.25
## 11 Europe 2000 0.35
## 12 Rest of World 2000 0.31
## 13 China 2001 0.10
## 14 United States 2001 0.23
## 15 Europe 2001 0.36
## 16 Rest of World 2001 0.31
## 17 China 2002 0.12
## 18 United States 2002 0.24
## 19 Europe 2002 0.34
## 20 Rest of World 2002 0.30
## 21 China 2003 0.14
## 22 United States 2003 0.23
## 23 Europe 2003 0.35
## 24 Rest of World 2003 0.28
## 25 China 2004 0.16
## 26 United States 2004 0.22
## 27 Europe 2004 0.33
## 28 Rest of World 2004 0.28
## 29 China 2005 0.18
## 30 United States 2005 0.22
## 31 Europe 2005 0.32
## 32 Rest of World 2005 0.27
## 33 China 2006 0.20
## 34 United States 2006 0.19
## 35 Europe 2006 0.33
## 36 Rest of World 2006 0.28
## 37 China 2007 0.19
## 38 United States 2007 0.18
## 39 Europe 2007 0.33
## 40 Rest of World 2007 0.30
## 41 China 2008 0.27
## 42 United States 2008 0.15
## 43 Europe 2008 0.30
## 44 Rest of World 2008 0.28
## 45 China 2009 0.29
## 46 United States 2009 0.14
## 47 Europe 2009 0.31
## 48 Rest of World 2009 0.26
## 49 China 2010 0.27
## 50 United States 2010 0.14
## 51 Europe 2010 0.31
## 52 Rest of World 2010 0.28
## 53 China 2011 0.25
## 54 United States 2011 0.15
## 55 Europe 2011 0.31
## 56 Rest of World 2011 0.29
## 57 China 2012 0.24
## 58 United States 2012 0.15
## 59 Europe 2012 0.31
## 60 Rest of World 2012 0.30
## 61 China 2013 0.24
## 62 United States 2013 0.15
## 63 Europe 2013 0.32
## 64 Rest of World 2013 0.29
## 65 China 2014 0.24
## 66 United States 2014 0.15
## 67 Europe 2014 0.31
## 68 Rest of World 2014 0.29
## 69 China 2015 0.23
## 70 United States 2015 0.18
## 71 Europe 2015 0.32
## 72 Rest of World 2015 0.28
## 73 China 2016 0.24
## 74 United States 2016 0.17
## 75 Europe 2016 0.30
## 76 Rest of World 2016 0.30
## 77 China 2017 0.25
## 78 United States 2017 0.17
## 79 Europe 2017 0.28
## 80 Rest of World 2017 0.30
The following section shows a regional breakdown of AI papers published on Scopus for the years 1998-2017. The source of this data is Elsevier. The broad regional categories are: USA, Europe, China and Rest of World (RoW). Based on this, it can be seen that Europe has been the leading contributor of papers and publications in this domain over the years, followed closely by RoW. China has steadily increased its share of research publications in this area. Based on this metric, Europe can be considered the thought leader from a regional perspective.
ggplot(regional.ai.long, aes(x=Year, y=Percentage_Share, fill=Region)) +
geom_bar(stat="identity", colour="black") +
guides(fill=guide_legend(reverse=TRUE)) +
  scale_fill_brewer(palette="Pastel1") + theme(text = element_text(size=11),axis.text.x = element_text(angle=90, hjust=1))
Based on popular narrative, Andrew Ng and Kira Radinsky can be considered two of the primary thought leaders in Data Science. Over the past few years, they have shifted their focus from AI to more specific topics of specialisation such as deep learning, predictive analytics, and speech recognition.
We conclude that the top influential data scientist by paper count is Yoshua Bengio with 174 papers. He is noted for his expertise in deep learning along with Geoffrey Hinton and Yann LeCun. We conclude that thought leadership within the field of academic research does not equate to business thought leadership. However, without conceptual innovations made possible by academia, the application of these ideas to business is impossible.
Regarding the sub-topics being researched, Computer/Machine Learning and Computer Vision have grown in popularity at the cost of pure AI, while Neural and Evolutionary Computing has declined on a relative basis and Robotics has remained largely stable.
When it comes to countries and regions, the USA leads in terms of patents in DS/AI/ML, while Europe leads in generating research publications. China is catching up rapidly - for example, Tsinghua University led the way in research papers in this domain during 2018. Universities in the USA are showing a steady increase in enrolment in AI/ML courses, and Carnegie Mellon continues to lead the way in research in this domain.
We relied on a variety of sources for the data used in our analysis, such as the 2018 AI Index Report, CS Rankings, the arXiv and ACM websites, and the personal websites of individuals such as Andrew Ng. In terms of future research, we think that understanding what drives the changes in topics of interest for popular data science professionals as well as academia would be a good area to focus on.