Visualizing AI / ML Papers

Motivation:

AI / ML has progressed rapidly over the last 10 years, and an avalanche of papers is being published in this space. Many of these papers come with open-source research code, which gives like-minded academics and industry practitioners a means to build on the research. Papers With Code (https://paperswithcode.com/) is a website that tracks AI / ML papers with open-source code implementations.

As a person learning about AI, I’m very curious about which research subdomains are actively researched and published. Hence, I have designed an interactive interface that lets one quickly filter, search and access this corpus of data. I have also made a couple of visualizations to give users a quick visual sense of the activity in the field.

P.S. I might have misread the assignment and spent more time designing the interactivity of the interface than the visualization, as I have always thought of datatables as just an alternative view to graphs. Hopefully I have designed enough interactivity into the main page to showcase the Shiny skills.

1. Describe the major data and design challenges faced in accomplishing the task and how you plan to overcome these challenges with a proposed sketched design. (3 marks)

Data Challenges:

Data published by Papers with Code tracks a large number (148,071) of AI/ML research papers. However, only about a third of these have code bases associated with them.
- Solution: Use Python to preprocess the data with an inner join between the larger set of all papers (with their associated metadata) and the smaller .json subset that tracks the papers that have code. Only 47,570 rows are left after the join.
The resulting dataset has many rows with invalid data (e.g. no title, no authors) or null data.
- Solution: Remove rows that have invalid or erroneous data. Only 33,616 rows (representing the same number of research papers) are finally usable.
Most of the data fields are textual in nature, hence it is hard to get summary statistics or numerical values for plots.
- Solution: Use Framework and Sub-domains, which contain a large number of repeated categorical values, to generate frequency and count data for plots. Details are in the accompanying Python file.
Freshness of data is important when analysing research trends.
- Solution: A .py script is provided alongside the Shiny web app for timely data preparation.
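The join and clean-up steps above can be sketched as follows. This is only a minimal illustration written with plain Python dictionaries so that it runs without dependencies; it is not the accompanying script. The field names (paper_url, title, authors, abstract, repo_url, framework) and the sample records are assumptions made for the example.

```python
def join_and_clean(papers, links):
    """Inner-join papers with their code links, then drop unusable rows."""
    # Index paper metadata by URL (assumed join key) for direct lookup.
    papers_by_url = {p["paper_url"]: p for p in papers if p.get("paper_url")}

    joined = []
    for link in links:
        paper = papers_by_url.get(link.get("paper_url"))
        if paper is None:
            continue  # inner join: a link without a matching paper is dropped
        joined.append({
            "title": paper.get("title"),
            "authors": paper.get("authors"),
            "abstract": paper.get("abstract"),
            "repo_url": link.get("repo_url"),
            "framework": link.get("framework"),
        })

    # Drop rows whose required fields are missing, null or empty.
    required = ("title", "authors", "repo_url")
    return [row for row in joined if all(row.get(key) for key in required)]


# Tiny made-up sample standing in for the two JSON files.
papers = [
    {"paper_url": "u1", "title": "Paper A", "authors": ["X"], "abstract": "..."},
    {"paper_url": "u2", "title": "", "authors": ["Y"], "abstract": "..."},
    {"paper_url": "u3", "title": "Paper C", "authors": ["Z"], "abstract": "..."},
]
links = [
    {"paper_url": "u1", "repo_url": "r1", "framework": "tf"},
    {"paper_url": "u2", "repo_url": "r2", "framework": "pytorch"},
    {"paper_url": "u9", "repo_url": "r9", "framework": "tf"},
]
clean = join_and_clean(papers, links)
print(len(clean), [row["title"] for row in clean])
```

Because the loop runs over the links, papers without any code link never reach the output, which is what makes this an inner join. A paper with several repositories would yield one row per repository.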

Design Challenges:

The interface has to be useful for information retrieval as well as visualization, so that users can easily access the papers and code they are interested in.
- Solution: Treat each row as a data item. As users interact with the interface and click on a paper in the datatable (“DT”), provide the URLs of the paper and its code repository interactively.
A lot of information is extracted from the raw JSON files, some of it duplicated. Not all of the information/complexity should be presented at once, otherwise it would overwhelm users. Filters and search also have to be on meaningful fields.
- Solution: Use a layered presentation, where more detailed data (e.g. authors, abstract, URL) is progressively presented to users as they interact with the interface. Provide users with the most important research filters.
For AI research, it is perhaps most important to understand which sub-domains have the most and the fewest research papers published.
- Solution: Create a word cloud visualization that gives users an at-a-glance view of which research topics are popular.

[Figure: proposed sketched design]

2. Provide step-by-step description on how the data visualization was prepared by using ggplot2 and other related R packages. (3 marks)

Step 0: Data prep. The data was first downloaded from https://paperswithcode.com/about. The “Papers with Abstract” and “Links between papers and code” JSON files are loaded into pandas, joined together, cleansed and prepared, creating the file ‘paperswithcode_clean.csv’, which holds the main body of the data, and ‘subdomains.csv’, which holds the frequency count of each AI / ML sub-domain.
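The frequency count behind ‘subdomains.csv’ could look roughly like the sketch below. It uses the standard library (collections.Counter and csv) instead of pandas so that it runs as-is. The "tasks" field name, the CSV column names and the sample rows are assumptions for illustration only.

```python
import csv
import io
from collections import Counter


def count_subdomains(rows):
    """Count how many papers mention each sub-domain, most frequent first."""
    counts = Counter()
    for row in rows:
        # Use a set so a sub-domain listed twice on one paper counts once.
        counts.update(set(row.get("tasks") or []))
    return counts.most_common()


def write_counts(counts, handle):
    """Write (sub-domain, frequency) pairs as CSV to an open text handle."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["subdomain", "freq"])  # assumed column names
    writer.writerows(counts)


# Made-up rows standing in for the cleaned paper data.
rows = [
    {"title": "Paper A", "tasks": ["Question Answering", "Language Modelling"]},
    {"title": "Paper B", "tasks": ["Question Answering"]},
    {"title": "Paper C", "tasks": None},
]
counts = count_subdomains(rows)

buffer = io.StringIO()  # stands in for open("subdomains.csv", "w", newline="")
write_counts(counts, buffer)
print(buffer.getvalue())
```

A two-column table of this shape (label, frequency) is the kind of input that word cloud and bar chart functions typically expect.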

Step 1: Using the “DT” library, I first created a low-fi draft of the interface together with the left panel to get a sense of 1) which columns of the data table would be most informative to present and 2) whether the layout makes sense.

First, using ggplot2, I tried to create the main visualization, a plot of hashtag counts based on static data. As shown in the plot below, the chart is not interactive and only shows the top 15 hashtags.

[Figure: static ggplot2 plot of the top 15 hashtags]

Step 2: Using the Shiny reactive function and the “dplyr” library, I added the logic that lets UI elements respond to user filters and selections on the fly. The 3 filters bind the values of the “Popular subdomain” dropdown, the “Published Date Range” date picker and the “ML Framework Used” radio button to filter the responsive datatable.
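The app implements this filtering in R with a Shiny reactive and dplyr. Purely to illustrate the logic of combining the three filters, here is a small sketch in Python; the function name filter_papers, the field names (tasks, framework, date) and the sample rows are hypothetical and are not taken from the app's code.

```python
from datetime import date


def filter_papers(rows, subdomain=None, start=None, end=None, framework=None):
    """Keep rows matching every filter that is set; None means 'no filter'."""
    def keep(row):
        if subdomain and subdomain not in row["tasks"]:
            return False
        if framework and row["framework"] != framework:
            return False
        published = date.fromisoformat(row["date"])  # assumes ISO date strings
        if start and published < start:
            return False
        if end and published > end:
            return False
        return True

    return [row for row in rows if keep(row)]


# Made-up rows standing in for the cleaned paper data.
rows = [
    {"title": "A", "tasks": ["Semantic Segmentation"], "framework": "tf", "date": "2019-05-01"},
    {"title": "B", "tasks": ["Semantic Segmentation"], "framework": "pytorch", "date": "2019-06-01"},
    {"title": "C", "tasks": ["Question Answering"], "framework": "tf", "date": "2018-01-15"},
]
selected = filter_papers(rows, subdomain="Semantic Segmentation", framework="tf")
print([row["title"] for row in selected])
```

In the app, the equivalent filtered result is recomputed whenever one of the three inputs changes, and the datatable is re-rendered from it.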

[Figure: filters and the responsive datatable]

UI Logic:

[Screenshot: UI code]

Server Logic:

[Screenshot: server code]

Step 3: Next, I added another layer of detail for each research paper in the white space reserved below the DataTable, so that users can quickly get a good overview of each paper. The linked information is generated dynamically based on the user’s selection within the DataTable.

Originally, I wanted to use an iframe to present the code and paper to make the entire experience more immersive and intuitive, but I was not able to find a lightweight client-side workaround for the “X-Frame-Options” issue. There is a more aggressive server-side solution, pre-caching some of the research PDFs and HTML pages using a headless browser; however, as this was not the focus of the module, I didn’t implement it.

[Figure: paper details shown below the DataTable]

UI Logic:

[Screenshot: UI code]

Server Logic:

[Screenshot: server code]

Step 4: Last but not least, I created a word cloud using the “wordcloud2” library and a bar chart using ggplot2 to present the popularity of key research topics.

[Figure: word cloud and bar chart]

UI Logic:

[Screenshot: UI code]

Server Logic:

[Screenshot: server code]

3. The final data visualization and a short description of not more than 350 words. The description must provide at least two useful information revealed by the data visualization. (4 marks)

33,616 papers with open-source code are analyzed for this visualization and data exploration. Built with data from: https://paperswithcode.com/

Main page: Allows users to:
- Filter the AI/ML papers by 1) popular sub-domain, 2) date of publication and 3) the framework that the research is based on.
- Select a paper in the table to reveal more details, such as the abstract and the links to the actual paper and code repository.
Word Count and Frequency Plots: Present a visualization of which sub-domains are highly popular.

The final visualization consists of 2 key components.

[Figure: final visualization]

(Observation 1 / Component 1) The word cloud component gives a very visual sense of which AI research topics are most popular. In this case, we can see that “Question Answering”, “Language Modelling”, “Speech Recognition”, “Word Embedding”, etc. are some of the AI sub-domains with a very high frequency. From this visualization alone, we can see that these topics all fall within Natural Language Processing. On mouse-over, we can see that there are 752 papers in the domain of “Question Answering”. We can also see that other sub-fields are less represented.

[Figure: DataTable component with labels (A), (B) and (B*)]

(Observation 2 / Component 2) In the second component, the DataTable, suppose I am interested in TensorFlow as a framework because I’m familiar with it, and I want to know which “Semantic Segmentation” code bases are available for me to experiment with and use. I first set the filters in (A). With the filters applied, (B) shows only the relevant papers, and I can see that there are 268 papers with code on this topic (B*). When I click on a paper that I’m interested in, more details are displayed below.

If I’m sufficiently interested in the paper, I can click on the blue button to get the link to the source code or the red button to read the paper in detail.