Motivation:
AI / ML research has progressed rapidly over the last 10 years, and an avalanche of papers is being published in this space. A number of these papers have open-sourced their research code, which gives like-minded academics and industry practitioners a means to build on the work. Papers With Code (https://paperswithcode.com/) is a website that tracks AI / ML papers with open source code implementations.
As a person learning about AI, I’m very curious about which research subdomains are being actively researched and published. Hence, I have designed an interactive interface that lets one quickly filter, search and access this corpus of data. I have also made a couple of visualizations to give users a quick visual sense of the activity in the field.
P.S. I might have misread the assignment and spent more time designing the interactivity of the interface than the visualization, as I always thought of data tables as just an alternative view to graphs. Hopefully I have designed enough interactivity into the main page to showcase the Shiny skills.
Data Challenges:
Design Challenge:
2. Provide step-by-step description on how the data visualization was prepared by using ggplot2 and other related R packages. (3 marks)
Step 0 - Data prep | The data was first downloaded from https://paperswithcode.com/about. The “Papers with Abstract” and “Links between papers and code” JSON files were loaded into pandas, joined together, cleansed and prepared, creating the file ‘paperswithcode_clean.csv’, which represents the main body of the data, and ‘subdomains.csv’, which holds the frequency count of each AI / ML subdomain.
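The join and frequency count described in Step 0 can be sketched roughly as follows. This is a minimal illustration, not the actual prep script: the original work used pandas, whereas this sketch uses only the Python standard library so that it runs without extra installs. The field names (paper_url, tasks, repo_url, framework), the sample records and the helper function names are all assumptions made for illustration.

```python
from collections import Counter

# Tiny in-memory stand-ins for the two JSON dumps.
# Field names and values are assumptions, not the real export schema.
papers = [
    {"paper_url": "p/1", "title": "Paper A", "date": "2019-05-01",
     "tasks": ["Question Answering", "Language Modelling"]},
    {"paper_url": "p/2", "title": "Paper B", "date": "2020-01-15",
     "tasks": ["Semantic Segmentation"]},
    {"paper_url": "p/3", "title": "Paper C", "date": "2020-02-01",
     "tasks": ["Question Answering"]},
    {"paper_url": "p/4", "title": "Paper D", "date": "2020-03-01",
     "tasks": ["Speech Recognition"]},  # no code link, dropped by the join
]
links = [
    {"paper_url": "p/1", "repo_url": "https://example.org/a", "framework": "tf"},
    {"paper_url": "p/2", "repo_url": "https://example.org/b", "framework": "pytorch"},
    {"paper_url": "p/3", "repo_url": "https://example.org/c", "framework": "tf"},
]


def join_papers_with_code(papers, links):
    """Inner-join code links to papers on paper_url (hypothetical key)."""
    by_url = {paper["paper_url"]: paper for paper in papers}
    rows = []
    for link in links:
        paper = by_url.get(link["paper_url"])
        if paper is None:
            continue  # skip links whose paper is missing
        rows.append({**paper,
                     "repo_url": link["repo_url"],
                     "framework": link["framework"]})
    return rows


def count_subdomains(rows):
    """Count how often each task (subdomain) appears, most frequent first."""
    counts = Counter(task for row in rows for task in row["tasks"])
    return counts.most_common()


clean_rows = join_papers_with_code(papers, links)   # rows for the main CSV
subdomain_counts = count_subdomains(clean_rows)     # rows for the subdomain CSV
```

In this sketch, clean_rows plays the role of ‘paperswithcode_clean.csv’ and subdomain_counts the role of ‘subdomains.csv’; writing them to disk with the csv module would be the remaining step.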
Step 1: Using the “DT” library, I first created a low-fi draft of the interface together with the left panel to get a sense of 1) which columns would be most informative to present in the data table and 2) whether the layout makes sense.
Using ggplot2, I then tried to create the main visualization, a plot of hashtags by their respective counts, based on static data. As shown in the plot below, the chart is not interactive and only shows the top 15 hashtags.
Step 2: Using Shiny’s reactive functions and the “dplyr” library, I added the logic that allows UI elements to respond to user filters and selections on the fly. The 3 filters bind the values of the “Popular subdomain” dropdown, the “Published Date Range” date picker and the “ML Framework Used” radio button to the responsive data table.
UI Logic:
Server Logic:
Step 3: Next, I added another layer of detail for each paper in the white space reserved below the DataTable, so that users can quickly get a good overview of each paper. The linked information is generated dynamically based on the user’s selection within the DataTable.
Originally, I wanted to use an iframe to present the code and paper to make the entire experience more immersive and intuitive, but I was not able to find a lightweight client-side workaround for the “X-Frame-Options” issue. There is a more aggressive server-side solution, which is to pre-cache some of the research PDFs and HTML pages using a headless browser; however, as this was not the focus of the module, I didn’t implement it.
UI Logic:
Server Logic:
Step 4: Last but not least, I created a word cloud using the “wordcloud2” library and a bar chart using ggplot2 to present the popularity of key research topics.
UI Logic:
Server Logic:
3. The final data visualization and a short description of not more than 350 words. The description must provide at least two useful information revealed by the data visualization. (4 marks)
33,616 papers with open source code were analyzed for this visualization and data exploration. Built with data from: https://paperswithcode.com/
The final visualization consists of 2 key components.
(Observation 1 / Component 1) The word cloud component gives a very visual sense of which AI research topics are the most popular. In this case, we can see that “Question Answering”, “Language Modelling”, “Speech Recognition”, “Word Embedding”, etc. are some of the AI subdomains with a very high frequency. From this visualization alone, we can see that these topics fall under Natural Language Processing. On mouse over, we can see that there are 752 papers in the domain of “Question Answering”. We can also see that other sub-fields are less represented.
(Observation 2 / Component 2) The second component is the DataTable. As an example, I am very interested in TensorFlow as a framework because I’m familiar with it, and I want to know which “Semantic Segmentation” code bases are available for me to experiment with and use. I first set the filters in (A). With the filters applied, (B) shows only the relevant papers, and I can see that there are 268 papers with code matching these filters (B*). By clicking on a paper that I’m interested in, more details are displayed below.
If I’m sufficiently interested in the paper, I can click on the blue button to get the link to the source code or the red button to read the paper in detail.