Source Cred EDA

Introduction

This document presents an exploratory data analysis of the Source Cred graph data hosted on GitHub. The analysis exists as an RMarkdown file that can be rerun on demand when new data becomes available.

The data holds 6,080 nodes and 15,929 edges. For the different node and edge types, there are 83 unique node pairings.

Node Types

Nodes are stored as an array of strings following this basic pattern:

["sourcecred", "git", "COMMIT", "007cf88172d7ea9b0cdada78f124f7a41b811b30"]

To determine the node types, the first three or four strings in each node were concatenated. This yielded 9 unique node types, summarized by count in the graph below.
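The concatenation step can be sketched roughly as follows. This is a minimal Python illustration, not the code from the RMarkdown file. The rule for choosing between three and four strings is not stated above, so the sketch assumes a fourth string is included only when it is written in capitals; `address_type`, `TYPE_TOKEN`, and the second node are hypothetical.

```python
import re
from collections import Counter

# Assumption: type components are written in capitals (e.g. "COMMIT"),
# while identifiers such as hashes or names are not.
TYPE_TOKEN = re.compile(r"^[A-Z_]+$")

def address_type(address, base=3, max_len=4):
    """Join the first `base` strings, plus following ones (up to `max_len`)
    for as long as they look like type tokens."""
    parts = list(address[:base])
    for part in address[base:max_len]:
        if not TYPE_TOKEN.match(part):
            break
        parts.append(part)
    return "/".join(parts)

nodes = [
    ["sourcecred", "git", "COMMIT", "007cf88172d7ea9b0cdada78f124f7a41b811b30"],
    # Made-up node, included only to exercise the four-string case.
    ["toy", "example", "PARENT_KIND", "CHILD_KIND", "some-id"],
]

# Count how many nodes fall under each derived type.
node_type_counts = Counter(address_type(n) for n in nodes)
```

If the real data uses a different rule for the fourth string, only `TYPE_TOKEN` or the `max_len` argument would need to change.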

Edge Types

Each edge is stored as a nested structure containing an address and two index numbers that refer to the connected nodes. The address holds the data that identifies the edge type. An example edge looks like this (commit hashes have been truncated for readability):

{
  "address": [
    "sourcecred", "git", "HAS_PARENT", "2", "COMMIT", "007cf881...", "2", "COMMIT", "d310561b..."
  ],
  "dstIndex": 744,
  "srcIndex": 0
}

To determine the edge types, the first three or four strings in each edge address were concatenated. This yielded 10 unique edge types, summarized by count in the graph below.
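As a rough illustration of reading an edge and deriving its type, the sketch below parses the example edge shown earlier. It is a minimal Python sketch rather than the analysis code itself; `edge_type` and its `n_parts` parameter are hypothetical, and the prefix length is left as a parameter because the rule for taking three versus four strings is not spelled out.

```python
import json
from collections import Counter

# The example edge from above, with the truncated hashes kept as-is.
edge_json = """
{
  "address": [
    "sourcecred", "git", "HAS_PARENT", "2", "COMMIT", "007cf881...", "2", "COMMIT", "d310561b..."
  ],
  "dstIndex": 744,
  "srcIndex": 0
}
"""

def edge_type(edge, n_parts=3):
    """Concatenate the first `n_parts` strings of the edge address."""
    return "/".join(edge["address"][:n_parts])

edge = json.loads(edge_json)

# With the full data this would iterate over every edge, not just one.
edge_type_counts = Counter(edge_type(e) for e in [edge])
```

For this edge the first three strings give `sourcecred/git/HAS_PARENT`; the remaining strings describe the two connected commits.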

Node Pairings

To get a better sense of how nodes and edges work together to form the network, we can look at the node pairings by edge type. Currently, there are 83 pairings, summarized by count in the table below.
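One way to compute such pairings is to resolve each edge's two index numbers to node types and count the resulting (source type, edge type, destination type) triples. The Python sketch below is a hedged illustration on a made-up toy graph. It assumes that `srcIndex` and `dstIndex` are zero-based positions in the node array, and it uses a fixed three-string prefix as the type; `type_of` and `pairing_counts` are hypothetical names.

```python
from collections import Counter

def type_of(address, n_parts=3):
    """Simplified type: the first `n_parts` strings joined with '/'."""
    return "/".join(address[:n_parts])

def pairing_counts(nodes, edges):
    """Count (source node type, edge type, destination node type) triples."""
    counts = Counter()
    for edge in edges:
        src = nodes[edge["srcIndex"]]  # assumed zero-based position
        dst = nodes[edge["dstIndex"]]
        counts[(type_of(src), type_of(edge["address"]), type_of(dst))] += 1
    return counts

# Made-up toy graph, for illustration only.
nodes = [
    ["toy", "demo", "KIND_A", "id-1"],
    ["toy", "demo", "KIND_A", "id-2"],
    ["toy", "demo", "KIND_B", "id-3"],
]
edges = [
    {"address": ["toy", "demo", "LINKS", "id-1", "id-2"], "srcIndex": 0, "dstIndex": 1},
    {"address": ["toy", "demo", "LINKS", "id-2", "id-1"], "srcIndex": 1, "dstIndex": 0},
    {"address": ["toy", "demo", "OWNS", "id-1", "id-3"], "srcIndex": 0, "dstIndex": 2},
]

pairings = pairing_counts(nodes, edges)
```

The number of distinct keys in `pairings` corresponds to the number of unique pairings reported for the real graph.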

Node Word Tokens

While the node types appear to clearly identify all the important information about nodes, a text-based analysis was employed to confirm that node types are the predominant data stored in the nodes. The text analysis tokenized the words in the node data and generated word counts across all nodes. The top 10 words are summarized in the chart below.
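The tokenize-and-count step might look roughly like the following Python sketch. The tokenizer used in the original analysis is not specified, so this assumes lowercasing and splitting on runs of word characters, which keeps a commit hash as a single token; `word_counts` and the second node are hypothetical.

```python
import re
from collections import Counter

def word_counts(addresses):
    """Lowercase every string, split it into word tokens, and count them."""
    counts = Counter()
    for address in addresses:
        for part in address:
            counts.update(re.findall(r"\w+", part.lower()))
    return counts

nodes = [
    ["sourcecred", "git", "COMMIT", "007cf88172d7ea9b0cdada78f124f7a41b811b30"],
    # Made-up node, for illustration only.
    ["sourcecred", "git", "COMMIT", "example_id_1"],
]

counts = word_counts(nodes)
top_words = counts.most_common(10)  # the ten most frequent tokens
```

If type words such as `commit` dominate the top of the list while identifiers each appear once, that supports the claim that types carry most of the information in the nodes.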

Edge Word Tokens

While the edge types appear to clearly identify all the important information about edges, a text-based analysis was employed to confirm that edge types are the predominant data stored in the edges. In addition, user names appear more prominently in edges than in nodes.

The text analysis tokenized the words in the edge data and generated word counts across all edges. The top 20 words are summarized in the chart below.
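To check which tokens are relatively more frequent in edges than in nodes, one option is to compare each token's share of all tokens in the two sets. The sketch below is a hypothetical Python illustration on made-up addresses; `token_shares` and the toy data are assumptions, and the original analysis may have compared the two sets differently.

```python
import re
from collections import Counter

def token_shares(addresses):
    """Return each token's share of all tokens found in the addresses."""
    counts = Counter(
        token
        for address in addresses
        for part in address
        for token in re.findall(r"\w+", part.lower())
    )
    total = sum(counts.values())
    return {token: n / total for token, n in counts.items()}

# Made-up addresses, for illustration only.
node_addresses = [["toy", "demo", "USER", "alice"], ["toy", "demo", "POST", "p1"]]
edge_addresses = [
    ["toy", "demo", "AUTHORS", "alice", "p1"],
    ["toy", "demo", "LIKES", "alice", "p1"],
]

node_share = token_shares(node_addresses)
edge_share = token_shares(edge_addresses)

# Tokens whose share is larger in edges than in nodes, largest gap first.
more_in_edges = sorted(
    (t for t in edge_share if edge_share[t] > node_share.get(t, 0.0)),
    key=lambda t: edge_share[t] - node_share.get(t, 0.0),
    reverse=True,
)
```

In this toy example the user name `alice` takes a larger share of edge tokens than of node tokens, which is the kind of pattern described above.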

Recommendations

Given the data available, the following recommendations are made for visualizing the data:

Node and edge types provide a stable view of the data that translates easily to a data visualization.
Edge data provides more information about the who and what of the network graph.
Adding node and edge scoring would improve the understanding of the network.
Labelling would require working with a subset of the data, as the current data set is too large to label in full.
Data visualizations should include at least the option for the following elements:
1. Node types by color
2. Edge types by color
3. Node scores/weights by radius size
4. Edge scores/weights by stroke width
5. Popover capabilities for displaying more in-depth information and labels
6. Data management support for unnamed and nested node/edge data
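The first four elements amount to mapping data attributes onto visual channels. The Python sketch below shows one possible way to express that mapping. Scores are not part of the current data, so the score values, the palette, and the names `color_by_type` and `scale` are all hypothetical.

```python
# Ten arbitrary hex colours; enough for the 9 node types or 10 edge types.
PALETTE = [
    "#1f77b4", "#ff7f0e", "#2ca02c", "#d62728", "#9467bd",
    "#8c564b", "#e377c2", "#7f7f7f", "#bcbd22", "#17becf",
]

def color_by_type(types):
    """Assign one palette colour per distinct type, cycling if needed."""
    return {t: PALETTE[i % len(PALETTE)] for i, t in enumerate(sorted(set(types)))}

def scale(value, lo, hi, out_min, out_max):
    """Linearly map a score in [lo, hi] onto [out_min, out_max]."""
    if hi == lo:
        return (out_min + out_max) / 2
    return out_min + (value - lo) / (hi - lo) * (out_max - out_min)

# Hypothetical scored nodes: (type, score).
scored_nodes = [("sourcecred/git/COMMIT", 0.0), ("toy/demo/KIND_A", 10.0)]

colors = color_by_type(t for t, _ in scored_nodes)
node_styles = [
    {"color": colors[t], "radius": scale(s, 0.0, 10.0, 2.0, 12.0)}
    for t, s in scored_nodes
]
```

The same `scale` helper could map edge scores to stroke widths by changing the output range.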