There is no single natural way of analyzing the graph created by the collaborative work of a team of software developers as reflected in the log of that repository. In this paper we will analyze several graphs created by this work, considering author or files the nodes in what is actually a bipartite network. We will try to find out which network better represents the complex nature of the technosocial system.
Beehive is a Go event and agent system available with a free license from https://github.com/muesli/beehive.
The repository can be represented as a bipartite graph with authors and files, with non-directed edges connecting authors with those files they have worked on. There are several possible ways of turning this bipartite graph into a one-party graph: create a graph of files joined by one edge if they have one developer in common, or joining authors by one edge if they have worked on at least one file. In this report our objective is to present a methodology to mine the social aspects of software repositories and some initial results, showing its strenghts and its limitations.
What every software project has in common is the history contained in the source control management log. This information can be extracted from the repository, once it has been downloaded, using git log. This command gives information on the files that have been changed during that commit, size and places where those changes have taken place, and the author of the commit. Although a commit might include work from several authors at once, if it is combined in a single change set the author will be the one that has actually merged the change set in a single commit; it all boils down to a commit including a single author and one or more files. After establishing the self-organized criticality of some repos examining these commit logs in papers such as Merelo-Guervós (2016) in this paper we will focus on the objects in the repository itself.
Besides pure repository metadata, some hosted repositories have additional ways of interacting, such as using issues or pull requests. However, this is more difficult to measure, it is not general and, eventually, works will have to be reflected in the repository itself, so we will not be using that information.
The repository information can be modeled as a bipartite graph with two types of nodes: authors and files. However, these type of repositories are more difficult to manage and we will turn them into single-party graphs. In order to do that we need also a univocal representation of the author’s name; commit metadata is extracted from general repository configuration files, and it might change depending on the computer the developer is using. One commit migh be signed by A. U. Thor <author@gmail.com> and another by Andrew Uri Thor <thor@company.biz>. This is why we do additional processing on the author names, using GitHub Search API to look the name of the author up and turn it into a single GitHub nick. This is, however, not always possible and in some cases we will find a single author represented by two different nodes. Edges that represent coauthorship of a file are undirected.
On the other hand, file names might have the same problem since they can vary along the project history. However, tracking the name of a file is a more difficult task whose outcome might not be so relevant to our study, so we will not do it for the time being.
Coming up next, we will examine the results of applying this technique to the above mentioned repository, Beehive.
The first natural way of relating files is by collapsing the bipartite graph into a single party creating an edge between all files that have been touched by a developer. This is shown below.
However, the file graph is not entirely satisfactory since, in fact, the main developer might have touched all files, and, second, files completely unrelated might be joined since they are created or modified sequentially by the same developer. In fact, there is file above that is related to most files in the repository. That is why we will extract another graph from the repository by linking two files by an edge only if they appear in the same commit. This is represented below.
The effect of the graph generated this way is different, with still a single file at the center of the graph, but more clear communities of files emerging and a more complex graph structure.
In this report we have presented a method for extracting social graphs from repository logs, helped by the search API of the repository hosting site, GitHub, to uniquely identify authors. It uses all files and all commits in the repo and it results, at least in this case, in a graph with a complex structure which reveals the complex nature of the system of developers it is a log of. Being as it is a relatively simple repository, it remains to be seen if it will translate in the same way to bigger repositories. Also it gives maybe too much importance to common files, Makefiles and READMEs, which are developed by most of the users in a repository. It remains to be seen if the structure of de developers network is more clearly revealed by filtering out those files. This is left, however, as future work.
Merelo-Guervós, Juan-J. 2016. “Self-Organized Criticality in Repository-Mediated Projects.” GeNeura group, University of Granada. https://www.academia.edu/28083421/Self-organized_criticality_in_repository-mediated_projects.