Overview
Welcome to the inaugural edition of “As the COG Turns”, the monthly COG Team Report!
As the COG Turns is an iterative work in progress, intended for IBM Cognitive Opentech (COG), although certainly shareable beyond that.
This report seeks to provide insight into the following questions:
- What do AI Communities care about?
- What do organizations engaged with AI Communities care about?
- What impact is our team having in AI Communities?
The report lives in an internal Github repository (Requires IBM Login) and is available both as R markdown and HTML through Github pages.
To give feedback on the report, including content suggestions, please file a Github issue (Requires IBM Login).
Dear Gentle Reader,
Eminence is an ambiguous term that is tricky to define and, as a result, even trickier to measure. The quantitative analysis presented here is intended to help us clarify our questions and inform our discussions as we work toward a better understanding of what Eminence means for us, both as individuals and as an organization.
The sign of a good metric is that it inspires more questions than it was intended to answer. While this initial edition is pretty bare bones, I hope it at least inspires a few good questions (which I hope you submit as Github issues!).
Happy Spelunking!
Augustina
AI Community Trends
What organizations are most visibly active in contribution activities for projects of interest? This does not necessarily indicate that one organization is more influential than another. Having a clearly identifiable affiliation suggests a high degree of “official” interest that could be followed up on.
This section looks specifically at commit and event activity for projects of interest. This version of the report only looks at specific repositories, but future versions will do broader analysis at the overall project level. If you would like to add a project of interest to the list, submit an issue to the “As the COG Turns” Github repo (Requires IBM Login)!
Commit Logs
What organizations had identifiable commit activity in projects of interest? Organizations were identified using the domain name of the author email address in the commit log. When calculating the proportion of authors, a network was used to identify single authors using multiple email addresses.
A domain lookup service was used to identify organizations. 1 2 Public status was verified by looking up the stock ticker symbol using the name reported from the domain lookup service. 3
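The identification steps above can be sketched roughly as follows. This is a minimal illustration, not the report's actual pipeline: it assumes commit authors are available as (name, email) pairs and that two pairs belong to the same person when they share either a name or an email address. The function and class names are hypothetical.

```python
from collections import defaultdict


def email_domain(email):
    """Return the lower-cased domain of an email address, or None if malformed."""
    _, sep, domain = email.strip().lower().rpartition("@")
    return domain if sep and domain else None


class UnionFind:
    """Minimal disjoint-set structure used to find connected components."""

    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            # Path halving keeps the trees shallow.
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra


def consolidate_authors(commits):
    """Group email addresses that appear to belong to the same author.

    Assumption (not from the report): names and emails are nodes in a
    network, each commit links its name to its email, and every connected
    component is treated as one author.
    """
    uf = UnionFind()
    for name, email in commits:
        uf.union(("name", name.strip().lower()), ("email", email.strip().lower()))
    groups = defaultdict(set)
    for _, email in commits:
        key = ("email", email.strip().lower())
        groups[uf.find(key)].add(key[1])
    return sorted(sorted(group) for group in groups.values())
```

Matching on names alone can merge different people who share a name, so a real analysis would likely need additional checks before trusting the merged identities.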
Grouping organizations by type and looking at the proportion of activity we can identify on different projects may give us insight into how developers are engaging with different projects. Tensorflow is hugely popular and shows a high proportion of authors from several types of affiliations.
Possible Hypotheses based on this plot:
- The presence of “education” authors could indicate wider adoption of the project.
- The presence of “education” authors could indicate a more innovative project being used in research.
- Projects with a high proportion of “private” or “public” authors could indicate that only the authoring company is actively engaged with the project.
- For projects with less type diversity, a high proportion of “personal” authors could indicate an affiliation failure.
- For projects with higher type diversity, a high proportion of “personal” authors could indicate more developer engagement.
Proportion of authors grouped by Affiliation Type
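For readers who want to reproduce a view like this, the proportion calculation might look like the sketch below. It assumes each row is a (project, author, affiliation type) triple after author consolidation. The function name and row layout are illustrative and not taken from the report's R code.

```python
from collections import Counter, defaultdict


def type_proportions(author_rows):
    """Return {project: {affiliation_type: share of distinct authors}}.

    author_rows is an iterable of (project, author_id, affiliation_type).
    Each author is counted once per project, using the first type seen.
    """
    counts = defaultdict(Counter)
    seen = set()
    for project, author, aff_type in author_rows:
        if (project, author) in seen:
            continue  # ignore repeat commits by the same author
        seen.add((project, author))
        counts[project][aff_type] += 1
    return {
        project: {t: n / sum(c.values()) for t, n in c.items()}
        for project, c in counts.items()
    }
```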
Another view of type diversity is presented below. Organizations are grouped by type and by whether they had at least one commit present in January or February. The width of each type's bar indicates how many organizations had at least one commit identified in each month. A wider bar suggests more organizations of a particular type had a commit present. If an organization had a commit in both January and February, it is counted twice.
This view also has the advantage of showing how active a project is.
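The counting rule described above, one count per organization per month in which it had at least one commit, can be sketched as follows. The tuple layout and function name are assumptions for illustration.

```python
from collections import Counter


def org_month_counts(commits):
    """Count distinct (organization, month) pairs per affiliation type.

    commits is an iterable of (organization, affiliation_type, month).
    An organization with commits in two months contributes two counts,
    no matter how many commits it made in each month.
    """
    pairs = {(org, aff_type, month) for org, aff_type, month in commits}
    return Counter(aff_type for _, aff_type, _ in pairs)
```

Because the unit is the organization-month rather than the commit, a single very active organization cannot widen a bar by more than one count per month.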
Affiliation Types having at least one commit
Stratifying by type makes it easier to see what companies had identifiable activity in the repository. Public companies are either identified as such by the domain service or identified through the presence of a stock ticker symbol.
Tensorflow and Pytorch show the most diversity among public companies. Of particular interest are affiliations across projects: Google and Intel show the most breadth in this group.
Public Companies by name
The following plot shows private companies identified as having at least one commit in each of the listed repositories. Tensorflow, Pytorch, and Caffe2 show the highest affiliation diversity in this type. Caffe2 shows a higher affiliation rate among private companies than among public companies.
When we consider Educational Institution affiliation, Pytorch and Caffe2 show the highest rates of identification. Pytorch in particular has the most double affiliations (institutions that had a commit in both January and February, so were counted twice).
Preliminary Conclusions
- Educational Institutions appear to be more engaged with Caffe2 and Pytorch than with Tensorflow.
- Besides its own CNTK project, Microsoft only has affiliated activity on Pytorch, not on Tensorflow.
- Besides its own projects, Tensorflow and Keras, Google also has affiliated activity in Scikit-learn and Deeplearnjs.
- IBM has affiliated activity with PyTorch and Tensorflow.
Rhetorical Questions
- Why is Pytorch appealing to a broader set of authors than Tensorflow?
- What are educational institutions doing with Caffe2?
Caveats
- Affiliation identification is limited by the accuracy of the domain name information from Clearbit. Efforts have been made to correct the more obvious errors (such as IBM being classified as “private”; stock ticker information fixes this).
- International companies may not be accurately identified. For example, educational institutions often use the “ac” suffix. While this has been corrected, there may be other cases of misidentification.
- Clearbit did not return domain information for all of the domains. A manual check of a random sample indicated these domains did not resolve and no information was available.
- Github has been called out as its own type because Github email addresses are used for commits made through the Github UI, by automation tools, or by authors wishing to obscure their actual email address.
- This analysis only looked at authors, not committers.
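The corrections mentioned in these caveats could be expressed as a small set of override rules applied on top of the lookup result. The sketch below is hypothetical: the rule order, the list of personal mail providers, and the parameter names are assumptions, and it does not call Clearbit or any other service.

```python
# Hypothetical list of free mail providers treated as "personal".
PERSONAL_DOMAINS = {"gmail.com", "outlook.com", "yahoo.com", "hotmail.com"}


def classify_affiliation(domain, lookup_type=None, has_ticker=False):
    """Return an affiliation type for an email domain.

    lookup_type is whatever type the domain lookup service reported
    (or None if it returned nothing); has_ticker says whether a stock
    ticker symbol was found for the organization name.
    """
    domain = domain.lower()
    labels = domain.split(".")
    # Github addresses form their own type (UI commits, bots, noreply).
    if domain == "github.com" or domain.endswith(".github.com"):
        return "github"
    # ".edu" domains plus country-code forms such as "ac.uk" or "edu.au".
    if labels[-1] == "edu" or (len(labels) >= 3 and labels[-2] in {"ac", "edu"}):
        return "education"
    if domain in PERSONAL_DOMAINS:
        return "personal"
    # A ticker symbol overrides a "private" label from the lookup service.
    if has_ticker:
        return "public"
    return lookup_type or "unknown"
```

For example, `classify_affiliation("ibm.com", lookup_type="private", has_ticker=True)` returns "public", which mirrors the IBM correction described above.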
Commit Interval Metric Applied to Selected Deep Learning Repositories
COG Advocacy
Pattern Page Views
By itself, the Github traffic data doesn’t really tell us much. I’m sharing it because Github only makes it available for 14 days at a time. Future reports will explore this area more by comparing Wrike activity and Github events.
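Since the traffic data is only available for a 14-day window, one way to keep a longer history is to merge each new snapshot into a running archive. The sketch below assumes the snapshot has the shape returned by the Github REST API traffic endpoint (`GET /repos/{owner}/{repo}/traffic/views`), that is, a list under `views` whose entries carry `timestamp`, `count`, and `uniques`. The function name and archive layout are illustrative.

```python
def merge_traffic(archive, payload, key="views"):
    """Merge one traffic snapshot into an archive keyed by timestamp.

    archive maps timestamp -> {"count": int, "uniques": int}.
    payload is the parsed JSON snapshot; pass key="clones" for clone data.
    Days present in both are overwritten by the newer snapshot, since the
    most recent day in an older snapshot may have been incomplete.
    """
    merged = dict(archive)  # leave the caller's archive untouched
    for day in payload.get(key, []):
        merged[day["timestamp"]] = {
            "count": day["count"],
            "uniques": day["uniques"],
        }
    return merged
```

Running a merge like this at least once every two weeks and saving the result would preserve a continuous daily series for later comparison with other activity data.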
Future Questions
- What advocacy activities generated the most interest in our code patterns?
- How do views and clones compare with other Github activities?
- How effective are traffic and clones for indicating developer interest?
Views of Open Source Cognitive team patterns on Github in February 2018
Views of Spark team patterns on Github in February 2018
Pattern Clones
Clones of Open Source Cognitive team patterns on Github in February 2018
Clones of Spark team patterns on Github in February 2018
Next Month
In addition to improved versions of this month’s analysis, next month’s report may include the following:
- Committers in addition to authors in the commit log
- Github Events in addition to the Commit Log
- Github events in addition to traffic data
- Wrike activity in relation to the traffic data
- Commit trends for top organizations on Github
Is there something you’d like to see next month? Submit a Github Issue to the As The COG Turns repo! (Requires IBM Login)
Some domain aliases were not fully consolidated so some affiliations may have been missed. A future version of this report will fix this.