Measuring and ranking the free software developers in a particular geographical area is a way of getting to know the existing community, and it also allows assessing the impact of certain policies on the dynamics of that community. Besides, it is interesting to try to find out why there are differences from one place to the next and how these differences evolve with time. In this paper, our main interest is to measure and rank the community of free software developers in Spain and to check its geographical distribution. The paper measures differences by province, providing a classification of provinces according to the number and type of developers present in each one.
The initial motive behind this paper was to check the health of the free software developer community in Spain. With that target in mind, we elaborated some initial national rankings, which were published in Merelo-Guervos et al. (2015). The main problem with those rankings is that in many cases, and especially in the big cities, there was no attempt to exhaustively search all active users. Since the GitHub search API returns just 1000 results and, in the case of Madrid and Barcelona, there were many more users than that, the script that downloads user data had to be modified so that, by partitioning the search space, it performed searches that each returned fewer than 1000 results until all users were covered. It still does not cover users who do not declare their city or province in their profile, or who declare a town in the province that is not explicitly searched. In one case, Guadalajara, it was impossible to make out which users were actually from Guadalajara, Spain, and not from Guadalajara, Mexico, and thus the province was explicitly excluded. In general, it can be said that all users who declare their province or provincial capital are included; the number of users who live there and do not declare it is unknown, and hopefully uniform across all the provinces involved.
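The partitioning idea can be sketched as follows. This is only an illustration under stated assumptions, not the script actually used: the text does not say along which dimension the search space was split, so the sketch assumes account creation date, and `count_in_range` is a hypothetical callback standing in for one search request that reports how many users match a date range.

```python
from datetime import date, timedelta
from typing import Callable, List, Tuple

MAX_RESULTS = 1000  # cap on results per search, as stated in the text

DateRange = Tuple[date, date]


def partition_search(
    start: date,
    end: date,
    count_in_range: Callable[[date, date], int],  # hypothetical: one search call
    limit: int = MAX_RESULTS,
) -> List[DateRange]:
    """Split [start, end] into sub-ranges that each match fewer than `limit` users.

    A single day cannot be split further, so it is returned as-is even if
    it still exceeds the limit.
    """
    total = count_in_range(start, end)
    if total < limit or start == end:
        return [(start, end)]
    # Bisect the range; adding a timedelta to a date ignores the sub-day part.
    middle = start + (end - start) // 2
    left = partition_search(start, middle, count_in_range, limit)
    right = partition_search(middle + timedelta(days=1), end, count_in_range, limit)
    return left + right


if __name__ == "__main__":
    # Simulated data (not real GitHub figures): 5 sign-ups per day for 500 days.
    first_day = date(2010, 1, 1)
    signups = [first_day + timedelta(days=i % 500) for i in range(2500)]

    def simulated_count(lo: date, hi: date) -> int:
        return sum(1 for d in signups if lo <= d <= hi)

    for lo, hi in partition_search(first_day, first_day + timedelta(days=499), simulated_count):
        print(lo, hi, simulated_count(lo, hi))
```

Bisection is just one possible strategy; any split that keeps every query under the cap and leaves no gaps between ranges would serve the same purpose.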
In a previous version of this paper, Merelo (2015), we had not been able to obtain the results for the whole country, that is, for users who declare only their country (possibly plus a city, town or community that is not otherwise searched, that is, excluding the provincial capital). In this paper, however, all users who declare Spain (in several variants) as their place of residence have been retrieved, and additional analysis that involves them can be performed. This is done next.
After downloading all users, their profiles were scraped to extract the following information: number of followers, stars given by the user, stars given to the projects in which the user participates, and raw number of users. The number of users in each province is shown below. As expected, the provinces with the biggest population have the largest number of users. The plot below ranks provinces by population (according to the official census).
This graph does not include the non-provincial users, that is, those whose province was not declared; besides, it does not give an idea of the number of users per province relative to the total number of users, which is around 10000. This is shown next.
The users without a declared province form the biggest slice of the pie, with around one quarter of the total number of users. They are followed by Madrid, Barcelona, Valencia, Seville, Granada and Málaga. These provinces, by themselves, host more than half the total community of GitHub users in Spain.
Apart from Valencia, Barcelona and Madrid, these provinces are not the most populated in Spain. That is why, if we take population into account, dividing the number of GitHub users by the provincial population (as published by the National Statistics Institute), the situation is somewhat different, with Barcelona and Granada emerging as the provinces with the highest number of active GitHub developers per capita; they are head-to-head and the order might change from this report to the next, so no clear winner is declared. After them the situation is more stable, with Madrid in third position, followed by Seville and Saragossa (Zaragoza). The province of Córdoba follows at a distance.
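The per-capita normalization amounts to a single division per province. A minimal sketch follows; the function name, the scaling to users per 100,000 inhabitants and the province labels in the usage lines are illustrative assumptions, and the numbers are placeholders rather than data from this study.

```python
from typing import Dict, List, Tuple


def rank_per_capita(
    users: Dict[str, int],
    population: Dict[str, int],
    per: int = 100_000,  # assumed scale: users per 100,000 inhabitants
) -> List[Tuple[str, float]]:
    """Rank provinces by GitHub users relative to population, highest first."""
    rates = {
        province: users[province] * per / population[province]
        for province in users
        if province in population  # skip provinces with no census figure
    }
    return sorted(rates.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Placeholder figures for two made-up provinces, not real counts.
    users = {"Province A": 200, "Province B": 150}
    population = {"Province A": 1_000_000, "Province B": 500_000}
    for province, rate in rank_per_capita(users, population):
        print(f"{province}: {rate:.1f} users per 100,000 inhabitants")
```

In the placeholder data the province with fewer users ranks first once population is taken into account, which is the kind of reordering described above.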
The situation is similar if we take into account the aggregated contributions by all declared users in the province. Once again, Granada emerges as the winner but the position of Barcelona and Madrid is inverted and a new player, Valladolid, enters the top five. Navarra, due to a single user with more than 20000 contributions, has managed to enter that ranking too.
The aggregated number of followers, that is, the sum of the users that follow every user in the province, is a bit less surprising, with Madrid and Barcelona on top, but Álava and Bilbao entering the top five. If we delve into the data, this is mainly due to a single user in each case.
The number of stars given to projects, which is a proxy for popularity, is correlated with the number of followers (although it remains to be seen exactly how), with Álava again in the top five and two completely new provinces, Salamanca and Pontevedra, reaching the top.
Finally, if we consider the number of stars given by users, a new province, Tenerife, gets into the top 10. This would be correlated with the “social” activity of users in the province, since stars are similar to likes in other social platforms: they are issued by users to those projects they like; lots of stars basically mean that users in that province are keen on “favoriting” other projects by giving them stars.
The graphs above suggest that there are different classes of provinces in Spain. We have performed clustering using mclust (Fraley and Raftery 2002; Fraley et al. 2012), using as the representation of each province the relative values plotted above, and obtained this division into four clusters.
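For readers unfamiliar with mclust, the model it fits can be summarized as a finite Gaussian mixture. The notation here is ours and is only a sketch of the general method: $\mathbf{x}_i$ stands for the vector of relative values of province $i$, $G$ for the number of clusters, $m$ for the number of free parameters and $n$ for the number of provinces. The text does not say whether the four clusters were selected by BIC or fixed in advance.

```latex
% Mixture density for the feature vector of province i
f(\mathbf{x}_i) = \sum_{k=1}^{G} \pi_k \, \phi(\mathbf{x}_i \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k),
\qquad \pi_k \ge 0, \quad \sum_{k=1}^{G} \pi_k = 1

% Each province is assigned to the component with the highest posterior weight
\hat{z}_i = \arg\max_{k} \; \hat{\pi}_k \, \phi(\mathbf{x}_i \mid \hat{\boldsymbol{\mu}}_k, \hat{\boldsymbol{\Sigma}}_k)

% BIC in the mclust sign convention (larger is better)
\mathrm{BIC} = 2 \log \hat{L} - m \log n
```

Here $\phi$ is the multivariate normal density and $\hat{L}$ the maximized likelihood; the parameters are estimated with the EM algorithm.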
Apparently, one cluster includes the provinces with the highest productivity and then two provinces which excel in the number of stars given to projects in which local developers participate: Madrid, Barcelona, Granada and sometimes another province, which is on some occasions Zaragoza and on others Navarra. The situation in the other clusters is more fluid, with two or sometimes three clusters; the second one includes provinces with a certain defining feature (such as the total number of stars in projects), like Álava, and finally there are one or two further clusters with the rest of the provinces.
This paper, which is updated every two or three weeks, measures and ranks Spanish provinces by the number of users and other quantities related to productivity (contributions) and popularity (stars, followers). It is a first approximation to community metrics in Spain and is mainly intended as a reference for future use. It is also a snapshot of a particular point in time. Future versions will probably change this scenario, and delving into the reasons for these changes is an interesting line of work.
I am grateful to Francisco Charte for his help creating the pie chart for this paper.
Fraley, Chris, and Adrian E. Raftery. 2002. “Model-Based Clustering, Discriminant Analysis and Density Estimation.” Journal of the American Statistical Association 97: 611–31.
Fraley, Chris, Adrian E. Raftery, Thomas Brendan Murphy, and Luca Scrucca. 2012. Mclust Version 4 for R: Normal Mixture Modeling for Model-Based Clustering, Classification, and Density Estimation. Technical Report. Department of Statistics, University of Washington.
Merelo, J. J. 2015. GitHub Users in Spain: An Initial Analysis. GeNeura Team http://geneura.wordpress.com; RPubs, http://rpubs.com/jjmerelo/gh-users-spain.
Merelo-Guervos, Juan-Julian, Israel Blancas, Maribel G. Arenas, Fernando Tricas, José Antonio Vacas, and Nuria Rico. 2015. “GitHub Rankings and Its Impact on the Local Free Software Development Community.” The Winnower, January. doi:10.15200/winn.142251.14740.