TCC

Understanding the testing culture of machine learning projects on GitHub

This site serves as a record of all the analyses we carried out for our final project, and contains the research questions and explorations.

To start, our dataset is composed of 294 projects hosted on GitHub, all filtered from a curated list maintained here. For reasons of time and scope, we focus only on Python code, given that more than 70% of ML developers and data scientists report using Python.

For multi-language projects, we selected those with at least 20% Python code.
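As an illustration, here is a minimal sketch of that filter using the GitHub REST API languages endpoint, which reports bytes of code per language; the candidate list below is hypothetical:

```python
# Minimal sketch of the language filter, assuming the GitHub REST API
# /languages endpoint; the candidate list is hypothetical.
import requests

PY_THRESHOLD = 0.20  # keep repositories with at least 20% Python code

def python_share(owner: str, repo: str) -> float:
    """Fraction of the repository's code (in bytes) written in Python."""
    url = f"https://api.github.com/repos/{owner}/{repo}/languages"
    sizes = requests.get(url, timeout=10).json()  # e.g. {"Python": 123, "C": 45}
    total = sum(sizes.values())
    return sizes.get("Python", 0) / total if total else 0.0

candidates = [("scikit-learn", "scikit-learn")]  # hypothetical input list
selected = [(o, r) for (o, r) in candidates if python_share(o, r) >= PY_THRESHOLD]
```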

Exploratory Data Analysis

First, we explore the distribution of repositories across categories. After that, we show some of the projects' GitHub metadata. Finally, we look at some data points collected with SonarQube (a static analysis tool).

The distribution is as follows:

To clarify each category:

  • General Purpose Machine Learning: general-purpose projects, i.e. projects with broader applicability that can be used in many different contexts;
  • Data Analysis / Data Visualization: as the name suggests, projects that focus on data analysis and visualization;
  • Misc Scripts / iPython Notebooks / Codebases: miscellaneous scripts, Jupyter notebooks (an interactive format), and simple codebases;
  • Kaggle Competition Source Code: projects related to problems proposed on the Kaggle platform;
  • Natural Language Processing: as the name suggests, projects that work on natural language processing;
  • Computer Vision: repositories related to image processing and image understanding tasks;
  • Reinforcement Learning: projects related to agents that learn by having behaviors reinforced;
  • Neural Networks: repositories based on neural network models.

GitHub Metadata

Stars

| Category | Frequency | Mean | Median | 99th percentile | Std. dev. |
| --- | --- | --- | --- | --- | --- |
| Data Analysis / Data Visualization | 48 | 6356.54 | 2504.5 | 42803.54 | 9578.38 |
| Computer Vision | 24 | 6530.04 | 2240.0 | 41534.81 | 10948.41 |
| Reinforcement Learning | 11 | 6914.91 | 2202.0 | 28332.10 | 9696.79 |
| General-Purpose Machine Learning | 121 | 6336.98 | 1446.0 | 59413.60 | 17827.79 |
| Natural Language Processing | 38 | 3941.37 | 993.5 | 27653.84 | 6614.71 |
| Neural Networks | 8 | 2869.38 | 224.0 | 15782.89 | 5820.61 |
| Misc Scripts / iPython Notebooks / Codebases | 21 | 1791.19 | 85.0 | 20565.40 | 5322.31 |
| Kaggle Competition Source Code | 22 | 97.00 | 50.5 | 469.77 | 125.19 |

As we can see, reinforcement learning, computer vision, and data analysis all have a median above 2,200 stars, which makes them the most popular categories. Meanwhile, as expected, half of the Kaggle projects have about 50 stars or fewer, probably because of their narrower generalization power and applicability.
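The statistics in this and the following tables can be reproduced with a short pandas aggregation. Below is a sketch, assuming a hypothetical DataFrame `repos` with one row per repository, a `category` column, and one numeric column per metric:

```python
# Sketch of the per-category descriptive statistics, assuming a DataFrame
# "repos" with "category" plus one numeric column per metric (hypothetical).
import pandas as pd

def describe_by_category(repos: pd.DataFrame, metric: str) -> pd.DataFrame:
    stats = repos.groupby("category")[metric].agg(
        Frequency="count",
        Mean="mean",
        Median="median",
        P99=lambda s: s.quantile(0.99),  # 99th percentile
        Std="std",                       # standard deviation
    )
    # The tables in this section are ordered by median, descending.
    return stats.sort_values("Median", ascending=False).round(2)

# describe_by_category(repos, "stars"); likewise for "commits", "forks", ...
```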


Commits

| Category | Frequency | Mean | Median | 99th percentile | Std. dev. |
| --- | --- | --- | --- | --- | --- |
| Data Analysis / Data Visualization | 48 | 7852.35 | 2557.5 | 48839.89 | 12595.85 |
| Reinforcement Learning | 11 | 2231.64 | 947.0 | 13671.40 | 4264.06 |
| General-Purpose Machine Learning | 121 | 4494.14 | 844.0 | 48933.40 | 14260.06 |
| Computer Vision | 24 | 1009.88 | 280.5 | 10863.23 | 2687.49 |
| Natural Language Processing | 38 | 2237.45 | 272.0 | 25186.59 | 5832.54 |
| Misc Scripts / iPython Notebooks / Codebases | 21 | 513.29 | 92.0 | 5485.00 | 1422.51 |
| Neural Networks | 8 | 1066.25 | 53.5 | 7448.10 | 2799.45 |
| Kaggle Competition Source Code | 22 | 35.91 | 13.0 | 154.81 | 44.63 |

As for commits, we see more contributions in the data analysis and reinforcement learning categories, while the Kaggle projects have the fewest contributions: half do not even reach 13 commits. This suggests that they are one-off, short-term efforts.


Forks

There are 2 repositories with a value of 0.
| Category | Frequency | Mean | Median | 99th percentile | Std. dev. |
| --- | --- | --- | --- | --- | --- |
| Computer Vision | 24 | 1585.25 | 648.0 | 11099.61 | 2875.10 |
| Reinforcement Learning | 11 | 1497.82 | 441.0 | 7639.00 | 2452.52 |
| Data Analysis / Data Visualization | 48 | 1642.65 | 394.0 | 12704.28 | 2935.00 |
| General-Purpose Machine Learning | 121 | 1956.71 | 265.0 | 22742.60 | 8504.22 |
| Natural Language Processing | 38 | 767.58 | 241.0 | 5704.69 | 1394.51 |
| Neural Networks | 8 | 438.50 | 44.5 | 1935.76 | 775.27 |
| Misc Scripts / iPython Notebooks / Codebases | 21 | 564.86 | 30.0 | 6279.20 | 1642.84 |
| Kaggle Competition Source Code | 22 | 44.00 | 27.0 | 181.48 | 48.55 |

With respect to forks, the categories that stand out are computer vision and reinforcement learning, which suggests that they spawn more derived variants.


Open Issues

There are 45 repositories with a value of 0.
| Category | Frequency | Mean | Median | 99th percentile | Std. dev. |
| --- | --- | --- | --- | --- | --- |
| Data Analysis / Data Visualization | 48 | 579.54 | 172.5 | 4131.06 | 982.91 |
| General-Purpose Machine Learning | 121 | 250.09 | 42.0 | 2327.80 | 1008.35 |
| Reinforcement Learning | 11 | 307.27 | 38.0 | 2434.90 | 790.63 |
| Computer Vision | 24 | 128.67 | 32.5 | 675.72 | 195.68 |
| Natural Language Processing | 38 | 85.76 | 29.5 | 692.58 | 154.69 |
| Neural Networks | 8 | 11.75 | 6.0 | 37.44 | 14.56 |
| Misc Scripts / iPython Notebooks / Codebases | 21 | 38.38 | 1.0 | 406.60 | 105.67 |
| Kaggle Competition Source Code | 22 | 1.32 | 0.0 | 9.58 | 2.83 |

Finally, we see more open issues in data analysis, general-purpose machine learning, and natural language processing projects. It is important to note that the figure for data analysis is much higher than for the other categories, which may be explained by greater overall activity. This hypothesis is consistent with the commit counts, which are also higher for that category.


Test adoption in ML projects

Which automated testing tools are being used for ML?

We found that the categories adopt different tools to varying degrees. More specifically, we identified 35 tools used in practice for testing-related activities. Some of them are testing frameworks, like unittest, which ships with Python's standard library, and pytest, the most widely used third-party testing framework. Nose is also used, but mostly in legacy code, given that the library is no longer actively maintained.

Meanwhile, other packages serve more specific purposes, like behavior mocking (e.g. request_mock, pytest_mock, fakeredis, mongomock), property-based testing (e.g. hypothesis), auxiliary test functions (e.g. test_utils, test_helper), and even executable tests embedded in documentation (e.g. doctest, examples).
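As an illustration of how such usage can be detected, the sketch below parses every Python file in a repository and intersects the imported top-level modules with a list of known testing packages; the `TEST_TOOLS` set here is only a small, illustrative subset of the 35 tools:

```python
# Sketch: detect testing tools by scanning a repository's imports.
import ast
from pathlib import Path

# Small illustrative subset of the testing packages identified in this study.
TEST_TOOLS = {"unittest", "pytest", "nose", "hypothesis", "doctest", "mongomock"}

def tools_used(repo_root: str) -> set:
    """Top-level modules imported anywhere in the repo that are test tools."""
    found = set()
    for path in Path(repo_root).rglob("*.py"):
        try:
            tree = ast.parse(path.read_text(encoding="utf-8", errors="ignore"))
        except SyntaxError:
            continue  # skip files that do not parse (e.g. Python 2 code)
        for node in ast.walk(tree):
            if isinstance(node, ast.Import):
                found.update(alias.name.split(".")[0] for alias in node.names)
            elif isinstance(node, ast.ImportFrom) and node.module:
                found.add(node.module.split(".")[0])
    return found & TEST_TOOLS
```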

In terms of quality, what does the existence of tests say about the projects?

| Characteristic | With automated tests (N = 199) | Without automated tests (N = 94) | Percentage difference |
| --- | --- | --- | --- |
| **Lines of code (LOC)** | | | |
| Minimum | 51 | 26 | 96.15% |
| Mean | 61626.4254144 | 3308.5 | 1762.67% |
| Median | 13193 | 1415 | 832.37% |
| Maximum | 2235379 | 24290 | 9102.88% |
| **Cognitive Complexity (CC)** | | | |
| Minimum | 2 | 2 | 0% |
| Mean | 8600.8618785 | 747.5882353 | 1050.48% |
| Median | 1686 | 303 | 456.44% |
| Maximum | 413988 | 9237 | 4381.84% |
| **Duplicated lines (%)** | | | |
| Minimum | 0 | 0 | undefined* |
| Mean | 5.6845304 | 6.2764706 | -9.43% |
| Median | 2.7 | 0.65 | 315.38% |
| Maximum | 69.7 | 76.5 | -8.89% |
| **Bugs (normalized by CC)** | | | |
| Minimum | 0 | 0 | undefined* |
| Mean | 0.0046534 | 0.0023059 | 101.8% |
| Median | 0.0021529 | 0 | undefined* |
| Maximum | 0.0517879 | 0.0314465 | 64.69% |
| **Reliability Remediation Effort in minutes (normalized by CC)** | | | |
| Minimum | 0 | 0 | undefined* |
| Mean | 0.0293257 | 0.0149354 | 96.35% |
| Median | 0.011811 | 0 | undefined* |
| Maximum | 0.3628741 | 0.1509434 | 140.4% |
| **Smells (normalized by CC)** | | | |
| Minimum | 0.0287356 | 0 | undefined* |
| Mean | 0.2382076 | 0.2754192 | -13.51% |
| Median | 0.1421053 | 0.199278 | -28.69% |
| Maximum | 3.5 | 1 | 250% |
| **Technical Debt in minutes (normalized by CC)** | | | |
| Minimum | 0.1609195 | 0 | undefined* |
| Mean | 1.3037221 | 1.4150224 | -7.87% |
| Median | 0.943281 | 1.3280922 | -28.97% |
| Maximum | 11.1098097 | 3.7192623 | 198.71% |
| **Vulnerabilities (normalized by CC)** | | | |
| Minimum | 0 | 0 | undefined* |
| Mean | 0.000066 | 0.0001661 | -60.27% |
| Median | 0 | 0 | undefined* |
| Maximum | 0.0030958 | 0.0058824 | -47.37% |

\* The difference cannot be determined.
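The percentage-difference column corresponds to the usual relative-change formula, with a zero baseline yielding the "undefined*" entries. A small sketch that reproduces two of the medians above:

```python
# How the "Percentage difference" column appears to be computed: relative
# change of the with-tests value over the without-tests baseline. A zero
# baseline makes the difference undefined, matching the asterisk above.
def pct_diff(with_tests: float, without_tests: float) -> str:
    if without_tests == 0:
        return "undefined*"
    return f"{(with_tests - without_tests) / without_tests * 100:.2f}%"

assert pct_diff(13193, 1415) == "832.37%"  # LOC, median
assert pct_diff(1686, 303) == "456.44%"    # Cognitive Complexity, median
```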

In Table X we can see some statistics about size, complexity, and quality derived from static analysis with SonarQube. We found that the projects that use at least one of the testing mechanisms we identified have higher cognitive complexity (456.44% higher at the median), meaning that they are harder to understand overall. That complexity is also accompanied by larger code bases in terms of lines of code (832.37% higher at the median).

In terms of quality, we saw, on average, more duplicated lines in projects without test mechanisms (-9.43% for projects with tests). But since the median for projects with tests is higher (315.38%), we can attribute the higher average to outliers among the projects without tests. As for the normalized number of bugs and the reliability remediation effort (an estimate of the time needed to fix those bugs), we consistently saw more bugs and more effort per unit of complexity in projects with tests, probably because of their size.

Meanwhile, the opposite holds for the normalized number of code smells and for technical debt (an estimate of the time needed to fix those smells): projects with tests have fewer smells (-28.69% at the median) and less debt (-28.97% at the median) per unit of complexity. The same goes for vulnerabilities, where projects with some test mechanism have fewer occurrences overall (-60.27% on average).
