TCC

Understanding the testing culture of machine learning projects on GitHub

This site serves as a record of all the analyses we carried out for our final project, and contains the research questions and explorations.

To start, our dataset is composed of 294 projects hosted on GitHub, all filtered from a curated list maintained here. For reasons of time and scope, we focus only on Python code, given that more than 70% of ML developers and data scientists report using Python.

For multi-language projects, we selected those with at least 20% Python code.
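As an illustration, here is a minimal sketch of that filter using the GitHub REST API languages endpoint, which reports bytes of code per language; the candidate list below is hypothetical:

```python
# Minimal sketch of the language filter, assuming the GitHub REST API
# /languages endpoint; the candidate list is hypothetical.
import requests

PY_THRESHOLD = 0.20  # keep repositories with at least 20% Python code

def python_share(owner: str, repo: str) -> float:
    """Fraction of the repository's code (in bytes) written in Python."""
    url = f"https://api.github.com/repos/{owner}/{repo}/languages"
    sizes = requests.get(url, timeout=10).json()  # e.g. {"Python": 123, "C": 45}
    total = sum(sizes.values())
    return sizes.get("Python", 0) / total if total else 0.0

candidates = [("scikit-learn", "scikit-learn")]  # hypothetical input list
selected = [(o, r) for (o, r) in candidates if python_share(o, r) >= PY_THRESHOLD]
```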

Exploratory Data Analysis

First, we explore the distribution of repositories across categories. After that, we show some of the projects' GitHub metadata. Finally, we look at some data points collected with SonarQube (a static analysis tool).

The distribution is as follows:

To clarify each category:

  • General Purpose Machine Learning: general-purpose projects, i.e. projects with broader applicability that can be used in many different contexts;
  • Data Analysis / Data Visualization: as the name suggests, projects that focus on data analysis and visualization;
  • Misc Scripts / iPython Notebooks / Codebases: miscellaneous scripts, Jupyter notebooks (an interactive format), and simple codebases;
  • Kaggle Competition Source Code: projects related to problems proposed on the Kaggle platform;
  • Natural Language Processing: as the name suggests, projects that work on natural language processing;
  • Computer Vision: repositories related to image processing and image understanding tasks;
  • Reinforcement Learning: projects related to agents that learn by having behaviors reinforced;
  • Neural Networks: repositories based on neural network models.

GitHub Metadata

Stars

| Category | Frequency | Mean | Median | 99th percentile | Std. dev. |
| --- | --- | --- | --- | --- | --- |
| Data Analysis / Data Visualization | 48 | 6356.54 | 2504.5 | 42803.54 | 9578.38 |
| Computer Vision | 24 | 6530.04 | 2240.0 | 41534.81 | 10948.41 |
| Reinforcement Learning | 11 | 6914.91 | 2202.0 | 28332.10 | 9696.79 |
| General-Purpose Machine Learning | 121 | 6336.98 | 1446.0 | 59413.60 | 17827.79 |
| Natural Language Processing | 38 | 3941.37 | 993.5 | 27653.84 | 6614.71 |
| Neural Networks | 8 | 2869.38 | 224.0 | 15782.89 | 5820.61 |
| Misc Scripts / iPython Notebooks / Codebases | 21 | 1791.19 | 85.0 | 20565.40 | 5322.31 |
| Kaggle Competition Source Code | 22 | 97.00 | 50.5 | 469.77 | 125.19 |

As we can see, reinforcement learning, computer vision, and data analysis all have a median above 2,200 stars, which makes them the most popular categories. Meanwhile, as expected, half of the Kaggle projects have about 50 stars or fewer, probably because of their narrower generalization power and applicability.
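The statistics in this and the following tables can be reproduced with a short pandas aggregation. Below is a sketch, assuming a hypothetical DataFrame `repos` with one row per repository, a `category` column, and one numeric column per metric:

```python
# Sketch of the per-category descriptive statistics, assuming a DataFrame
# "repos" with "category" plus one numeric column per metric (hypothetical).
import pandas as pd

def describe_by_category(repos: pd.DataFrame, metric: str) -> pd.DataFrame:
    stats = repos.groupby("category")[metric].agg(
        Frequency="count",
        Mean="mean",
        Median="median",
        P99=lambda s: s.quantile(0.99),  # 99th percentile
        Std="std",                       # standard deviation
    )
    # The tables in this section are ordered by median, descending.
    return stats.sort_values("Median", ascending=False).round(2)

# describe_by_category(repos, "stars"); likewise for "commits", "forks", ...
```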


Commits

| Category | Frequency | Mean | Median | 99th percentile | Std. dev. |
| --- | --- | --- | --- | --- | --- |
| Data Analysis / Data Visualization | 48 | 7852.35 | 2557.5 | 48839.89 | 12595.85 |
| Reinforcement Learning | 11 | 2231.64 | 947.0 | 13671.40 | 4264.06 |
| General-Purpose Machine Learning | 121 | 4494.14 | 844.0 | 48933.40 | 14260.06 |
| Computer Vision | 24 | 1009.88 | 280.5 | 10863.23 | 2687.49 |
| Natural Language Processing | 38 | 2237.45 | 272.0 | 25186.59 | 5832.54 |
| Misc Scripts / iPython Notebooks / Codebases | 21 | 513.29 | 92.0 | 5485.00 | 1422.51 |
| Neural Networks | 8 | 1066.25 | 53.5 | 7448.10 | 2799.45 |
| Kaggle Competition Source Code | 22 | 35.91 | 13.0 | 154.81 | 44.63 |

As for commits, we see more contributions in the data analysis and reinforcement learning categories, while the Kaggle projects have the fewest contributions: half do not even reach 13 commits. This suggests that they are one-off, short-term efforts.


Forks

There are 2 repositories with a value of 0.
| Category | Frequency | Mean | Median | 99th percentile | Std. dev. |
| --- | --- | --- | --- | --- | --- |
| Computer Vision | 24 | 1585.25 | 648.0 | 11099.61 | 2875.10 |
| Reinforcement Learning | 11 | 1497.82 | 441.0 | 7639.00 | 2452.52 |
| Data Analysis / Data Visualization | 48 | 1642.65 | 394.0 | 12704.28 | 2935.00 |
| General-Purpose Machine Learning | 121 | 1956.71 | 265.0 | 22742.60 | 8504.22 |
| Natural Language Processing | 38 | 767.58 | 241.0 | 5704.69 | 1394.51 |
| Neural Networks | 8 | 438.50 | 44.5 | 1935.76 | 775.27 |
| Misc Scripts / iPython Notebooks / Codebases | 21 | 564.86 | 30.0 | 6279.20 | 1642.84 |
| Kaggle Competition Source Code | 22 | 44.00 | 27.0 | 181.48 | 48.55 |

With respect to forks, the categories that stand out are computer vision and reinforcement learning, which suggests that they spawn more derived variants.


Open Issues

There are 45 repositories with a value of 0.
| Category | Frequency | Mean | Median | 99th percentile | Std. dev. |
| --- | --- | --- | --- | --- | --- |
| Data Analysis / Data Visualization | 48 | 579.54 | 172.5 | 4131.06 | 982.91 |
| General-Purpose Machine Learning | 121 | 250.09 | 42.0 | 2327.80 | 1008.35 |
| Reinforcement Learning | 11 | 307.27 | 38.0 | 2434.90 | 790.63 |
| Computer Vision | 24 | 128.67 | 32.5 | 675.72 | 195.68 |
| Natural Language Processing | 38 | 85.76 | 29.5 | 692.58 | 154.69 |
| Neural Networks | 8 | 11.75 | 6.0 | 37.44 | 14.56 |
| Misc Scripts / iPython Notebooks / Codebases | 21 | 38.38 | 1.0 | 406.60 | 105.67 |
| Kaggle Competition Source Code | 22 | 1.32 | 0.0 | 9.58 | 2.83 |

Finally, we see more open issues in data analysis, general-purpose machine learning, and natural language processing projects. It is important to note that the figure for data analysis is much higher than for the other categories, which may be explained by greater overall activity. This hypothesis is consistent with the commit counts, which are also higher for that category.


Test adoption in ML projects

Which automated testing tools are being used for ML?

We found that the categories adopt different tools to varying degrees. More specifically, we identified 35 tools used in practice for testing-related activities. Some of them are testing frameworks, like unittest, which ships with Python's standard library, and pytest, the most widely used third-party testing framework. Nose is also used, but mostly in legacy code, given that the library is no longer actively maintained.

Meanwhile, other packages serve more specific purposes, like behavior mocking (e.g. request_mock, pytest_mock, fakeredis, mongomock), property-based testing (e.g. hypothesis), auxiliary test functions (e.g. test_utils, test_helper), and even executable tests embedded in documentation (e.g. doctest, examples).
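As an illustration of how such usage can be detected, the sketch below parses every Python file in a repository and intersects the imported top-level modules with a list of known testing packages; the `TEST_TOOLS` set here is only a small, illustrative subset of the 35 tools:

```python
# Sketch: detect testing tools by scanning a repository's imports.
import ast
from pathlib import Path

# Small illustrative subset of the testing packages identified in this study.
TEST_TOOLS = {"unittest", "pytest", "nose", "hypothesis", "doctest", "mongomock"}

def tools_used(repo_root: str) -> set:
    """Top-level modules imported anywhere in the repo that are test tools."""
    found = set()
    for path in Path(repo_root).rglob("*.py"):
        try:
            tree = ast.parse(path.read_text(encoding="utf-8", errors="ignore"))
        except SyntaxError:
            continue  # skip files that do not parse (e.g. Python 2 code)
        for node in ast.walk(tree):
            if isinstance(node, ast.Import):
                found.update(alias.name.split(".")[0] for alias in node.names)
            elif isinstance(node, ast.ImportFrom) and node.module:
                found.add(node.module.split(".")[0])
    return found & TEST_TOOLS
```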

In terms of quality, what does the existence of tests say about the projects?

| Characteristic | With automated tests (N = 199) | Without automated tests (N = 94) | Percentage difference |
| --- | --- | --- | --- |
| **Lines of code (LOC)** | | | |
| Minimum | 51 | 26 | 96.15% |
| Mean | 61626.4254144 | 3308.5 | 1762.67% |
| Median | 13193 | 1415 | 832.37% |
| Maximum | 2235379 | 24290 | 9102.88% |
| **Cognitive Complexity (CC)** | | | |
| Minimum | 2 | 2 | 0% |
| Mean | 8600.8618785 | 747.5882353 | 1050.48% |
| Median | 1686 | 303 | 456.44% |
| Maximum | 413988 | 9237 | 4381.84% |
| **Duplicated lines (%)** | | | |
| Minimum | 0 | 0 | undefined* |
| Mean | 5.6845304 | 6.2764706 | -9.43% |
| Median | 2.7 | 0.65 | 315.38% |
| Maximum | 69.7 | 76.5 | -8.89% |
| **Bugs (normalized by CC)** | | | |
| Minimum | 0 | 0 | undefined* |
| Mean | 0.0046534 | 0.0023059 | 101.8% |
| Median | 0.0021529 | 0 | undefined* |
| Maximum | 0.0517879 | 0.0314465 | 64.69% |
| **Reliability Remediation Effort in minutes (normalized by CC)** | | | |
| Minimum | 0 | 0 | undefined* |
| Mean | 0.0293257 | 0.0149354 | 96.35% |
| Median | 0.011811 | 0 | undefined* |
| Maximum | 0.3628741 | 0.1509434 | 140.4% |
| **Smells (normalized by CC)** | | | |
| Minimum | 0.0287356 | 0 | undefined* |
| Mean | 0.2382076 | 0.2754192 | -13.51% |
| Median | 0.1421053 | 0.199278 | -28.69% |
| Maximum | 3.5 | 1 | 250% |
| **Technical Debt in minutes (normalized by CC)** | | | |
| Minimum | 0.1609195 | 0 | undefined* |
| Mean | 1.3037221 | 1.4150224 | -7.87% |
| Median | 0.943281 | 1.3280922 | -28.97% |
| Maximum | 11.1098097 | 3.7192623 | 198.71% |
| **Vulnerabilities (normalized by CC)** | | | |
| Minimum | 0 | 0 | undefined* |
| Mean | 0.000066 | 0.0001661 | -60.27% |
| Median | 0 | 0 | undefined* |
| Maximum | 0.0030958 | 0.0058824 | -47.37% |

\* The difference cannot be determined.
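The percentage-difference column corresponds to the usual relative-change formula, with a zero baseline yielding the "undefined*" entries. A small sketch that reproduces two of the medians above:

```python
# How the "Percentage difference" column appears to be computed: relative
# change of the with-tests value over the without-tests baseline. A zero
# baseline makes the difference undefined, matching the asterisk above.
def pct_diff(with_tests: float, without_tests: float) -> str:
    if without_tests == 0:
        return "undefined*"
    return f"{(with_tests - without_tests) / without_tests * 100:.2f}%"

assert pct_diff(13193, 1415) == "832.37%"  # LOC, median
assert pct_diff(1686, 303) == "456.44%"    # Cognitive Complexity, median
```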

In Table X we can see some statistics about size, complexity, and quality derived from static analysis with SonarQube. We found that the projects that use at least one of the testing mechanisms we identified have higher cognitive complexity (456.44% higher at the median), meaning that they are harder to understand overall. That complexity is also accompanied by larger code bases in terms of lines of code (832.37% higher at the median).

In terms of quality, we saw, on average, more duplicated lines in projects without test mechanisms (-9.43% for projects with tests). But since the median for projects with tests is higher (315.38%), we can attribute the higher average to outliers among the projects without tests. As for the normalized number of bugs and the reliability remediation effort (an estimate of the time needed to fix those bugs), we consistently saw more bugs and more effort per unit of complexity in projects with tests, probably because of their size.

Meanwhile, the opposite holds for the normalized number of code smells and for technical debt (an estimate of the time needed to fix those smells): projects with tests have fewer smells (-28.69% at the median) and less debt (-28.97% at the median) per unit of complexity. The same goes for vulnerabilities, where projects with some test mechanism have fewer occurrences overall (-60.27% on average).
