Introduction

Numerous software attribute measurements have been created and researched in an effort to describe the underlying structure of computer programs. It has been demonstrated that several of these indicators are connected to real data on program quality. However, most measures of software quality are often created over a longer period of time. Any size of program requires years of operation before latent flaws are discovered [1].

In recent studies, metrics of software complexity have been demonstrated to be highly correlated with how defects are distributed among computer programs. There is a connection between some complexity measures and the quantity of modifications attributable to flaws subsequently discovered in test and validation [2].

In this study, the researchers investigated the correlation between a metric for software complexity, namely the number of source lines of code in projects, and the number of bug fixes that had to be accomplished. This was performed in order to determine the viability of code length as a predictor for the quantity of flaws that affect code quality of programming projects. Investigating such correlation contributes to the foundation of research on the development of predictors of faults in specific lines of code [3] and even overall software projects [4].


Methodology

The dataset was obtained through a similar study committed by Kocchar, et. al, entitled A Large Scale Study of Multiple Programming Languages and Code Quality [5]. The study aims to observe how using different programming languages compare to programs in only one in terms of quality of the program created with said language. In the study, the authors obtained the data by retrieving the code and GitHub Archive of the top 50 programs for each of the top 17 programming languages. For this specific analysis, the focus will be on the program size and it’s relationship with bug fixes and errors, with the programming language also in consideration.

To attain the data in the previously mentioned study, the authors made use of the GitHub Archives of the projects, and for the variables crucial to this analysis, the authors observed the commit logs where the bug fixes are visible, while to obtain the program size, calculations were made using the total number of lines of code.

Given the data that the study will be focusing on, the group will be performing Linear Regression and Correlation Analysis. The relationship between the two variables will be determined, which means the hypotheses are the following:

Ho: β1 = 0 (null)
H1: β1 ≠ 0 (alternative)

To perform the analysis of the two variables, the free software Jamovi was used. This software allows the user to input their variables and data and perform statistical analysis on them automatically, such as T-Tests, ANOVA, and most importantly, Linear Regression.

Figure 1. Interface of Jamovi.


Like mentioned previously, the first step is to input the independent and dependent variables.

Table 1. Data inputted into Jamovi.



In the case of the study, the independent variable is the SLOC or Source Lines of Code and the dependent variable is the Number of Bug Fixes that occur. The Coding Language column serves as labels and were the sources of each of the data points in the original data set. The next step is to click the ‘Regression’ button which can be seen in Figure 1. From here, a dropdown menu will appear.

Figure 2. Dropdown menu after clicking Regression


The first step is to press ‘Linear Regression’. This will take the user to a section where they can select which of the variables they placed is the independent and dependent variable.


Figure 3. Selection of variables.


Simply select the variable in the first menu and click the corresponding arrow to which variable it should be. Independent variables go into “Covariates” and the dependent variable goes into its corresponding box. Click the grey arrow at the top right of Figure 3, to which the program calculates for the corresponding values and displays them in a table. To create a scatter plot, click “Exploration” seen in Figure 1. From there, click “Scatter Plot”.


Figure 4. Interface for Scatter plot


Input the corresponding values into the respective axes (independent on X, dependent on Y). This will generate the scatter plot. To add the regression line, simply click “Linear” under ‘Regression Line’, which is on None by default. To see labels for specific data points (but remove the regression line), input “Coding Language” into Group.

All generated tables and graphs will be shown on the right hand side of the interface shown in Figure 1. (replacing the Jamovi logo and version number)


Results

The ‘Linear Regression’ table generated with the data gathered is as shown:

Table 2. Linear Regression table.


The data that is important from this table is the r (correlation coefficient) and the p-value. All of this is visualized in the scatter plot that was generated.


Figure 5. Scatter plot of two variables with regression line.


Figure 6. Labels for each data point. The top right data point refers to C, while the bottom right data point refers to TypeScript.


Despite C and TypeScript having almost the same source lines of code (SLOC) yet having a significant difference in the number of bug fixes, the scatterplot shown in the figures above displays an upward trend between the two variables.

To determine the strength of this relationship, the correlation coefficient, r, is calculated. Table 2 shows the regression table associated with the scatterplot, and it is found that the data set has a correlation coefficient of 0.666 — a value close to 1. Hence, the source lines of code and the number of bug fixes have a strong linear positive correlation.

It can therefore be concluded from the values obtained and the figures presented that there is a direct linear relationship between the source lines of code and the number of bug fixes. In other words, as the SLOC increases, the number of bug fixes increases relatively.


Discussion

The dataset used in this paper is lifted from the study of Kochhar et al., entitled A Large Scale Study of Multiple Programming Languages and Code Quality [5]. The parameters of interest are focused on the source lines of code (SLOC) and the number of bug fixes as the independent variable and dependent variable, respectively. Two hypotheses were tested to determine whether they were statistically significant. The null hypothesis is given by Ho: β1 = 0 while the alternative hypothesis is denoted by H1: β1 ≠ 0.

The analysis of the dataset is done with the help of software. The calculated correlation coefficient (R) is equal to 0.666, while the squared correlation coefficient (R2) is roughly equal to 0.444. Since R is near the value of 1, this suggests that there is a substantial correlation or direct relationship between the number of bug fixes and the SLOC. Moreover, the bug commits account for 44.4% of the variation in the source lines of code.

The standard error slope (SE) is 2.81. The data also revealed that the probability of observing the test statistic (t = 2.53) is 0.036. Furthermore, the dataset’s p-value is 0.036, which is less than the level of significance (a= 0.05).


Conclusion and Recommendations

The data that was gathered is enough evidence to reject the null hypothesis. Ultimately, the source lines of code are statistically significant for the number of bug fixes in the programming languages, and therefore they have a relationship with one another.

In light of this, the group recommends that future researchers inspect the correlation of other variables and delve deeper into the factors that influence the code quality of programming projects. Living in the 21st century, it is crucial for future developers to assess the risks posed by bugs in implementing software projects. It would be beneficial to investigate additional statistically significant variables for the quality of programming projects in order to reduce the likelihood of errors and other quality problems when creating intricate software projects.