Numerous software attribute measurements have been created and
researched in an effort to describe the underlying structure of computer
programs. Several of these metrics have been shown to correlate with
empirical data on program quality. However, most measures of software
quality can only be established over a long period of time: a program
of any size may require years of operation before its latent flaws are
discovered [1].
In recent studies, metrics of software
complexity have been shown to be highly correlated with how
defects are distributed among computer programs. Some complexity
measures are also related to the number of modifications
attributable to flaws subsequently discovered during testing and
validation [2].
In this study, the researchers investigated the
correlation between a metric for software complexity, namely the number
of source lines of code in projects, and the number of bug fixes that
had to be made. This was done to determine the viability of code
length as a predictor of the number of flaws that affect the code
quality of programming projects. Investigating this correlation
contributes to the foundation of research on predicting faults in
specific lines of code [3] and in entire software projects [4].
The dataset was obtained from a related study conducted by Kochhar et al., entitled A Large Scale Study of Multiple Programming Languages and Code Quality [5]. That study examines how projects that use multiple programming languages compare with projects written in only one in terms of code quality. Its authors obtained the data by retrieving the code and GitHub Archive records of the top 50 programs for each of the top 17 programming languages. For this analysis, the focus is on program size and its relationship with bug fixes and errors, with the programming language also taken into consideration.
To obtain the data in the previously mentioned study, the authors made use of the GitHub Archives of the projects. For the variables crucial to this analysis, they examined the commit logs, where bug fixes are visible, and calculated program size from the total number of lines of code.
Given the data that the study focuses on, the group will perform linear regression and correlation analysis. The relationship between the two variables will be tested through the slope of the regression line, β1, which gives the following hypotheses:
H0: β1 = 0 (null)
H1: β1 ≠ 0 (alternative)
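As a rough illustration of what the slope test involves, the following Python sketch fits an ordinary least squares line and computes the t statistic for β1. It is not the procedure used in this study, which relied on Jamovi. The function name and the five data points are hypothetical and are not taken from the dataset in [5].

```python
import math

def simple_linear_regression(x, y):
    """Fit y = b0 + b1*x by ordinary least squares and test H0: b1 = 0."""
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("need two equal-length samples with at least 3 points")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of squares and cross-products around the means
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b1 = sxy / sxx                      # estimated slope
    b0 = mean_y - b1 * mean_x           # estimated intercept
    # Residual sum of squares and the standard error of the slope
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    se_b1 = math.sqrt(sse / (n - 2) / sxx)
    t_stat = b1 / se_b1 if se_b1 > 0 else math.inf
    return {"intercept": b0, "slope": b1, "se_slope": se_b1,
            "t": t_stat, "df": n - 2}

# Hypothetical illustrative values only (NOT the study's data)
sloc_example = [1.0, 2.0, 3.0, 4.0, 5.0]
bug_fixes_example = [2.0, 4.0, 5.0, 4.0, 5.0]

fit = simple_linear_regression(sloc_example, bug_fixes_example)
print(fit)
```

The t statistic is compared against a t distribution with n - 2 degrees of freedom; a large absolute value is evidence against H0: β1 = 0.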
Figure 1. Interface of Jamovi.
Table 1. Data inputted into Jamovi.
Figure 2. Dropdown menu after clicking Regression
The first step is to click ‘Linear Regression’. This takes
the user to a panel where they can select which of the variables they
entered are the independent and dependent variables.
Figure 3. Selection of variables.
Select a variable in the first menu and click the arrow
corresponding to the role it should take. Independent
variables go into “Covariates” and the dependent variable goes into its
corresponding box. Click the grey arrow at the top right of Figure 3,
after which the program calculates the corresponding values and displays
them in a table. To create a scatter plot, click “Exploration”, seen in
Figure 1, and then click “Scatter Plot”.
Figure 4. Interface for Scatter plot
Input the corresponding variables into the respective axes
(independent on X, dependent on Y). This will generate the scatter plot.
To add the regression line, click “Linear” under ‘Regression
Line’, which is set to “None” by default. To see labels for specific
data points (without the regression line), input “Coding Language” into
Group.
All generated tables and graphs are shown on the right-hand side of the interface shown in Figure 1, replacing the Jamovi logo and version number.
The ‘Linear Regression’ table generated from the gathered data is as
follows:
Table 2. Linear Regression table.
The important values in
this table are r (the correlation coefficient) and the p-value. The
relationship they describe is visualized in the generated scatter plot.
Figure 5. Scatter plot of two variables with regression line.
Figure 6. Labels for each data point. The top right data point refers to C, while the bottom right data point refers to TypeScript.
Although C and TypeScript have almost the same number of source lines
of code (SLOC) but differ considerably in the number of bug
fixes, the scatter plots shown in the figures above display an upward
trend between the two variables.
To determine the strength of
this relationship, the correlation coefficient, r, is calculated. Table
2 shows the regression table associated with the scatter plot, and the
data set is found to have a correlation coefficient of 0.666, a positive
value closer to 1 than to 0. Hence, the source lines of code and the
number of bug fixes have a moderately strong positive linear correlation.
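For readers unfamiliar with how r is obtained, the following Python sketch computes the Pearson correlation coefficient from its definition. It is only an illustration under stated assumptions: the function name and the sample values are hypothetical and are not the data behind Table 2, so the result will not equal 0.666.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of squared deviations and of cross-products
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return sxy / math.sqrt(sxx * syy)

# Hypothetical illustrative values only (NOT the study's data)
sloc_demo = [1.0, 2.0, 3.0, 4.0, 5.0]
fixes_demo = [2.0, 4.0, 5.0, 4.0, 5.0]

r_demo = pearson_r(sloc_demo, fixes_demo)
print(round(r_demo, 3))
```

Values of r near +1 or -1 indicate a strong linear relationship, while values near 0 indicate little or no linear relationship.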
It can therefore be
concluded from the values obtained and the figures presented that there
is a direct linear relationship between the source lines of code and the
number of bug fixes. In other words, as the SLOC increases, the number
of bug fixes tends to increase as well.
The dataset used in this paper is taken from the study of Kochhar et
al., entitled A Large Scale Study of Multiple Programming Languages
and Code Quality [5]. The parameters of interest are the
source lines of code (SLOC) and the number of bug fixes, as the
independent variable and dependent variable, respectively. Two
hypotheses were set up to test whether the relationship between these
variables is statistically significant. The null hypothesis is given by
H0: β1 = 0, while the alternative hypothesis is denoted by H1: β1 ≠ 0.
The analysis
of the dataset was done with the help of software. The calculated
correlation coefficient (R) is equal to 0.666, while the coefficient of
determination (R²) is roughly equal to 0.444. Since R is reasonably
close to 1, this suggests that there is a substantial positive
correlation, or direct relationship, between the number of bug fixes and
the SLOC. Moreover, the SLOC accounts for 44.4% of the variation in the
number of bug fixes.
The standard error of the slope (SE) is 2.81.
The test statistic is t = 2.53, and the probability of observing a
value at least this extreme under the null hypothesis (the p-value)
is 0.036, which is less than the level of significance
(α = 0.05).
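The reported figures can be cross-checked against one another with a short Python sketch. It uses the reported r = 0.666 and the identity t = r·sqrt(n − 2)/sqrt(1 − r²) for simple linear regression. The number of data points is not stated here, so n = 10 is an assumption, chosen only because it reproduces the reported t. The helper names are hypothetical, and the p-value comes from numerical integration rather than from Jamovi's own routine.

```python
import math

def t_pdf(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + t * t / df) ** (-(df + 1) / 2)

def two_sided_p(t_stat, df, steps=2000):
    """Two-sided p-value, integrating the density on [0, |t|] by Simpson's rule."""
    upper = abs(t_stat)
    h = upper / steps                      # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_half = total * h / 3           # P(0 <= T <= |t|)
    return 1.0 - 2.0 * central_half

r = 0.666          # correlation coefficient reported in the text
n_assumed = 10     # ASSUMPTION: sample size is not stated in the text
df = n_assumed - 2

r_squared = r ** 2                                        # coefficient of determination
t_from_r = r * math.sqrt(df) / math.sqrt(1 - r_squared)   # t statistic implied by r and n
p_value = two_sided_p(t_from_r, df)

print(round(r_squared, 3), round(t_from_r, 2), round(p_value, 3))
```

Under this assumption, the computed R², t, and p-value should land close to the reported 0.444, 2.53, and 0.036, which suggests that the reported statistics are mutually consistent.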
The data gathered therefore provide enough evidence to reject the
null hypothesis. Ultimately, the source lines of code are a
statistically significant predictor of the number of bug fixes across
the programming languages, and therefore the two variables have a
relationship with one another.