2025-03-15

Introduction

In software development, predicting defects is essential for improving quality and reducing maintenance costs. This presentation demonstrates how to use logistic regression to estimate the probability of a module being defective based on metrics like Lines of Code (LOC) and Complexity.

Logistic Regression Theory

Logistic regression models the probability of a binary outcome. The model is defined as:

\[ P(Y=1 \mid X) = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}} \]

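Here, Y = 1 indicates that a software module is defective, and X represents a single software metric (for example, LOC). The intercept \(\beta_0\) and the slope \(\beta_1\) are the coefficients to be estimated from the data.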
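As a quick numerical illustration of this formula, the sketch below evaluates it for a few LOC values and compares the result with base R's `plogis()`. The coefficients `beta0 = -5` and `beta1 = 0.01` are hypothetical values chosen only for illustration; they are not estimates from this presentation.

```r
# Hypothetical coefficients, chosen only for illustration
beta0 <- -5
beta1 <- 0.01

# P(Y = 1 | X), written exactly as in the formula above
defect_prob <- function(x) {
  eta <- beta0 + beta1 * x
  exp(eta) / (1 + exp(eta))
}

loc <- c(300, 500, 700)
p_manual <- defect_prob(loc)

# plogis() is base R's logistic function and should give the same values
p_builtin <- plogis(beta0 + beta1 * loc)

print(data.frame(LOC = loc, p_manual, p_builtin))
```

With these illustrative values the linear predictor is zero at LOC = 500, so the predicted probability there is 0.5.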

Logit Transformation and Odds Ratio

To linearize the relationship, I use the logit transformation:

\[ \text{logit}(P) = \log\left(\frac{P}{1-P}\right) = \beta_0 + \beta_1 X \]

The odds ratio, which quantifies the multiplicative change in the odds of a defect for a one-unit increase in X, is given by:

\[ \text{Odds Ratio} = e^{\beta_1} \]
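To check this numerically, the sketch below computes the odds at X and at X + 1 and compares their ratio with \(e^{\beta_1}\). The coefficient values are hypothetical and serve only as an illustration.

```r
# Hypothetical coefficients, chosen only for illustration
beta0 <- -5
beta1 <- 0.01

# Odds of a defect at a given value of X
odds <- function(x) {
  p <- plogis(beta0 + beta1 * x)  # P(Y = 1 | X = x)
  p / (1 - p)
}

# Ratio of odds for a one-unit increase in X, at two different starting points
ratio_at_400 <- odds(401) / odds(400)
ratio_at_600 <- odds(601) / odds(600)
print(c(ratio_at_400, ratio_at_600, exp(beta1)))

# qlogis() is the logit function, so a one-unit step in X shifts the logit by beta1
logit_step <- qlogis(plogis(beta0 + beta1 * 401)) -
  qlogis(plogis(beta0 + beta1 * 400))
print(logit_step)
```

The ratio does not depend on the starting value of X, which is what makes the odds ratio a convenient summary of the effect of a predictor.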

Data Simulation and R Code

The data are simulated in R for 100 software modules with two predictors: Lines of Code (LOC) and Complexity. The defect status is generated using a logistic model. The first six rows of the simulated data are:

##        LOC Complexity defect
## 1 443.9524   7.868780      0
## 2 476.9823  10.770651      1
## 3 655.8708   9.259924      1
## 4 507.0508   8.957372      1
## 5 512.9288   7.145144      1
## 6 671.5065   9.864917      1
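A simulation along these lines might look like the following sketch. The seed and the normal distribution parameters are assumptions picked because they appear consistent with the LOC and Complexity values printed above. The coefficients of the data-generating model are purely hypothetical, so the `defect` column will generally not match the printed rows. The name `sim_modules` is also an assumption.

```r
set.seed(123)  # assumed seed
n <- 100       # number of software modules

# Predictors: the means and standard deviations are assumptions
LOC <- rnorm(n, mean = 500, sd = 100)
Complexity <- rnorm(n, mean = 10, sd = 3)

# Hypothetical coefficients for the data-generating logistic model
eta <- -6 + 0.008 * LOC + 0.2 * Complexity
p_defect <- plogis(eta)

# Draw the binary defect status from a Bernoulli distribution
defect <- rbinom(n, size = 1, prob = p_defect)

sim_modules <- data.frame(LOC, Complexity, defect)
print(head(sim_modules))
```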

Scatter Plot

The following code creates a scatter plot of LOC vs. defect status using ggplot2. It uses jitter to better display the binary outcome and color-codes points by Complexity.
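A minimal sketch of such a plot is given here. It re-creates illustrative data first so that it runs on its own. The seed, distribution parameters, coefficients, and the name `sim_modules` are assumptions, not values taken from the presentation.

```r
library(ggplot2)

# Re-create illustrative data (seed, distributions, and coefficients are assumptions)
set.seed(123)
n <- 100
LOC <- rnorm(n, mean = 500, sd = 100)
Complexity <- rnorm(n, mean = 10, sd = 3)
defect <- rbinom(n, size = 1,
                 prob = plogis(-6 + 0.008 * LOC + 0.2 * Complexity))
sim_modules <- data.frame(LOC, Complexity, defect)

# Jitter only vertically so the 0/1 outcomes do not overplot; LOC stays exact
scatter_plot <- ggplot(sim_modules,
                       aes(x = LOC, y = defect, color = Complexity)) +
  geom_jitter(width = 0, height = 0.05, alpha = 0.7) +
  scale_y_continuous(breaks = c(0, 1)) +
  labs(title = "LOC vs. Defect Status",
       x = "Lines of Code (LOC)",
       y = "Defect (0 = no, 1 = yes)",
       color = "Complexity")

print(scatter_plot)
```

Restricting the jitter to the vertical direction keeps each point at its true LOC value while still separating overlapping observations.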

Fitted Logistic Regression Curve

I fit a logistic regression model using both LOC and Complexity. For visualization, I fix Complexity at its mean and plot the predicted probability of a defect against LOC.
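The fit and the curve can be sketched as follows, using `glm()` with a binomial family and `predict()` on a grid of LOC values. The data are re-created so the block runs on its own. The seed, distribution parameters, coefficients, and object names are assumptions.

```r
library(ggplot2)

# Re-create illustrative data (seed, distributions, and coefficients are assumptions)
set.seed(123)
n <- 100
LOC <- rnorm(n, mean = 500, sd = 100)
Complexity <- rnorm(n, mean = 10, sd = 3)
defect <- rbinom(n, size = 1,
                 prob = plogis(-6 + 0.008 * LOC + 0.2 * Complexity))
sim_modules <- data.frame(LOC, Complexity, defect)

# Fit the logistic regression with both predictors
fit <- glm(defect ~ LOC + Complexity, data = sim_modules, family = binomial)
print(summary(fit)$coefficients)

# Estimated odds ratios for a one-unit increase in each predictor
print(exp(coef(fit)))

# Prediction grid: vary LOC and hold Complexity at its sample mean
grid <- data.frame(
  LOC = seq(min(sim_modules$LOC), max(sim_modules$LOC), length.out = 200),
  Complexity = mean(sim_modules$Complexity)
)
grid$prob <- predict(fit, newdata = grid, type = "response")

curve_plot <- ggplot(grid, aes(x = LOC, y = prob)) +
  geom_line(color = "blue") +
  geom_point(data = sim_modules, aes(x = LOC, y = defect), alpha = 0.4) +
  labs(title = "Fitted Logistic Regression Curve",
       x = "Lines of Code (LOC)",
       y = "Predicted probability of defect")

print(curve_plot)
```

Using `type = "response"` returns probabilities rather than values on the logit scale, which is what the curve is meant to show.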

Plotly Interactive 3D Plot

The following code creates an interactive 3D plot showing the relationship among LOC, Complexity, and the predicted defect probability.
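One way to build such a plot is a 3D scatter of the modules with the fitted probability on the vertical axis, as in the sketch below. Whether the original used a scatter or a surface is not stated, so the plot type is an assumption, as are the seed, distribution parameters, coefficients, and object names.

```r
library(plotly)

# Re-create illustrative data (seed, distributions, and coefficients are assumptions)
set.seed(123)
n <- 100
LOC <- rnorm(n, mean = 500, sd = 100)
Complexity <- rnorm(n, mean = 10, sd = 3)
defect <- rbinom(n, size = 1,
                 prob = plogis(-6 + 0.008 * LOC + 0.2 * Complexity))
sim_modules <- data.frame(LOC, Complexity, defect)

# Fit the model and attach the predicted defect probability to each module
fit <- glm(defect ~ LOC + Complexity, data = sim_modules, family = binomial)
sim_modules$pred_prob <- predict(fit, type = "response")
sim_modules$status <- factor(sim_modules$defect, levels = c(0, 1),
                             labels = c("Not defective", "Defective"))

# Interactive 3D scatter: LOC and Complexity on the floor, probability as height
plot_3d <- plot_ly(
  sim_modules,
  x = ~LOC, y = ~Complexity, z = ~pred_prob,
  color = ~status,
  type = "scatter3d", mode = "markers",
  marker = list(size = 4)
) %>%
  layout(scene = list(
    xaxis = list(title = "LOC"),
    yaxis = list(title = "Complexity"),
    zaxis = list(title = "Predicted P(defect)")
  ))

plot_3d
```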

Conclusion

Logistic regression is an effective tool for predicting software defects using key metrics such as LOC and Complexity. This presentation demonstrates data simulation, model fitting, and visualization using ggplot2 and plotly.