2025-11-09

Title

Predicting Software Defects with Linear Regression

Motivation

Short release cycles and large code bases make it important to predict where defects are likely to occur. This lesson shows how to:

  • fit a linear model that predicts defect counts from code metrics: lines of code (LOC) and cyclomatic complexity
  • visualize the data and the fitted model with plots
  • check model assumptions using residual plots

Dataset for Demo

   LOC Complexity   Defects
1  632         41  9.628756
2  731         29  9.193603
3 1268         34 18.236121
4  821         37  9.843393
5  839         62 15.653897
6 1315         32 19.124724

Single Predictor Model (LaTeX)

\[ \mathrm{Defects} = \beta_0 + \beta_1 \cdot \mathrm{LOC} + \varepsilon \]

Multiple Predictor Model (LaTeX)

\[ \mathrm{Defects} = \beta_0 + \beta_1 \cdot \mathrm{LOC} + \beta_2 \cdot \mathrm{Complexity} + \varepsilon \]
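
Both models can be fitted in R with lm(). The sketch below is a minimal example that assumes only the six rows shown in the dataset table. The names df_demo and mod1 are chosen here for illustration (the lesson's own code refers to mod2), and the estimates it produces will not match a summary fitted on a larger dataset.

```r
# Six rows copied from the dataset table (illustrative subset only)
df_demo <- data.frame(
  LOC        = c(632, 731, 1268, 821, 839, 1315),
  Complexity = c(41, 29, 34, 37, 62, 32),
  Defects    = c(9.628756, 9.193603, 18.236121, 9.843393, 15.653897, 19.124724)
)

# Single-predictor model: Defects explained by LOC
mod1 <- lm(Defects ~ LOC, data = df_demo)

# Multiple-predictor model: Defects explained by LOC and Complexity
mod2 <- lm(Defects ~ LOC + Complexity, data = df_demo)

# Estimated coefficients (beta_0, beta_1 and, for mod2, beta_2)
print(coef(mod1))
print(coef(mod2))
```

Calling summary(mod2) on the fitted object reports standard errors, t statistics and p-values for each term.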

Summary of Model

Model 1: Defects ~ LOC

term         estimate  std.error  statistic  p.value
(Intercept)     5.475      0.729      7.513    0.000
LOC             0.009      0.001     10.248    0.000

Model 2: Defects ~ LOC + Complexity

term         estimate  std.error  statistic  p.value
(Intercept)     1.141      0.825      1.383    0.169
LOC             0.008      0.001     11.389    0.000
Complexity      0.124      0.016      7.630    0.000
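
To see how the Model 2 estimates turn into a prediction, the sketch below plugs the first row of the dataset table (LOC = 632, Complexity = 41) into the rounded coefficients from the summary. The variable names are illustrative, and because the coefficients are rounded to three decimals the result is only approximate.

```r
# Rounded Model 2 estimates taken from the summary table
b0 <- 1.141   # intercept
b1 <- 0.008   # LOC
b2 <- 0.124   # Complexity

# First row of the dataset table
loc <- 632
complexity <- 41
observed <- 9.628756

# Predicted defects = b0 + b1 * LOC + b2 * Complexity
predicted <- b0 + b1 * loc + b2 * complexity

# Residual = observed - predicted
residual <- observed - predicted

print(predicted)
print(residual)
```

With these rounded values the prediction is about 11.28 defects against an observed 9.63, a residual of about -1.65.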

Scatterplot of Defects vs LOC with Regression Line
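
A plot of this kind could be produced with ggplot2, the package the lesson's residual-plot code already uses. This is a sketch under stated assumptions, not the code behind the original figure: df_demo and p1 are illustrative names, and the data are only the six rows from the dataset table.

```r
library(ggplot2)

# Six rows copied from the dataset table (illustrative subset only)
df_demo <- data.frame(
  LOC     = c(632, 731, 1268, 821, 839, 1315),
  Defects = c(9.628756, 9.193603, 18.236121, 9.843393, 15.653897, 19.124724)
)

# Scatterplot of Defects against LOC with a least-squares line overlaid
p1 <- ggplot(df_demo, aes(x = LOC, y = Defects)) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  labs(title = "Defects vs LOC",
       x = "Lines of Code (LOC)",
       y = "Defects")
print(p1)
```

Setting se = FALSE hides the confidence band, which would be very wide with only six points.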

Scatterplot of Defects vs Cyclomatic Complexity

3D Scatterplot: LOC vs Complexity vs Defects

Residuals vs Fitted Plot

R code for Residuals vs Fitted Plot

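library(ggplot2)

# Assumes df holds the dataset and mod2 is the fitted Model 2,
# e.g. mod2 <- lm(Defects ~ LOC + Complexity, data = df)

# Store residuals and fitted values alongside the data
df$resid_mod2 <- residuals(mod2)
df$fitted_mod2 <- fitted(mod2)

# Residuals should scatter evenly around the dashed zero line
p3 <- ggplot(df, aes(x = fitted_mod2, y = resid_mod2)) +
  geom_point(alpha = 0.6) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(title = "Residuals vs Fitted (mod2)",
       x = "Fitted Values",
       y = "Residuals")
print(p3)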