Predicting Software Defects with Linear Regression
2025-11-09
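Short release cycles and large code bases make it valuable to predict where defects are likely to occur. In this lesson we fit a simple linear regression of defects on lines of code (LOC), add a complexity measure as a second predictor, and check the residuals of the two-predictor model.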

| | LOC | Complexity | Defects |
|---|---|---|---|
| 1 | 632 | 41 | 9.628756 |
| 2 | 731 | 29 | 9.193603 |
| 3 | 1268 | 34 | 18.236121 |
| 4 | 821 | 37 | 9.843393 |
| 5 | 839 | 62 | 15.653897 |
| 6 | 1315 | 32 | 19.124724 |

\[ \text{Defects} = \beta_0 + \beta_1 \cdot \text{LOC} + \varepsilon \]
\[ \text{Defects} = \beta_0 + \beta_1 \cdot \text{LOC} + \beta_2 \cdot \text{Complexity} + \varepsilon \]
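
The two specifications above can be fitted with `lm()`. The sketch below is only illustrative: it uses the six rows displayed earlier rather than the full data set, so its coefficients will not match the tables in this lesson. The name `mod1` is an assumption; `df` and `mod2` follow the names used in the residual code later on.

```r
# First six rows as displayed in the lesson; the full data set is not shown here.
df <- data.frame(
  LOC        = c(632, 731, 1268, 821, 839, 1315),
  Complexity = c(41, 29, 34, 37, 62, 32),
  Defects    = c(9.628756, 9.193603, 18.236121, 9.843393, 15.653897, 19.124724)
)

# Simple model: Defects explained by LOC only (mod1 is a hypothetical name).
mod1 <- lm(Defects ~ LOC, data = df)

# Multiple model: LOC plus Complexity.
mod2 <- lm(Defects ~ LOC + Complexity, data = df)

# Estimates from six rows will differ from those fitted on the full data.
print(coef(mod1))
print(coef(mod2))
```

Calling `summary(mod1)` or `summary(mod2)` on these objects reports estimates, standard errors, test statistics, and p-values for each term.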

Coefficient estimates for the LOC-only model:

| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 5.475 | 0.729 | 7.513 | 0.000 |
| LOC | 0.009 | 0.001 | 10.248 | 0.000 |

Coefficient estimates for the model with LOC and Complexity:

| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 1.141 | 0.825 | 1.383 | 0.169 |
| LOC | 0.008 | 0.001 | 11.389 | 0.000 |
| Complexity | 0.124 | 0.016 | 7.630 | 0.000 |
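
As a rough reading of the table above, consider a hypothetical module with 1000 LOC and a complexity score of 40 (both values are chosen only for illustration). Plugging them into the fitted equation gives:

\[ \widehat{\text{Defects}} = 1.141 + 0.008 \cdot 1000 + 0.124 \cdot 40 = 1.141 + 8.000 + 4.960 = 14.101 \]

Because the reported coefficients are rounded to three decimals, the LOC term in particular is approximate, so the result should be treated as a ballpark figure. Holding LOC fixed, each additional complexity point adds about 0.124 predicted defects.
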
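library(ggplot2)

# Store residuals and fitted values from the two-predictor model (mod2)
df$resid_mod2 <- residuals(mod2)
df$fitted_mod2 <- fitted(mod2)

# Residuals vs fitted values, with a dashed reference line at zero
p3 <- ggplot(df, aes(x = fitted_mod2, y = resid_mod2)) +
  geom_point(alpha = 0.6) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(title = "Residuals vs Fitted (mod2)",
       x = "Fitted Values",
       y = "Residuals")
print(p3)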