DAT-301_-HW3.knit

2025-11-09

Title

Predicting Software Defects with Linear Regression

Motivation

Short release cycles and large code bases make defect prediction important. This lesson shows how to:

fit a linear model to predict defect counts from code metrics (lines of code, LOC, and cyclomatic complexity)
visualize results with plots
check assumptions using residual plots

Dataset for Demo

   LOC Complexity   Defects
1  632         41  9.628756
2  731         29  9.193603
3 1268         34 18.236121
4  821         37  9.843393
5  839         62 15.653897
6 1315         32 19.124724

Single Predictor Model (LaTeX)

\[ \text{Defects} = \beta_0 + \beta_1 \cdot \text{LOC} + \varepsilon \]

Multiple Predictor Model (LaTeX)

\[ \text{Defects} = \beta_0 + \beta_1 \cdot \text{LOC} + \beta_2 \cdot \text{Complexity} + \varepsilon \]
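
As a minimal sketch of how the two models above could be fit in R, the block below uses `lm()` on only the six rows shown in the dataset preview. The names `df`, `mod1`, and `mod2` are assumptions chosen to match the residual plot code later in the deck. Because only six rows are used here, the estimates will not match the summary tables, which were presumably computed on the full demo data.

```r
# The six rows from the dataset preview; the full demo data is not reproduced here
df <- data.frame(
  LOC        = c(632, 731, 1268, 821, 839, 1315),
  Complexity = c(41, 29, 34, 37, 62, 32),
  Defects    = c(9.628756, 9.193603, 18.236121, 9.843393, 15.653897, 19.124724)
)

# Single predictor model: Defects ~ LOC
mod1 <- lm(Defects ~ LOC, data = df)

# Multiple predictor model: Defects ~ LOC + Complexity
mod2 <- lm(Defects ~ LOC + Complexity, data = df)

# Estimated coefficients: (Intercept) is beta_0, LOC is beta_1, Complexity is beta_2
print(coef(mod1))
print(coef(mod2))
```

Calling `summary(mod1)` or `summary(mod2)` would additionally report standard errors, t statistics, and p-values for each term.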

Summary of Model

Model 1: Defects ~ LOC
term	estimate	std.error	statistic	p.value
(Intercept)	5.475	0.729	7.513	0.000
LOC	0.009	0.001	10.248	0.000

Model 2: Defects ~ LOC + Complexity
term	estimate	std.error	statistic	p.value
(Intercept)	1.141	0.825	1.383	0.169
LOC	0.008	0.001	11.389	0.000
Complexity	0.124	0.016	7.630	0.000
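
To illustrate how the Model 2 estimates translate into a prediction, the sketch below plugs the rounded coefficients from the table into the regression equation. The module with 1000 LOC and a complexity of 40 is a hypothetical example, not a row from the dataset, and the result is approximate because the table rounds each coefficient to three decimals.

```r
# Coefficients copied from the Model 2 summary table (rounded to three decimals)
b0   <- 1.141   # (Intercept)
b_loc <- 0.008  # LOC
b_cx  <- 0.124  # Complexity

# Hypothetical module used only for illustration
loc <- 1000
complexity <- 40

# Predicted defects = beta_0 + beta_1 * LOC + beta_2 * Complexity
predicted_defects <- b0 + b_loc * loc + b_cx * complexity
print(predicted_defects)
```

With these rounded values the prediction is about 14.1 defects. Read the same way, each additional 100 lines of code adds roughly 0.8 predicted defects when complexity is held fixed.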

Scatterplot of Defects vs LOC with Regression Line

Scatterplot of Defects vs Cyclomatic Complexity

3D Scatterplot: LOC vs Complexity vs Defects

Residuals vs Fitted Plot

R Code for Residuals vs Fitted Plot

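library(ggplot2)

# Assumes mod2 was fit earlier as lm(Defects ~ LOC + Complexity, data = df)
# Store the residuals and fitted values from Model 2 alongside the data
df$resid_mod2 <- residuals(mod2)
df$fitted_mod2 <- fitted(mod2)

# Residuals vs fitted values; the dashed line marks a residual of zero
p3 <- ggplot(df, aes(x = fitted_mod2, y = resid_mod2)) +
  geom_point(alpha = 0.6) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(title = "Residuals vs Fitted (mod2)",
       x = "Fitted Values",
       y = "Residuals")
print(p3)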
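
To make explicit what the plot above displays, the following sketch checks in base R that each residual is the observed value minus the fitted value, and that the residuals of a least-squares model with an intercept average to zero. It refits the model on only the six rows from the dataset preview, so `df` and `mod2` here are small stand-ins under that assumption, not the objects from the full demo.

```r
# The six rows from the dataset preview; a stand-in for the full demo data
df <- data.frame(
  LOC        = c(632, 731, 1268, 821, 839, 1315),
  Complexity = c(41, 29, 34, 37, 62, 32),
  Defects    = c(9.628756, 9.193603, 18.236121, 9.843393, 15.653897, 19.124724)
)
mod2 <- lm(Defects ~ LOC + Complexity, data = df)

# A residual is the observed response minus the model's fitted value
resid_manual <- df$Defects - fitted(mod2)
print(all.equal(unname(resid_manual), unname(residuals(mod2))))

# With an intercept in the model, the residuals average to (numerically) zero,
# which is why the dashed reference line in the plot sits at y = 0
print(mean(residuals(mod2)))
```

In the plot itself, points scattered evenly around the dashed line with no curve or funnel shape are consistent with the linearity and constant-variance assumptions.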