K-Nearest Neighbor - Predicting Bug Covering by Threshold Voting

The goal of the study

Evaluate predictions based on the minimal number of YES votes required to consider a question bug-covering. For a more detailed explanation, please see the previous analysis.

The study has two goals:

Train a machine learning algorithm that predicts whether a code fragment is related to a failure or not. For that, I originally devised different metrics. The metric that I will explore in the following study is the threshold vote of YES answers.

Building the model

I chose knn.cv (cross-validation) to reduce the risk of a lucky split between the training and testing sets.

Cross-validation was performed with the leave-one-out method.

# build the model with leave-one-out cross-validation (knn.cv is in the class package)
library(class)
fitModel.cv <- knn.cv(trainingData, trainingData$bugCovering, k = 3, l = 0, prob = FALSE, use.all = TRUE)

I have also run the model with different values of k (3, 5, 7, 9), which produced similar results.
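A minimal sketch of how the comparison across values of k could be run. The data frame `toy`, its column `thresholdVote`, and the generated values are assumptions made for illustration and are not the study data. The sketch also passes only the feature column to knn.cv, which keeps the label out of the distance computation; whether the original trainingData was prepared that way is not stated here.

```r
library(class)  # provides knn.cv

# Hypothetical toy data: one numeric feature and a logical label.
set.seed(42)
toy <- data.frame(
  thresholdVote = c(rpois(40, 2), rpois(10, 10)),
  bugCovering   = c(rep(FALSE, 40), rep(TRUE, 10))
)

# Use only the feature column as predictor, so the label is not among the inputs.
features <- toy[, "thresholdVote", drop = FALSE]
labels   <- factor(toy$bugCovering)

# Leave-one-out accuracy for each candidate k.
kValues <- c(3, 5, 7, 9)
accuracies <- sapply(kValues, function(k) {
  predicted <- knn.cv(features, labels, k = k)
  mean(as.character(predicted) == as.character(labels))
})
names(accuracies) <- paste0("k=", kValues)
print(round(accuracies, 3))
```

If the accuracies barely change across k, as reported above for the real data, the choice of k = 3 is not critical.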

Testing the model

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  129 
## 
##  
##                          | fitModel.cv.df[, 1] 
## trainingData$bugCovering |     FALSE |      TRUE | Row Total | 
## -------------------------|-----------|-----------|-----------|
##                    FALSE |       103 |         1 |       104 | 
##                          |     0.990 |     0.010 |     0.806 | 
##                          |     0.981 |     0.042 |           | 
##                          |     0.798 |     0.008 |           | 
## -------------------------|-----------|-----------|-----------|
##                     TRUE |         2 |        23 |        25 | 
##                          |     0.080 |     0.920 |     0.194 | 
##                          |     0.019 |     0.958 |           | 
##                          |     0.016 |     0.178 |           | 
## -------------------------|-----------|-----------|-----------|
##             Column Total |       105 |        24 |       129 | 
##                          |     0.814 |     0.186 |           | 
## -------------------------|-----------|-----------|-----------|
## 
##
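The counts in the table can be condensed into the usual summary metrics. The sketch below copies the four cell counts from the table; the variable names are my own and are not part of the original analysis.

```r
# Counts copied from the cross table above (rows = actual, columns = predicted).
confusion <- matrix(
  c(103,  1,
      2, 23),
  nrow = 2, byrow = TRUE,
  dimnames = list(actual = c("FALSE", "TRUE"), predicted = c("FALSE", "TRUE"))
)

tp <- confusion["TRUE", "TRUE"]    # bug-covering, predicted bug-covering
fp <- confusion["FALSE", "TRUE"]   # not bug-covering, predicted bug-covering
fn <- confusion["TRUE", "FALSE"]   # bug-covering, predicted not bug-covering

accuracy  <- sum(diag(confusion)) / sum(confusion)  # 126 / 129
precision <- tp / (tp + fp)                         # 23 / 24
recall    <- tp / (tp + fn)                         # 23 / 25
print(round(c(accuracy = accuracy, precision = precision, recall = recall), 3))
```

This gives an accuracy of about 0.977, a precision of about 0.958, and a recall of 0.920, matching the column and row proportions shown in the TRUE cells of the table.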

Estimating the metric

Discover the minimal threshold vote value that would have predicted the same bug-covering questions.

Mean threshold vote of the questions categorized as bug covering:

## [1] 9.416667

Minimal threshold vote of the questions categorized as bug covering:

## [1] 6
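The two values above could be computed along these lines. This is only a sketch: `votes` and `predicted` are hypothetical stand-ins for the threshold vote column and the knn.cv output, and the numbers are illustrative, not the study data.

```r
# Hypothetical stand-ins: `votes` for the threshold vote column and
# `predicted` for the knn.cv output; the values are illustrative only.
votes     <- c(1, 2, 0, 6, 9, 3, 12, 7, 2, 10)
predicted <- c(FALSE, FALSE, FALSE, TRUE, TRUE, FALSE, TRUE, TRUE, FALSE, TRUE)

# Keep only the questions that the model categorized as bug-covering.
votesOfPredicted <- votes[predicted]

meanVote <- mean(votesOfPredicted)  # average threshold vote among them
minVote  <- min(votesOfPredicted)   # smallest threshold vote among them
print(c(mean = meanVote, min = minVote))
```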

Plot metric distribution

From the distribution of threshold vote values, we can see that the threshold vote has to be greater than or equal to 6 (six) in order to predict bug-covering questions.
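One way to check such a cutoff is to apply it as a simple rule and compare the outcome with the model's predictions. The sketch below assumes hypothetical vectors `votes` and `knnPredicted` with illustrative values; it does not reproduce the study data.

```r
# Hypothetical stand-ins with illustrative values only.
votes        <- c(1, 2, 0, 6, 9, 3, 12, 7, 2, 10)
knnPredicted <- c(FALSE, FALSE, FALSE, TRUE, TRUE, FALSE, TRUE, TRUE, FALSE, TRUE)

# Smallest vote among the questions the model categorized as bug-covering.
threshold <- min(votes[knnPredicted])

# Rule-based prediction: bug-covering when the vote reaches the threshold.
rulePredicted <- votes >= threshold

# Cross-tabulate the rule against the model and count disagreements.
agreement     <- table(rule = rulePredicted, knn = knnPredicted)
disagreements <- sum(rulePredicted != knnPredicted)
print(agreement)
print(disagreements)
```

A disagreement count of zero means the threshold rule selects exactly the questions the model selected. A nonzero count would point to questions at or above the threshold that the model did not categorize as bug-covering.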