K-Nearest Neighbor - Predicting Bug Covering by Majority Voting

The goal of the study

Evaluate the prediction based on the difference between the number of YES votes and the number of NO votes. For a more detailed explanation, please see the previous analysis.

The study has two goals:

Train a machine learning algorithm that predicts whether a code fragment is related to a failure or not. For that, I originally devised different metrics. The metric that I explore in this study is the majority vote between YES and NO answers.
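As a minimal sketch of the metric, the majority vote for a question can be computed as the number of YES answers minus the number of NO answers. The `answers` data frame, its column names, and the helper `majorityVote` are hypothetical and only illustrate the calculation; they are not taken from the study's data set.

```r
# Hypothetical answers: one row per answer given to a question.
answers <- data.frame(
  questionID = c(1, 1, 1, 2, 2, 2, 2),
  answer     = c("YES", "YES", "NO", "NO", "NO", "YES", "NO"),
  stringsAsFactors = FALSE
)

# Majority vote = (number of YES answers) - (number of NO answers).
majorityVote <- function(ans) sum(ans == "YES") - sum(ans == "NO")

# One metric value per question.
votes <- tapply(answers$answer, answers$questionID, majorityVote)
print(votes)
```

A positive value means that YES answers outnumber NO answers for that question; a negative value means the opposite.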

Building the model

I chose knn.cv (k-nearest neighbor with cross-validation) to reduce the risk of a lucky split between the training and testing sets.

Cross-validation was performed with the leave-one-out scheme.

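# build model: k-NN with leave-one-out cross-validation (knn.cv comes from the class package)
library(class)
fitModel.cv <- knn.cv(trainingData, trainingData$bugCovering, k = 3, l = 0, prob = FALSE, use.all = TRUE)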

I also ran the model with different values of k (3, 5, 7, 9), which produced similar results.
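The comparison across values of k could be sketched as follows. This is an illustration on synthetic data, not the original experiment: `toyData`, `toyLabels`, and the generated values are assumptions. The sketch passes only the feature column to `knn.cv`, since the text does not say which columns `trainingData` contains.

```r
library(class)  # provides knn.cv

set.seed(42)
# Synthetic stand-in for the real data: one numeric feature and a two-level label.
toyData   <- data.frame(majorityVote = c(rnorm(30, mean = -3), rnorm(30, mean = 3)))
toyLabels <- factor(rep(c(FALSE, TRUE), each = 30))

# Leave-one-out accuracy for each candidate k.
kValues <- c(3, 5, 7, 9)
accuracy <- sapply(kValues, function(k) {
  predicted <- knn.cv(toyData, toyLabels, k = k)
  mean(predicted == toyLabels)
})
names(accuracy) <- kValues
print(accuracy)
```

Odd values of k avoid tied votes when there are only two classes, which is one reason to compare 3, 5, 7, and 9 rather than even values.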

Testing the model

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  129 
## 
##  
##                          | fitModel.cv.df[, 1] 
## trainingData$bugCovering |     FALSE |      TRUE | Row Total | 
## -------------------------|-----------|-----------|-----------|
##                    FALSE |        97 |         7 |       104 | 
##                          |     0.933 |     0.067 |     0.806 | 
##                          |     0.915 |     0.304 |           | 
##                          |     0.752 |     0.054 |           | 
## -------------------------|-----------|-----------|-----------|
##                     TRUE |         9 |        16 |        25 | 
##                          |     0.360 |     0.640 |     0.194 | 
##                          |     0.085 |     0.696 |           | 
##                          |     0.070 |     0.124 |           | 
## -------------------------|-----------|-----------|-----------|
##             Column Total |       106 |        23 |       129 | 
##                          |     0.822 |     0.178 |           | 
## -------------------------|-----------|-----------|-----------|
## 
##
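The usual summary measures can be derived from the counts in the cross table above, reading rows as actual values and columns as predicted values. The variable names below are my own; the four counts are copied from the table.

```r
# Counts from the cross table (rows = actual, columns = predicted).
tn <- 97; fp <- 7   # actual FALSE: predicted FALSE / predicted TRUE
fn <- 9;  tp <- 16  # actual TRUE:  predicted FALSE / predicted TRUE

total     <- tn + fp + fn + tp
accuracy  <- (tp + tn) / total   # share of all questions classified correctly
precision <- tp / (tp + fp)      # share of predicted TRUE that are actually TRUE
recall    <- tp / (tp + fn)      # share of actual TRUE that were predicted TRUE

print(round(c(accuracy = accuracy, precision = precision, recall = recall), 3))
```

The precision and recall computed this way correspond to the 0.696 column proportion and the 0.640 row proportion already shown in the TRUE/TRUE cell of the table.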

Estimating the metric

Find the minimal majority vote value that would have predicted the same bug-covering questions.

Mean majority vote of the questions categorized as bug covering:

## [1] 3.130435

Minimal majority vote of the questions categorized as bug covering:

## [1] -2
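A minimal sketch of how these two values could be obtained is shown below. The `questions` data frame and its contents are hypothetical; in the study the predictions would come from the cross-validated model. As a consistency check on the reported figures, the mean of 3.130435 equals 72/23, which agrees with the 23 questions predicted TRUE in the cross table.

```r
# Hypothetical data: metric value and cross-validated prediction per question.
questions <- data.frame(
  majorityVote = c(5, 3, -2, 4, -6, -4),
  predicted    = c(TRUE, TRUE, TRUE, TRUE, FALSE, FALSE)
)

# Keep only the questions that the model categorized as bug covering.
predictedCovering <- questions$majorityVote[questions$predicted]

meanVote <- mean(predictedCovering)  # average metric among predicted positives
minVote  <- min(predictedCovering)   # smallest metric still predicted positive
print(c(mean = meanVote, min = minVote))
```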

Plot metric distribution

From the distribution of majority vote values, we can see that the majority vote has to be greater than or equal to -2 (minus two) in order to predict bug-covering questions.