The goal of the study

Making sense of multiple answers to the same question is challenging when one needs to extract some kind of consensus. The challenges relate to the workers' different levels of skill, but also to their personal preferences.

In my experiment, I asked workers whether they believed that a code fragment was related to a unit test failure. Each question was asked to 20 different workers, who answered YES, NO, or I DON'T KNOW.

After receiving all the answers for each code fragment, I was left with the challenge of using these answers to predict the code fragments that are most probably related to the failure. I labeled as “bug covering” the questions about code fragments that turned out to be related to the unit test failure.

The study has two goals:
  • Train a machine learning algorithm that predicts whether a code fragment is related to a failure or not. For that, I originally devised different metrics. The metric that I will explore in the following study consists of ranking the questions by the number of YES answers received: questions that received the largest number of YES answers are assigned ranking level 1 (a sketch of how this ranking could be computed appears right after this list).
  • Compare the CLASS and CARET implementations of the k-nearest neighbors algorithm. Results showed that the CLASS implementation produced fewer false positives than CARET. Note that this might be the result of overfitting in my configuration of knn.cv from CLASS.
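
The ranking computation itself is not shown in this report; a minimal sketch of how such a ranking could be derived (the data frame answers and its yesCount column are hypothetical names, and I assume a dense ranking in which tied questions share a level) is:

    # Hypothetical sketch: derive the ranking metric from YES counts.
    # answers has one row per question; yesCount holds the number of YES answers it received.
    answers$rankingVote <- match(answers$yesCount,
                                 sort(unique(answers$yesCount), decreasing = TRUE));
    # Questions with the most YES answers get ranking level 1; ties share a level.
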
    Data preparation

    I need to guarantee that some examples (i.e., failing methods) do not dominate the training or testing sets. To do that, I need a close-to-equal proportion of examples in both sets, which I achieve by scrambling (randomly reordering) the data.

    set.seed(9850);
    g <- runif(nrow(summaryTable)); # generate one uniform random number per row
    summaryTable <- summaryTable[order(g),]; # reorder the rows by these random values
    
    #convert the rankingVote column to numeric
    summaryTable[,"rankingVote"] <- as.numeric(unlist(summaryTable[,"rankingVote"])); 
    
    #Select only the ranking as a feature to predict bugCovering
    trainingData <- summaryTable[,c("bugCovering","rankingVote")];
    trainingData$rankingVote <- as.numeric(trainingData$rankingVote);

    KNN from CLASS package

    I chose knn.cv (cross-validation) so I can minimize the risk of a lucky selection of training and testing sets.

    Cross-validation is performed by leaving one out; however, even when I set the partition to 70/30 I obtained the same results.

    I also ran with different values of k = 3, 5, 7, and 9, which produced the same results as well.
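
    The 70/30 variant mentioned above is not included in this report; a minimal sketch of what it could look like (the split code and variable names are assumptions, using plain knn from the same package on a random 70/30 split of trainingData, with rankingVote as the only feature) is:

    # Hypothetical 70/30 split of the already shuffled data (sketch, not the original code)
    trainIndex <- 1:floor(0.7 * nrow(trainingData));
    trainSet <- trainingData[trainIndex, ];
    testSet <- trainingData[-trainIndex, ];

    # Fit a plain kNN on the 70% and evaluate it on the held-out 30%
    predicted <- class::knn(train = trainSet["rankingVote"],
                            test = testSet["rankingVote"],
                            cl = as.factor(trainSet$bugCovering),
                            k = 3);
    table(actual = testSet$bugCovering, predicted = predicted);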

    #build model: leave-one-out cross-validation over the whole trainingData
    #(note that trainingData contains the two columns selected above: bugCovering and rankingVote)
    library(class); # provides knn.cv
    fitModel.cv <- knn.cv(trainingData, trainingData$bugCovering, k=3, l=0, prob = FALSE, use.all=TRUE);

    Testing the model

    library(gmodels); # provides CrossTable
    fitModel.cv.df <- data.frame(fitModel.cv);
    CrossTable(x = trainingData$bugCovering, y = fitModel.cv.df[,1], prop.chisq = FALSE)
    ## 
    ##  
    ##    Cell Contents
    ## |-------------------------|
    ## |                       N |
    ## |           N / Row Total |
    ## |           N / Col Total |
    ## |         N / Table Total |
    ## |-------------------------|
    ## 
    ##  
    ## Total Observations in Table:  129 
    ## 
    ##  
    ##                          | fitModel.cv.df[, 1] 
    ## trainingData$bugCovering |     FALSE |      TRUE | Row Total | 
    ## -------------------------|-----------|-----------|-----------|
    ##                    FALSE |       102 |         2 |       104 | 
    ##                          |     0.981 |     0.019 |     0.806 | 
    ##                          |     0.944 |     0.095 |           | 
    ##                          |     0.791 |     0.016 |           | 
    ## -------------------------|-----------|-----------|-----------|
    ##                     TRUE |         6 |        19 |        25 | 
    ##                          |     0.240 |     0.760 |     0.194 | 
    ##                          |     0.056 |     0.905 |           | 
    ##                          |     0.047 |     0.147 |           | 
    ## -------------------------|-----------|-----------|-----------|
    ##             Column Total |       108 |        21 |       129 | 
    ##                          |     0.837 |     0.163 |           | 
    ## -------------------------|-----------|-----------|-----------|
    ## 
    ## 

    Estimating the metric

    Discover the minimal ranking value that would have predicted the same bug-covering questions

    Mean ranking vote of the questions categorized as bug covering:

    ## [1] 1.714286

    Top ranking vote of the questions categorized as bug covering:

    ## [1] 1

    Lowest ranking vote of the questions still categorized as bug covering:

    ## [1] 3
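
    The chunk that produced the three values above is not shown; a minimal reconstruction of it (the name predictedList is taken from the plotting chunk below; the rest is an assumption) could be:

    # Ranking votes of the questions that knn.cv classified as bug covering
    predictedList <- trainingData$rankingVote[fitModel.cv == "TRUE"];

    mean(predictedList); # mean ranking vote of the predicted bug-covering questions
    min(predictedList);  # top (best) ranking vote
    max(predictedList);  # lowest ranking vote still predicted as bug covering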

    Plot metric distribution

    library(ggplot2);

    predictedList.df <- data.frame(predictedList);
    colnames(predictedList.df) <- c("votes");
    
    ggplot(data=predictedList.df, aes(x=votes)) +
      geom_histogram(binwidth = 0.5, alpha=.5, position="identity")+
      geom_vline(aes(xintercept=mean(votes, na.rm=T)),   # Ignore NA values for mean
                 color="red", linetype="dashed", size=1) +
      ggtitle("Ranking of questions predicted as bug covering")+
      labs(x="Ranking of YES votes of the questions categorized as bug-covering (lowest ranking=3, mean=1.71)", 
           y="Frequency");

    Why not use k=1?

    Last but not least, I also tried k=1, but the results seemed to overfit the training set. The reason is that for k=1 the algorithm bases its estimate on a single data point, the closest neighbor, which makes it very sensitive to distortions such as outliers, noise, or mislabeled data. Moreover, since I am using leave-one-out cross-validation, which trains and tests on the same dataset, there is an even higher risk of effectively comparing a test data point against points identical to it. Therefore, I decided to use odd values of k of 3 or larger, which also avoids ties.

    For a more detailed discussion of the k=1 case, refer to https://stats.stackexchange.com/questions/107870/does-k-nn-with-k-1-always-implies-overfitting#107913
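
    For completeness, the k=1 run referred to above could be reproduced with the same chunk, only changing k (a sketch, assuming the same trainingData as before):

    # Same leave-one-out kNN, but with k = 1, to inspect the overfitting concern
    fitModel.k1 <- knn.cv(trainingData, trainingData$bugCovering, k = 1, use.all = TRUE);
    table(actual = trainingData$bugCovering, predicted = fitModel.k1);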

    KNN from CARET package

    https://cran.r-project.org/web/packages/caret/vignettes/caret.pdf

    https://dataaspirant.com/2017/01/09/knn-implementation-r-using-caret-package/

    I will do 5 repeats of 10-fold cross-validation and fit a KNN model that evaluates 10 values of k.

    ## [1] 5.248062
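
    The caret training call itself is not shown in this report; a minimal sketch that would be consistent with the printed model summary below (the control object name is an assumption; knn_fit and trainingData come from the earlier chunks) is:

    library(caret);

    # 10-fold cross-validation repeated 5 times
    trctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 5);

    # Fit kNN on the single rankingVote predictor; tuneLength = 10 evaluates 10 values of k,
    # and the predictor is centered and scaled, matching the pre-processing shown in the summary.
    # bugCovering is assumed to already be a factor with levels FALSE/TRUE.
    knn_fit <- train(bugCovering ~ rankingVote,
                     data = trainingData,
                     method = "knn",
                     trControl = trctrl,
                     preProcess = c("center", "scale"),
                     tuneLength = 10);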

    Model selection

    knn_fit
    ## k-Nearest Neighbors 
    ## 
    ## 129 samples
    ##   1 predictor
    ##   2 classes: 'FALSE', 'TRUE' 
    ## 
    ## Pre-processing: centered (1), scaled (1) 
    ## Resampling: Cross-Validated (10 fold, repeated 5 times) 
    ## Summary of sample sizes: 116, 117, 116, 116, 116, 115, ... 
    ## Resampling results across tuning parameters:
    ## 
    ##   k   Accuracy   Kappa    
    ##    5  0.8731685  0.5275465
    ##    7  0.8717399  0.5196293
    ##    9  0.8760256  0.5339682
    ##   11  0.8774359  0.5200817
    ##   13  0.8760256  0.5382455
    ##   15  0.8760256  0.5382455
    ##   17  0.8760256  0.5382455
    ##   19  0.8760256  0.5382455
    ##   21  0.8760256  0.5382455
    ##   23  0.8760256  0.5382455
    ## 
    ## Accuracy was used to select the optimal model using  the largest value.
    ## The final value used for the model was k = 11.

    Confusion matrix using the selected model

    bugCoveringPredicted <- predict(knn_fit, newdata = trainingData);
    
    confusionMatrix(data=bugCoveringPredicted, reference=trainingData$bugCovering, mode="prec_recall", positive="TRUE")
    ## Confusion Matrix and Statistics
    ## 
    ##           Reference
    ## Prediction FALSE TRUE
    ##      FALSE    98   10
    ##      TRUE      6   15
    ##                                           
    ##                Accuracy : 0.876           
    ##                  95% CI : (0.8064, 0.9274)
    ##     No Information Rate : 0.8062          
    ##     P-Value [Acc > NIR] : 0.02467         
    ##                                           
    ##                   Kappa : 0.5774          
    ##  Mcnemar's Test P-Value : 0.45325         
    ##                                           
    ##               Precision : 0.7143          
    ##                  Recall : 0.6000          
    ##                      F1 : 0.6522          
    ##              Prevalence : 0.1938          
    ##          Detection Rate : 0.1163          
    ##    Detection Prevalence : 0.1628          
    ##       Balanced Accuracy : 0.7712          
    ##                                           
    ##        'Positive' Class : TRUE            
    ## 

    Compute the minimal ranking value that corresponded to the predicted bug-covering questions

    ## [1] 1.52381
    ## [1] 2
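
    As in the CLASS section, the chunk that computed these two values is hidden; it could be reconstructed along the same lines (bugCoveringPredicted comes from the previous chunk; the name predictedListCaret is an assumption):

    # Ranking votes of the questions that the caret model classified as bug covering
    predictedListCaret <- trainingData$rankingVote[bugCoveringPredicted == "TRUE"];

    mean(predictedListCaret); # mean ranking vote of the caret-predicted bug-covering questions
    max(predictedListCaret);  # lowest ranking vote still predicted as bug covering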

    By the distribution of ranking outcomes, we can note that the cut-off for the Ranking-vote metric has to be at least 3 (i.e., include questions ranked 1 through 3) in order to predict the bug-covering questions.
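
    A direct way to apply this derived cut-off to the data would be the following sketch (the column predictedBugCovering is a hypothetical name; summaryTable comes from the data-preparation chunk):

    # Flag the questions within the top three ranking levels as predicted bug covering
    summaryTable$predictedBugCovering <- summaryTable$rankingVote <= 3;
    table(actual = summaryTable$bugCovering, predicted = summaryTable$predictedBugCovering);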

    Conclusions

    Caret produced more false positives than knn.cv from the CLASS package. I tried to fine-tune it further, but that was not enough. The k values that Caret favors also seem very high (15 to 23). However, these differences might be due to my configuration of knn.cv overfitting the data.