RPubs link: https://rpubs.com/oggyluky11/691582

1 Instruction

Use the dataset you used for HW-1 (Blue/Black)

  A. Run Bagging (ipred package)

– sample with replacement

– estimate metrics for a model

– repeat as many times as specified and report the average

  B. Run LOOCV (jackknife) for the same dataset

– iterate over all points

– keep one observation as test

– train using the rest of the observations

– determine test metrics

– aggregate the test metrics

end of loop

find the average of the test metric(s)

Compare (A), (B) above with the results you obtained in HW-1 and write 3 sentences explaining the observed difference.
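
The bagging steps listed above (sample with replacement, estimate a metric, repeat and average) can be sketched by hand. This is a minimal illustration, not the code used in this report: the `toy` data frame is synthetic and only mimics the column layout of the homework data, the base learner is assumed to be an rpart tree, and each round is scored on its out-of-bag rows. Note that this loop averages per-round metrics, as the instruction describes, whereas ipred's bagging also aggregates the trees' votes.

```r
library(rpart)

set.seed(1)
n <- 30
# Synthetic stand-in for the homework data (hypothetical values)
toy <- data.frame(
  X = round(runif(n, 0, 70)),
  Y = factor(rep(c("a", "b", "c"), length.out = n))
)
toy$label <- factor(ifelse(toy$X > 35, "BLUE", "BLACK"),
                    levels = c("BLACK", "BLUE"))

n_rounds <- 100
oob_acc <- rep(NA_real_, n_rounds)
for (b in seq_len(n_rounds)) {
  idx <- sample(n, n, replace = TRUE)   # bootstrap sample (with replacement)
  oob <- setdiff(seq_len(n), idx)       # rows not drawn in this round
  if (length(oob) == 0) next            # nothing to score on this round
  fit <- rpart(label ~ X + Y, data = toy[idx, ], method = "class",
               control = rpart.control(minsplit = 2, cp = 0))
  pred <- predict(fit, newdata = toy[oob, ], type = "class")
  oob_acc[b] <- mean(as.character(pred) == as.character(toy$label[oob]))
}
mean_acc <- mean(oob_acc, na.rm = TRUE) # average metric over all rounds
mean_acc
```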

2 Data Set

The dataset for homework 1 is used in this test.

## 
## -- Column specification ------------------
## cols(
##   X = col_double(),
##   Y = col_character(),
##   label = col_character()
## )

4 Modeling with Bagging

4.1 Train Model

The ipred package is required for this test. The number of bags is set to 100, as demonstrated in the course materials.
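
A minimal sketch of such a fit is shown here for orientation. It assumes a synthetic `toy` data frame with the same three columns as the homework data and an arbitrary 30/10 train/test split; neither is taken from this report. `nbagg = 100` matches the setting described above, and `coob = TRUE` is an optional extra that asks ipred for an out-of-bag error estimate.

```r
library(ipred)

set.seed(622)
n <- 40
# Synthetic stand-in for the homework data (hypothetical values)
toy <- data.frame(
  X = round(runif(n, 0, 70)),
  Y = factor(rep(c("a", "b", "c", "d"), length.out = n))
)
toy$label <- factor(ifelse(toy$X > 35, "BLUE", "BLACK"),
                    levels = c("BLACK", "BLUE"))

train_idx <- sample(n, 30)              # arbitrary split for illustration
train <- toy[train_idx, ]
test  <- toy[-train_idx, ]

# Bagged classification trees with 100 bootstrap replicates
fit <- bagging(label ~ X + Y, data = train, nbagg = 100, coob = TRUE)

pred <- predict(fit, newdata = test, type = "class")
test_acc <- mean(as.character(pred) == as.character(test$label))
c(oob_error = fit$err, test_accuracy = test_acc)
```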

4.2 Accuracy on Prediction

4.2.1 Make Prediction

4.3 Confusion Matrix and Statistics for Testing Set

## Confusion Matrix and Statistics
## 
##        
##         BLACK BLUE
##   BLACK     5    0
##   BLUE      1    2
##                                           
##                Accuracy : 0.875           
##                  95% CI : (0.4735, 0.9968)
##     No Information Rate : 0.75            
##     P-Value [Acc > NIR] : 0.3671          
##                                           
##                   Kappa : 0.7143          
##                                           
##  Mcnemar's Test P-Value : 1.0000          
##                                           
##             Sensitivity : 0.8333          
##             Specificity : 1.0000          
##          Pos Pred Value : 1.0000          
##          Neg Pred Value : 0.6667          
##              Prevalence : 0.7500          
##          Detection Rate : 0.6250          
##    Detection Prevalence : 0.6250          
##       Balanced Accuracy : 0.9167          
##                                           
##        'Positive' Class : BLACK           
## 
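
The headline statistics above can be re-derived from the four cell counts. The sketch below assumes that rows are predictions and columns are reference labels (the usual caret convention, and consistent with the reported prevalence of 0.75), and it uses base R only.

```r
# Cell counts copied from the table above.
# Assumption: rows = predicted class, columns = reference class.
cm <- matrix(c(5, 1, 0, 2), nrow = 2,
             dimnames = list(pred = c("BLACK", "BLUE"),
                             ref  = c("BLACK", "BLUE")))
n <- sum(cm)

accuracy    <- sum(diag(cm)) / n                          # (5 + 2) / 8
sensitivity <- cm["BLACK", "BLACK"] / sum(cm[, "BLACK"])  # 5 / 6
specificity <- cm["BLUE", "BLUE"] / sum(cm[, "BLUE"])     # 2 / 2

# Cohen's kappa: observed agreement corrected for chance agreement
expected    <- sum(rowSums(cm) * colSums(cm)) / n^2       # 36 / 64
cohen_kappa <- (accuracy - expected) / (1 - expected)

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, kappa = cohen_kappa), 4)
```

With only 8 test observations, a single misclassification moves the accuracy by 0.125, which is worth keeping in mind when comparing models.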

5 Modeling with LOOCV

5.2 Make Prediction

## `summarise()` regrouping output by 'X', 'Y', 'label' (override with `.groups` argument)
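
For reference, the LOOCV procedure from the instructions (hold out one observation, train on the rest, record the test result, then average) can be sketched as a plain loop. This is an illustration under stated assumptions rather than the code behind this report: the `toy` data frame is synthetic, two labels are flipped by hand to add noise, and the rpart control settings are chosen only so that the tree can split on a small sample.

```r
library(rpart)

set.seed(2)
n <- 30
# Synthetic stand-in for the homework data (hypothetical values)
toy <- data.frame(
  X = round(runif(n, 0, 70)),
  Y = factor(rep(c("a", "b", "c"), length.out = n))
)
lab <- ifelse(toy$X > 35, "BLUE", "BLACK")
flip <- c(3, 17)                         # add a little label noise
lab[flip] <- ifelse(lab[flip] == "BLUE", "BLACK", "BLUE")
toy$label <- factor(lab, levels = c("BLACK", "BLUE"))

hits <- logical(n)
for (i in seq_len(n)) {
  train    <- toy[-i, ]                  # all observations except the i-th
  held_out <- toy[i, , drop = FALSE]     # the single test observation
  fit <- rpart(label ~ X + Y, data = train, method = "class",
               control = rpart.control(minsplit = 2, cp = 0))
  pred <- predict(fit, newdata = held_out, type = "class")
  hits[i] <- as.character(pred) == as.character(held_out$label)
}
loocv_accuracy <- mean(hits)             # average of the per-point results
loocv_accuracy
```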

5.3 Confusion Matrix and Statistics for Testing Set

## Confusion Matrix and Statistics
## 
##        
##         BLACK BLUE
##   BLACK     3    2
##   BLUE      1    2
##                                           
##                Accuracy : 0.625           
##                  95% CI : (0.2449, 0.9148)
##     No Information Rate : 0.5             
##     P-Value [Acc > NIR] : 0.3633          
##                                           
##                   Kappa : 0.25            
##                                           
##  Mcnemar's Test P-Value : 1.0000          
##                                           
##             Sensitivity : 0.7500          
##             Specificity : 0.5000          
##          Pos Pred Value : 0.6000          
##          Neg Pred Value : 0.6667          
##              Prevalence : 0.5000          
##          Detection Rate : 0.3750          
##    Detection Prevalence : 0.6250          
##       Balanced Accuracy : 0.6250          
##                                           
##        'Positive' Class : BLACK           
## 
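
The confidence interval and the `P-Value [Acc > NIR]` line above appear to come from an exact binomial test on the number of correct predictions. Assuming that is the case, they can be checked in base R with `binom.test`, using the counts from the table (5 correct out of 8, no-information rate 0.5).

```r
correct <- 3 + 2   # diagonal of the confusion matrix above
total   <- 8       # number of test observations
nir     <- 0.5     # no-information rate reported above

# Exact (Clopper-Pearson) 95% interval for the accuracy
ci <- binom.test(correct, total)$conf.int

# One-sided test: is the accuracy greater than the no-information rate?
p_val <- binom.test(correct, total, p = nir,
                    alternative = "greater")$p.value

round(c(lower = ci[1], upper = ci[2], p_value = p_val), 4)
```

The interval is wide because it rests on 8 observations, so it overlaps heavily with the interval reported for the Bagging model.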

6 Performance Comparison with Result in HW1

In Homework 1, the model with the best capacity to learn was KNN with K = 3 (accuracy 1.00 on the training set but 0.625 on the testing set), and the model that generalized best was Naive Bayes (accuracy 0.875 on the testing set). See Home_Work_1_Submission for reference.

6.1 Summary

  1. With the same base model (a decision tree fitted with the rpart engine), Bagging performs better than LOOCV (test accuracy 0.875 vs. 0.625). This may be because Bagging averages many trees fitted on bootstrap samples, which mainly lowers variance on a small data set.
  2. Bagging matches Naive Bayes (test accuracy 0.875 for both). This result may change with a different data set, but it also hints that on small data, Naive Bayes can perform as well as more complex models.
  3. LOOCV (accuracy 0.625) is worse than KNN with K = 3 on training accuracy (1.00) and only equal to it on testing accuracy (0.625). This may be because LOOCV, unlike Bagging, does not aggregate models fitted on bootstrap resamples, so every fit is still strictly limited by the sample size, while KNN appears to be less affected by a small data set.