RPubs link: https://rpubs.com/oggyluky11/691582

1 Instruction

Use the dataset you used for HW-1 (Blue/Black)

Run Bagging (ipred package)

– sample with replacement

– estimate metrics for a model

– repeat as many times as specified and report the average

Run LOOCV (jackknife) for the same dataset

– iterate over all points

– keep one observation as test

– train using the rest of the observations

– determine test metrics

– aggregate the test metrics

end of loop

find the average of the test metric(s)

Compare (A), (B) above with the results you obtained in HW-1 and write 3 sentences explaining the observed difference.
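The bagging steps listed above (sample with replacement, estimate a metric, repeat, average) can be sketched by hand as follows. This is only an illustrative sketch under stated assumptions: it uses a two-class subset of the built-in iris data as a stand-in for the HW-1 dataset, a plain rpart tree as the model, 25 repetitions, and the out-of-bag rows as the test set for each repetition. None of these choices come from the assignment itself.

```r
library(rpart)

set.seed(123)

# Toy two-class data (assumed stand-in for the HW-1 dataset)
toy <- iris[iris$Species != "setosa", ]
toy$Species <- droplevels(toy$Species)

n_rounds <- 25                 # number of repetitions (assumed value)
acc <- numeric(n_rounds)

for (b in seq_len(n_rounds)) {
  # sample row indices with replacement (bootstrap sample)
  idx <- sample(nrow(toy), replace = TRUE)
  # rows that were never drawn serve as the test set for this round
  oob <- setdiff(seq_len(nrow(toy)), idx)

  fit  <- rpart(Species ~ ., data = toy[idx, ], method = "class")
  pred <- predict(fit, toy[oob, ], type = "class")

  # metric for this round: accuracy on the out-of-bag rows
  acc[b] <- mean(as.character(pred) == as.character(toy$Species[oob]))
}

# report the average metric over all repetitions
mean_acc <- mean(acc)
mean_acc
```

The ipred package wraps the same resampling idea and additionally combines the fitted trees into a single predictor.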

library(tidyverse)
library(tidymodels)
library(ipred)
library(caret)

2 Data Set

The dataset for homework 1 is used in this test.

data <- read_csv('https://raw.githubusercontent.com/oggyluky11/DATA622-FALL-2020/main/HW1/data.csv') %>%
  mutate_if(is.character,as.factor)

## 
## -- Column specification ------------------
## cols(
##   X = col_double(),
##   Y = col_character(),
##   label = col_character()
## )

data

3 Train-Test-Split

Separate the data set into a training set (75%) and a testing set (25%), stratified by label.

set.seed(123)
train_test_split <- initial_split(data, prop = 0.75, strata = 'label')
data_train <- training(train_test_split)
data_test <- testing(train_test_split)

4 Modeling with Bagging

4.1 Train Model

The ipred package is required for this test. The number of bags is set to 100, as demonstrated in the course materials.

bagging_model <- bagging(formula = label ~ .,
                         data = data_train, 
                         nbagg = 100)
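Because each tree is fitted to a bootstrap sample, the rows left out of that sample can be used to estimate the error without touching the testing set. The sketch below shows this with the `coob = TRUE` argument of `ipred::bagging()`; it runs on a two-class subset of the built-in iris data, which is an assumed stand-in and not the HW-1 dataset.

```r
library(ipred)

set.seed(123)

# Toy two-class data (assumed stand-in for the HW-1 dataset)
toy <- iris[iris$Species != "setosa", ]
toy$Species <- droplevels(toy$Species)

# coob = TRUE asks ipred to compute the out-of-bag misclassification error
bag_oob <- bagging(Species ~ ., data = toy, nbagg = 100, coob = TRUE)

# err holds the out-of-bag misclassification rate; convert it to accuracy
oob_accuracy <- 1 - bag_oob$err
oob_accuracy
```

On a data set as small as the one used here, an out-of-bag estimate may be a useful complement to an 8-row testing set, although it was not part of the original analysis.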

4.2 Accuracy on Prediction

4.2.1 Make Prediction

bagging_model_pred <- predict(bagging_model, data_test)

data_test_pred <- data_test %>% 
  cbind(prediction = bagging_model_pred)

data_test_pred

4.3 Confusion Matrix and Statistics for Testing Set

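# Note: actual labels are the table rows here, but confusionMatrix() reads rows as predictions
# and columns as the reference, so class-wise statistics have the roles swapped (accuracy is unaffected).
bagging_cfm <- table(data_test_pred$label,data_test_pred$prediction) %>%
  confusionMatrix()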

bagging_cfm

## Confusion Matrix and Statistics
## 
##        
##         BLACK BLUE
##   BLACK     5    0
##   BLUE      1    2
##                                           
##                Accuracy : 0.875           
##                  95% CI : (0.4735, 0.9968)
##     No Information Rate : 0.75            
##     P-Value [Acc > NIR] : 0.3671          
##                                           
##                   Kappa : 0.7143          
##                                           
##  Mcnemar's Test P-Value : 1.0000          
##                                           
##             Sensitivity : 0.8333          
##             Specificity : 1.0000          
##          Pos Pred Value : 1.0000          
##          Neg Pred Value : 0.6667          
##              Prevalence : 0.7500          
##          Detection Rate : 0.6250          
##    Detection Prevalence : 0.6250          
##       Balanced Accuracy : 0.9167          
##                                           
##        'Positive' Class : BLACK           
##
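The accuracy and kappa reported above can be reproduced by hand from the four cell counts. The sketch below re-enters the printed matrix manually (the object names `cm`, `accuracy`, `expected` and `kappa_stat` are introduced here for illustration only) and applies the usual definitions: accuracy is the diagonal share, and kappa compares it with the agreement expected from the row and column totals alone.

```r
# Cell counts copied from the bagging confusion matrix printed above
cm <- matrix(c(5, 1, 0, 2), nrow = 2,
             dimnames = list(c("BLACK", "BLUE"), c("BLACK", "BLUE")))

# Accuracy: share of observations on the diagonal, (5 + 2) / 8
accuracy <- sum(diag(cm)) / sum(cm)

# Agreement expected by chance from the margins: (5 * 6 + 3 * 2) / 8^2
expected <- sum(rowSums(cm) * colSums(cm)) / sum(cm)^2

# Cohen's kappa: observed agreement corrected for chance agreement
kappa_stat <- (accuracy - expected) / (1 - expected)

round(c(accuracy = accuracy, kappa = kappa_stat), 4)
```

With only 8 testing observations, a single misclassified point moves the accuracy by 0.125, which is why the 95% confidence interval in the output is so wide.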

5 Modeling with LOOCV

5.1 Train Model

model_loocv_train <- do.call('rbind', lapply(1:nrow(data_train), FUN = function(row_id, data = data_train){
  
  # fit the base model on the training data with the current row held out
  dt_model <- decision_tree() %>%
    set_engine('rpart') %>%
    set_mode('classification') %>%
    fit(label ~ ., data = data[-row_id, ])
  
  list(fold = row_id, model = dt_model, actual = data[row_id, 3])
  
})) %>% data.frame()

5.2 Make Prediction

model_loocv_pred <- do.call('rbind', lapply(model_loocv_train$model, FUN = function(m, data = data_test){
  
  # make prediction for individual models
  pred <- predict(m, data) %>%
    cbind(data) %>%
    rename(prediction = '.pred_class') %>%
    select(-prediction, prediction)
})) %>%
  #  make final prediction using 'Voting'
  mutate(label_fctr = ifelse(label == 'BLUE', 0, 1),
         prediction_fctr = ifelse(prediction == 'BLUE',0,1)) %>%
  group_by(X,Y, label, label_fctr) %>%
  summarise(prob = sum(prediction_fctr)/n()) %>%
  mutate(prediction = ifelse(prob > 0.5, 'BLACK', 'BLUE')) %>%
  select(X, Y, label, prediction)

## `summarise()` regrouping output by 'X', 'Y', 'label' (override with `.groups` argument)

model_loocv_pred

5.3 Confusion Matrix and Statistics for Testing Set

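# Note: actual labels are the table rows here, but confusionMatrix() reads rows as predictions
# and columns as the reference, so class-wise statistics have the roles swapped (accuracy is unaffected).
loocv_cfm <- table(model_loocv_pred$label,model_loocv_pred$prediction) %>%
  confusionMatrix()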

loocv_cfm

## Confusion Matrix and Statistics
## 
##        
##         BLACK BLUE
##   BLACK     3    2
##   BLUE      1    2
##                                           
##                Accuracy : 0.625           
##                  95% CI : (0.2449, 0.9148)
##     No Information Rate : 0.5             
##     P-Value [Acc > NIR] : 0.3633          
##                                           
##                   Kappa : 0.25            
##                                           
##  Mcnemar's Test P-Value : 1.0000          
##                                           
##             Sensitivity : 0.7500          
##             Specificity : 0.5000          
##          Pos Pred Value : 0.6000          
##          Neg Pred Value : 0.6667          
##              Prevalence : 0.5000          
##          Detection Rate : 0.3750          
##    Detection Prevalence : 0.6250          
##       Balanced Accuracy : 0.6250          
##                                           
##        'Positive' Class : BLACK           
##
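The approach in this section fits one tree per left-out training row and then lets those trees vote on the testing set. For comparison, the loop described in the instructions (keep one observation as test, train on the rest, aggregate the test metrics) is usually implemented by scoring each model on its own held-out row. The sketch below illustrates that variant under stated assumptions: it uses a two-class subset of the built-in iris data instead of the HW-1 dataset and calls rpart directly instead of going through parsnip.

```r
library(rpart)

# Toy two-class data (assumed stand-in for the HW-1 dataset)
toy <- iris[iris$Species != "setosa", ]
toy$Species <- droplevels(toy$Species)

# Iterate over all points: hold out row i, train on the rest, test on row i
hits <- vapply(seq_len(nrow(toy)), function(i) {
  fit  <- rpart(Species ~ ., data = toy[-i, ], method = "class")
  pred <- predict(fit, toy[i, , drop = FALSE], type = "class")
  # test metric for this fold: was the held-out row classified correctly?
  as.character(pred) == as.character(toy$Species[i])
}, logical(1))

# Aggregate: the LOOCV accuracy is the average over all held-out rows
loocv_accuracy <- mean(hits)
loocv_accuracy
```

In this form LOOCV is an estimate of how well a single model generalizes, not an ensemble, so its result is not directly comparable with the voted predictions reported above.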

6 Performance Comparison with Result in HW1

In Homework 1, the model with the best capacity to learn was KNN with K = 3 (accuracy 1.00 on the training set but 0.625 on the testing set), and the model that generalized best was Naive Bayes (accuracy 0.875 on the testing set). See Home_Work_1_Submission for reference.

test_acc_knn3 <- 0.625
test_acc_nb <- 0.875

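# Build the table with data.frame() so the accuracies stay numeric
# (cbind() of names and numbers would coerce every value to character)
data.frame(Model = c('KNN_3',
                     'NAIVE BAYES',
                     'BAGGING',
                     'LOOCV'),
           Accuracy_on_Testing_Set = c(test_acc_knn3,
                                       test_acc_nb,
                                       bagging_cfm$overall[['Accuracy']],
                                       loocv_cfm$overall[['Accuracy']])) %>%
  arrange(desc(Accuracy_on_Testing_Set))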

6.1 Summary

With the same base model (a decision tree with the rpart engine), Bagging performs better than LOOCV on the testing set (0.875 vs. 0.625); this may be because aggregating trees fitted to bootstrap samples lowers the variance of the prediction, which matters most on a small data set.
Bagging performs the same as Naive Bayes (both 0.875); this result may change with a different data set, but it also hints that on small data Naive Bayes can perform well compared with more complex models.
LOOCV does not improve on KNN_3 (both reach 0.625 on the testing set); this may be because LOOCV does not perform bootstrap re-sampling, so its models are fitted to nearly identical data and remain strictly limited by the sample size, while KNN is less affected by a small data set.

DATA622 TEST 1