DATA622 TEST 1
RPubs link: https://rpubs.com/oggyluky11/691582
1 Instruction
Use the dataset you used for HW-1 (Blue/Black)
- (A) Run Bagging (ipred package)
  – sample with replacement
  – estimate metrics for a model
  – repeat as many times as specified and report the average
- (B) Run LOOCV (jackknife) for the same dataset
  – iterate over all points
    – keep one observation as test
    – train using the rest of the observations
    – determine test metrics
    – aggregate the test metrics
  – end of loop
  – find the average of the test metric(s)
Compare (A), (B) above with the results you obtained in HW-1 and write 3 sentences explaining the
observed difference.
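The resampling loop described in (A) can be sketched by hand as follows. This is a minimal sketch and not the ipred implementation: the toy data frame `toy`, the helper `bootstrap_accuracy()`, the seed, and the choice of the out-of-bag rows as the evaluation set are all assumptions made for illustration.

```r
library(rpart)  # recommended package bundled with standard R installations

set.seed(622)  # hypothetical seed, only for reproducibility

# Hypothetical toy data standing in for the HW-1 dataset (same column names)
toy <- data.frame(
  X = runif(40),
  Y = factor(sample(c("a", "b", "c"), 40, replace = TRUE))
)
toy$label <- factor(ifelse(toy$X > 0.5, "BLUE", "BLACK"))

# One round: resample rows with replacement, fit a tree on the resample,
# and score it on the rows that were not drawn (the out-of-bag rows)
bootstrap_accuracy <- function(data) {
  idx <- sample(nrow(data), replace = TRUE)
  oob <- setdiff(seq_len(nrow(data)), idx)
  fit <- rpart(label ~ ., data = data[idx, ], method = "class")
  pred <- predict(fit, data[oob, ], type = "class")
  mean(as.character(pred) == as.character(data$label[oob]))
}

# Repeat the round many times and report the average metric
n_rounds <- 100
acc <- replicate(n_rounds, bootstrap_accuracy(toy))
mean_acc <- mean(acc)
```

Note that this averages a metric over resamples, as the instruction asks; it does not aggregate the trees' predictions into a single ensemble prediction, which is what `ipred::bagging()` does.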
2 Data Set
The dataset for homework 1 is used in this test.
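library(tidyverse)  # read_csv(), mutate_if(), %>%
data <- read_csv('https://raw.githubusercontent.com/oggyluky11/DATA622-FALL-2020/main/HW1/data.csv') %>%
  # convert the character columns (Y, label) to factors
  mutate_if(is.character, as.factor)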
##
## -- Column specification ------------------
## cols(
## X = col_double(),
## Y = col_character(),
## label = col_character()
## )
3 Train-Test-Split
Separate the data set into a training set (75%) and a testing set (25%).
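The code for this step is not shown in the report. One way to produce a 75/25 split in base R is sketched below; the helper `split_data()`, the toy data frame, and the seed are hypothetical and are not taken from the original analysis.

```r
set.seed(622)  # hypothetical seed, only to make the sketch reproducible

# Hypothetical stand-in with the same column names as the homework data
toy <- data.frame(
  X = rnorm(32),
  Y = factor(sample(letters[1:4], 32, replace = TRUE)),
  label = factor(sample(c("BLACK", "BLUE"), 32, replace = TRUE))
)

# Randomly assign a fraction of the rows to training and the rest to testing
split_data <- function(df, train_frac = 0.75) {
  n_train <- floor(train_frac * nrow(df))
  train_idx <- sample(seq_len(nrow(df)), size = n_train)
  list(train = df[train_idx, , drop = FALSE],
       test = df[-train_idx, , drop = FALSE])
}

parts <- split_data(toy)
toy_train <- parts$train
toy_test <- parts$test
```

Since the rest of the report uses tidymodels, `rsample::initial_split(data, prop = 0.75)` followed by `training()` and `testing()` would be an equally reasonable choice.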
4 Modeling with Bagging
4.1 Train Model
The ipred package is required for this test. The number of bags is set to 100, as demonstrated in the course materials.
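The training code is not shown in the report. A minimal sketch of fitting a bagged tree model with 100 bags is given below, assuming the ipred package is installed. The toy data, the seed, and the object names `toy_train`, `toy_test`, and `model_bag` are hypothetical; only `nbagg = 100` comes from the text above.

```r
library(ipred)

set.seed(622)  # hypothetical seed, only for reproducibility

# Hypothetical toy data with the same column names as the homework data
toy <- data.frame(
  X = runif(60),
  Y = factor(sample(c("a", "b", "c"), 60, replace = TRUE))
)
toy$label <- factor(ifelse(toy$X > 0.5, "BLUE", "BLACK"))

train_idx <- sample(nrow(toy), 45)
toy_train <- toy[train_idx, ]
toy_test <- toy[-train_idx, ]

# nbagg = 100 bootstrap replicates; coob = TRUE additionally requests
# the out-of-bag estimate of the misclassification error
model_bag <- bagging(label ~ ., data = toy_train, nbagg = 100, coob = TRUE)

# Predicted classes for the held-out rows and the resulting accuracy
pred <- predict(model_bag, newdata = toy_test)
test_acc <- mean(as.character(pred) == as.character(toy_test$label))
oob_err <- model_bag$err
```

The out-of-bag error is a convenient second check on a data set this small, because it does not depend on which rows happened to land in the testing set.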
4.2 Accuracy on Prediction
4.3 Confusion Matrix and Statistics for Testing Set
# Cross-tabulate actual labels (rows) against predictions (columns)
bagging_cfm <- table(data_test_pred$label, data_test_pred$prediction) %>%
  confusionMatrix()  # from the caret package
bagging_cfm
## Confusion Matrix and Statistics
##
##
## BLACK BLUE
## BLACK 5 0
## BLUE 1 2
##
## Accuracy : 0.875
## 95% CI : (0.4735, 0.9968)
## No Information Rate : 0.75
## P-Value [Acc > NIR] : 0.3671
##
## Kappa : 0.7143
##
## Mcnemar's Test P-Value : 1.0000
##
## Sensitivity : 0.8333
## Specificity : 1.0000
## Pos Pred Value : 1.0000
## Neg Pred Value : 0.6667
## Prevalence : 0.7500
## Detection Rate : 0.6250
## Detection Prevalence : 0.6250
## Balanced Accuracy : 0.9167
##
## 'Positive' Class : BLACK
##
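A note on reading these statistics: `caret::confusionMatrix()` treats the rows of a table as predictions and the columns as the reference, whereas `table(label, prediction)` places the actual labels in the rows. Accuracy is unaffected by the orientation, but the class-wise statistics swap roles. The base-R sketch below recomputes a few metrics directly from the counts printed above, with the actual labels kept in the rows; the object names are hypothetical.

```r
# Counts copied from the output above: rows = actual label, columns = prediction
cm <- matrix(c(5, 1, 0, 2), nrow = 2,
             dimnames = list(actual = c("BLACK", "BLUE"),
                             predicted = c("BLACK", "BLUE")))

# Share of testing observations classified correctly
accuracy <- sum(diag(cm)) / sum(cm)

# With BLACK as the positive class:
# recall (sensitivity) = correct BLACK / all actual BLACK
recall_black <- cm["BLACK", "BLACK"] / sum(cm["BLACK", ])
# precision (positive predictive value) = correct BLACK / all predicted BLACK
precision_black <- cm["BLACK", "BLACK"] / sum(cm[, "BLACK"])
# specificity = correct BLUE / all actual BLUE
specificity_black <- cm["BLUE", "BLUE"] / sum(cm["BLUE", ])
```

Under this orientation the recall for BLACK is 5/5 and the precision is 5/6, so the value printed as Sensitivity (0.8333) corresponds to precision and the value printed as Pos Pred Value (1.0000) corresponds to recall. Passing `table(prediction, label)` to `confusionMatrix()` would likely align the printed names with their usual meaning.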
5 Modeling with LOOCV
5.1 Train Model
# Fit one decision tree per training row, each time leaving that row out
model_loocv_train <- do.call('rbind', lapply(1:nrow(data_train), FUN = function(row_id, data = data_train){
  # base model trained on all rows except the current one
  dt_model <- decision_tree() %>%
    set_engine('rpart') %>%
    set_mode('classification') %>%
    fit(label ~ ., data = data[-row_id, ])
  # keep the fold id, the fitted model, and the held-out label (column 3)
  list(fold = row_id, model = dt_model, actual = data[row_id, 3])
})) %>% data.frame()
5.2 Make Prediction
model_loocv_pred <- do.call('rbind', lapply(model_loocv_train$model, FUN = function(m, data = data_test){
  # predict the testing set with each leave-one-out model
  predict(m, data) %>%
    cbind(data) %>%
    rename(prediction = '.pred_class') %>%
    select(-prediction, prediction)  # move the prediction column to the end
})) %>%
  # make the final prediction by majority vote across all models
  mutate(label_fctr = ifelse(label == 'BLUE', 0, 1),
         prediction_fctr = ifelse(prediction == 'BLUE', 0, 1)) %>%
  group_by(X, Y, label, label_fctr) %>%
  # share of models voting BLACK for each testing observation
  summarise(prob = sum(prediction_fctr)/n()) %>%
  mutate(prediction = ifelse(prob > 0.5, 'BLACK', 'BLUE')) %>%
  select(X, Y, label, prediction)
## `summarise()` regrouping output by 'X', 'Y', 'label' (override with `.groups` argument)
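The code above trains one tree per left-out training row and then lets those trees vote on the testing set. The more common LOOCV estimate instead scores each tree on its own held-out row and averages the results, which is the loop outlined in the instructions. A minimal sketch of that variant is shown below; it uses `rpart` directly instead of the parsnip wrapper, and the toy data, seed, and object names are hypothetical.

```r
library(rpart)  # recommended package bundled with standard R installations

set.seed(622)  # hypothetical seed, only for reproducibility

# Hypothetical toy data with the same column names as the homework data
toy <- data.frame(
  X = runif(30),
  Y = factor(sample(c("a", "b", "c"), 30, replace = TRUE))
)
toy$label <- factor(ifelse(toy$X > 0.5, "BLUE", "BLACK"))

# For each row: train on all other rows, predict the held-out row,
# and record whether the prediction matches the actual label
loocv_hits <- vapply(seq_len(nrow(toy)), function(i) {
  fit <- rpart(label ~ ., data = toy[-i, ], method = "class")
  pred <- predict(fit, toy[i, , drop = FALSE], type = "class")
  as.character(pred) == as.character(toy$label[i])
}, logical(1))

# The LOOCV accuracy is the average over all held-out rows
loocv_accuracy <- mean(loocv_hits)
```

This variant needs no separate testing set, since every observation serves as the test case exactly once.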
5.3 Confusion Matrix and Statistics for Testing Set
# Cross-tabulate actual labels (rows) against voted predictions (columns)
loocv_cfm <- table(model_loocv_pred$label, model_loocv_pred$prediction) %>%
  confusionMatrix()
loocv_cfm
## Confusion Matrix and Statistics
##
##
## BLACK BLUE
## BLACK 3 2
## BLUE 1 2
##
## Accuracy : 0.625
## 95% CI : (0.2449, 0.9148)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : 0.3633
##
## Kappa : 0.25
##
## Mcnemar's Test P-Value : 1.0000
##
## Sensitivity : 0.7500
## Specificity : 0.5000
## Pos Pred Value : 0.6000
## Neg Pred Value : 0.6667
## Prevalence : 0.5000
## Detection Rate : 0.3750
## Detection Prevalence : 0.6250
## Balanced Accuracy : 0.6250
##
## 'Positive' Class : BLACK
##
6 Performance Comparison with Result in HW1
In Homework 1, the model with the best capacity to learn was KNN with K = 3 (accuracy 1.00 on the training set but 0.625 on the testing set), and the model that generalized best was Naive Bayes (accuracy 0.875 on the testing set). See Home_Work_1_Submission for reference.
test_acc_knn3 <- 0.625
test_acc_nb <- 0.875
# Build the table with data.frame() rather than cbind(): cbind() would coerce
# the accuracies to character, so they would be sorted as text
data.frame(Model = c('KNN_3',
                     'NAIVE BAYES',
                     'BAGGING',
                     'LOOCV'),
           Accuracy_on_Testing_Set = c(test_acc_knn3,
                                       test_acc_nb,
                                       bagging_cfm$overall[['Accuracy']],
                                       loocv_cfm$overall[['Accuracy']])) %>%
  arrange(desc(Accuracy_on_Testing_Set))
6.1 Summary
- With the same base model (decision tree with engine rpart), Bagging performs better than LOOCV on the testing set (0.875 vs. 0.625). This may be because aggregating trees fitted on bootstrap samples stabilizes the prediction, mainly by reducing variance, which matters on a small data set.
- Bagging has the same testing accuracy as Naive Bayes (0.875). This result may change with a different data set, but it also hints that on small data Naive Bayes performs well compared with more complex models.
- LOOCV matches KNN_3 on the testing set (0.625) rather than improving on it. This may be because LOOCV does not perform re-sampling, so its trees are fitted on nearly identical data and remain strictly limited by the size of the sample.