I read in the data, set the seed, and split the data into a training and a test group. Because the dataset is small (only 36 instances), I decided to put 60% of the data in the training set and 40% in the test set. The reason is that one of the variables is a factor with 6 levels, and I would like each level to be represented in the test set.
# Packages: dplyr/stringr for cleaning, caret for the split and metrics,
# ipred for bagging, pROC for the ROC curves
library(dplyr); library(stringr); library(caret); library(ipred); library(pROC)

file <- read.csv('hw1.csv')
# Strip stray whitespace from every column
file <- file %>% sapply(FUN = str_trim) %>% as.data.frame()
file$X <- as.integer(as.character(file$X))
# One-hot encode the categorical Y column (kept for reference; the factor form is used below)
dummies <- dummyVars(" ~ .", data = data.frame(data = file$Y))
dummy_df <- data.frame(predict(dummies, newdata = data.frame(data = file$Y)))
data_dummies <- cbind(file[,1], dummy_df)
# Predictors are the first two columns; the third column (BLACK/BLUE) is the target
data <- file[,1:2]
targets <- factor(file[,3])
set.seed(12345)
test_index <- createDataPartition(targets, p=0.4, list=F)
train_data <- data[-test_index,]
train_targets <- targets[-test_index]
test_data <- data[test_index,]
test_targets <- targets[test_index]
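As a quick sanity check of the rationale above, one can confirm that every level of the six-level factor actually landed in the test split. This is a minimal sketch and assumes the six-level variable is the Y column; adjust the name if it differs.

# Sanity check (assumes Y is the six-level factor)
table(test_data$Y)
all(unique(data$Y) %in% test_data$Y)   # TRUE if every level appears in the test set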
Here we train the bagged model with the ipredbagg() function from the ipred package and calculate metrics on both the training and test sets. Since the target variable is a factor, the weak learners are classification trees, which are then aggregated by majority vote to form the bagged ensemble.
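To make the idea concrete, here is a rough sketch of what bagging does, assuming rpart classification trees, 25 bootstrap resamples (matching ipredbagg's default nbagg), and majority voting. The helper names manual_bag and predict_manual_bag are mine and purely illustrative; this is not the ipred implementation.

# Illustrative sketch of bagging (not the ipred internals):
# grow one rpart tree per bootstrap resample, then combine the trees by majority vote.
library(rpart)

manual_bag <- function(y, X, nbagg = 25) {
  lapply(seq_len(nbagg), function(b) {
    idx <- sample(nrow(X), replace = TRUE)   # bootstrap resample of the training rows
    rpart(y ~ ., data = cbind(X[idx, , drop = FALSE], y = y[idx]), method = "class")
  })
}

predict_manual_bag <- function(trees, newdata) {
  votes <- sapply(trees, function(t) as.character(predict(t, newdata, type = "class")))
  votes <- matrix(votes, nrow = nrow(newdata))   # rows = observations, cols = trees
  apply(votes, 1, function(v) names(which.max(table(v))))   # majority vote per observation
}

The actual model below is fit with ipredbagg(), which handles the resampling and aggregation internally.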
# Fit the bagged model on the training split
bagged <- ipredbagg(train_targets, train_data)
# Class-probability predictions, used for the ROC/AUC calculations below
train_preds <- predict(bagged, newdata=train_data, type = "prob")
test_preds <- predict(bagged, newdata=test_data, type = "prob")
train_roc <- roc(as.integer(train_targets), train_preds[,1])$auc
## Setting levels: control = 1, case = 2
## Setting direction: controls > cases
test_roc <- roc(as.integer(test_targets), test_preds[,1])$auc
## Setting levels: control = 1, case = 2
## Setting direction: controls > cases
train_confusion <- confusionMatrix(predict(bagged, train_data), train_targets)
test_confusion <- confusionMatrix(predict(bagged, test_data), test_targets)
train_roc
## Area under the curve: 0.9904
test_roc
## Area under the curve: 0.6944
train_confusion
## Confusion Matrix and Statistics
##
## Reference
## Prediction BLACK BLUE
## BLACK 12 1
## BLUE 1 7
##
## Accuracy : 0.9048
## 95% CI : (0.6962, 0.9883)
## No Information Rate : 0.619
## P-Value [Acc > NIR] : 0.003952
##
## Kappa : 0.7981
##
## Mcnemar's Test P-Value : 1.000000
##
## Sensitivity : 0.9231
## Specificity : 0.8750
## Pos Pred Value : 0.9231
## Neg Pred Value : 0.8750
## Prevalence : 0.6190
## Detection Rate : 0.5714
## Detection Prevalence : 0.6190
## Balanced Accuracy : 0.8990
##
## 'Positive' Class : BLACK
##
test_confusion
## Confusion Matrix and Statistics
##
## Reference
## Prediction BLACK BLUE
## BLACK 5 2
## BLUE 4 4
##
## Accuracy : 0.6
## 95% CI : (0.3229, 0.8366)
## No Information Rate : 0.6
## P-Value [Acc > NIR] : 0.6098
##
## Kappa : 0.2105
##
## Mcnemar's Test P-Value : 0.6831
##
## Sensitivity : 0.5556
## Specificity : 0.6667
## Pos Pred Value : 0.7143
## Neg Pred Value : 0.5000
## Prevalence : 0.6000
## Detection Rate : 0.3333
## Detection Prevalence : 0.4667
## Balanced Accuracy : 0.6111
##
## 'Positive' Class : BLACK
##
We see that the model performs very well on the training set (AUC 0.99, 90% accuracy) but much worse on the test set (AUC 0.69, 60% accuracy). This indicates that the model is likely overfitting the training data, and that cross-validation, which lets each model train on more of the data, should give both a more reliable performance estimate and better out-of-sample results.
Now I will perform leave-one-out cross-validation (LOOCV), which is simply k-fold cross-validation with k equal to the number of instances, so each fold holds only 1 instance. The aggregated held-out metrics can be found below.
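For reference, caret can run the same kind of evaluation automatically. A minimal sketch, assuming the "treebag" method (bagged CART via ipred), trainControl(method = "LOOCV"), and a combined data frame cv_df of my own naming, would look like this; it is not run here.

# Hypothetical caret-driven equivalent of the manual loop below
ctrl <- trainControl(method = "LOOCV")
cv_df <- cbind(data, target = targets)
loocv_fit <- train(target ~ ., data = cv_df, method = "treebag", trControl = ctrl)
loocv_fit$results   # LOOCV accuracy and kappa aggregated over all held-out predictions

In what follows I instead write the loop by hand so the resampling and the per-fold predictions stay explicit.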
N <- length(targets)
models <- list()
preds <- c()
probs <- c()
for (i in 1:N){
  # Hold out instance i, train on the remaining N-1 instances
  X <- data[-i,]
  y <- targets[-i]
  test_x <- data[i,]
  test_y <- targets[i]
  bagged_train <- ipredbagg(y, X)
  prediction <- predict(bagged_train, newdata=test_x, type='prob')
  # Column 1 holds the probability of the first factor level (BLACK)
  pred <- ifelse(prediction[1] > 0.5, 'BLACK', 'BLUE')
  models[[i]] <- bagged_train          # keep each fold's model as its own list element
  preds <- append(preds, pred)
  probs <- append(probs, prediction[1])
}
jackknife_df <- data.frame(index=c(1:N), prediction=preds, probs=probs, target=targets)
jackknife_roc <- roc(as.integer(targets), jackknife_df$probs)$auc
## Setting levels: control = 1, case = 2
## Setting direction: controls > cases
jackknife_confusion <- confusionMatrix(targets, jackknife_df$prediction)
jackknife_roc
## Area under the curve: 0.8328
jackknife_confusion
## Confusion Matrix and Statistics
##
## Reference
## Prediction BLACK BLUE
## BLACK 17 5
## BLUE 4 10
##
## Accuracy : 0.75
## 95% CI : (0.578, 0.8788)
## No Information Rate : 0.5833
## P-Value [Acc > NIR] : 0.02898
##
## Kappa : 0.4808
##
## Mcnemar's Test P-Value : 1.00000
##
## Sensitivity : 0.8095
## Specificity : 0.6667
## Pos Pred Value : 0.7727
## Neg Pred Value : 0.7143
## Prevalence : 0.5833
## Detection Rate : 0.4722
## Detection Prevalence : 0.6111
## Balanced Accuracy : 0.7381
##
## 'Positive' Class : BLACK
##
We see that leave-one-out cross-validation improved measured performance over the simple train-test split when using a bagged model (75% accuracy and an AUC of 0.83 versus 60% and 0.69). This is to be expected: each model now trains on 35 of the 36 instances instead of 21, recovering the roughly 40% of the data that was previously held out as the test set. However, because the dataset is so small, these results are likely to be highly variable, and I would prefer to confirm them with more data.
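One cheap way to gauge that variability without new data is to repeat the random 60/40 split several times and look at the spread of the test accuracies. A minimal sketch, assuming 20 repeats and an arbitrary seed; this was not run as part of the analysis above.

# Hypothetical variability check: repeat the 60/40 split and record test accuracy
set.seed(2021)                      # arbitrary seed for the illustration
accs <- replicate(20, {
  idx <- createDataPartition(targets, p = 0.4, list = FALSE)
  fit <- ipredbagg(targets[-idx], data[-idx, ])
  mean(predict(fit, newdata = data[idx, ]) == targets[idx])
})
summary(accs)                       # the spread shows how much the 60% figure could move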
Interestingly, these models appeared to underperform both the naive Bayes and KNN algorithms on this dataset. Using jackknife (leave-one-out) cross-validation, the bagged model achieved 75% accuracy, whereas in my previous work both naive Bayes and KNN achieved approximately 88% accuracy on the test set. Again, these results are likely highly variable, but they could indicate that some structure in the data is better captured by KNN and naive Bayes than by decision trees.