We start by reading in the data from homework 1:
# Read in our data
rawdata <- read_csv("hw1.csv", col_types = "dff")
Let’s plot the data to refresh our memory about it:
# Plot
rawdata %>% ggplot(aes(x = x, y = y, col = label)) + geom_point() +
  scale_color_manual(name = "Label", values = c("BLUE" = "blue", "BLACK" = "black"))
We have one continuous predictor (x) and one categorical predictor (y). To make our modeling easier, we will recode the categorical predictor as a set of dummy (0/1 indicator) variables:
# Change the variable y to dummy variables
dat <- rawdata %>% select(label, x) %>%
  bind_cols(as.data.frame(psych::dummy.code(rawdata$y)))
dat %>% kbl() %>% kable_styling(bootstrap_options = "striped") %>%
  scroll_box(width = "600px", height = "200px")
| label | x | a | b | c | d | e | f |
|---|---|---|---|---|---|---|---|
| BLUE | 5 | 1 | 0 | 0 | 0 | 0 | 0 |
| BLACK | 5 | 0 | 1 | 0 | 0 | 0 | 0 |
| BLUE | 5 | 0 | 0 | 1 | 0 | 0 | 0 |
| BLACK | 5 | 0 | 0 | 0 | 1 | 0 | 0 |
| BLACK | 5 | 0 | 0 | 0 | 0 | 1 | 0 |
| BLACK | 5 | 0 | 0 | 0 | 0 | 0 | 1 |
| BLUE | 19 | 1 | 0 | 0 | 0 | 0 | 0 |
| BLUE | 19 | 0 | 1 | 0 | 0 | 0 | 0 |
| BLUE | 19 | 0 | 0 | 1 | 0 | 0 | 0 |
| BLUE | 19 | 0 | 0 | 0 | 1 | 0 | 0 |
| BLACK | 19 | 0 | 0 | 0 | 0 | 1 | 0 |
| BLUE | 19 | 0 | 0 | 0 | 0 | 0 | 1 |
| BLACK | 35 | 1 | 0 | 0 | 0 | 0 | 0 |
| BLACK | 35 | 0 | 1 | 0 | 0 | 0 | 0 |
| BLUE | 35 | 0 | 0 | 1 | 0 | 0 | 0 |
| BLACK | 35 | 0 | 0 | 0 | 1 | 0 | 0 |
| BLACK | 35 | 0 | 0 | 0 | 0 | 1 | 0 |
| BLACK | 35 | 0 | 0 | 0 | 0 | 0 | 1 |
| BLACK | 51 | 1 | 0 | 0 | 0 | 0 | 0 |
| BLACK | 51 | 0 | 1 | 0 | 0 | 0 | 0 |
| BLUE | 51 | 0 | 0 | 1 | 0 | 0 | 0 |
| BLACK | 51 | 0 | 0 | 0 | 1 | 0 | 0 |
| BLACK | 51 | 0 | 0 | 0 | 0 | 1 | 0 |
| BLACK | 51 | 0 | 0 | 0 | 0 | 0 | 1 |
| BLACK | 55 | 1 | 0 | 0 | 0 | 0 | 0 |
| BLACK | 55 | 0 | 1 | 0 | 0 | 0 | 0 |
| BLACK | 55 | 0 | 0 | 1 | 0 | 0 | 0 |
| BLACK | 55 | 0 | 0 | 0 | 1 | 0 | 0 |
| BLACK | 55 | 0 | 0 | 0 | 0 | 1 | 0 |
| BLACK | 55 | 0 | 0 | 0 | 0 | 0 | 1 |
| BLACK | 63 | 1 | 0 | 0 | 0 | 0 | 0 |
| BLUE | 63 | 0 | 1 | 0 | 0 | 0 | 0 |
| BLUE | 63 | 0 | 0 | 1 | 0 | 0 | 0 |
| BLUE | 63 | 0 | 0 | 0 | 1 | 0 | 0 |
| BLUE | 63 | 0 | 0 | 0 | 0 | 1 | 0 |
| BLUE | 63 | 0 | 0 | 0 | 0 | 0 | 1 |
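As a quick cross-check of the dummy coding, the same 0/1 columns could also be produced with base R's model.matrix() (a sketch only; dummies and dat_alt are names introduced here for illustration, and it assumes y is a factor as read in above):

# Sketch: equivalent one-hot encoding with base R's model.matrix()
# (dropping the intercept with "- 1" gives every level of y its own column)
dummies <- model.matrix(~ y - 1, data = rawdata)
colnames(dummies) <- levels(rawdata$y)   # rename "ya".."yf" to "a".."f"
dat_alt <- rawdata %>% select(label, x) %>% bind_cols(as.data.frame(dummies))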
Now we move on to our first method, a bagged classification tree:
# Create the bagged tree model
set.seed(735923)
model.bag <- bagging(label ~ ., data = dat, coob=T)
model.bag
##
## Bagging classification trees with 25 bootstrap replications
##
## Call: bagging.data.frame(formula = label ~ ., data = dat, coob = T)
##
## Out-of-bag estimate of misclassification error: 0.2222
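Because each bootstrap sample leaves out roughly a third of the observations, every row can be scored by the trees that never saw it, which is where the out-of-bag estimate comes from. As a sketch (assuming ipred stores this value in the err component of the fitted object when coob = TRUE), the same number can be pulled out directly:

# Out-of-bag misclassification estimate stored on the fitted object
# (assumes ipred's classbagg object exposes it as $err when coob = TRUE)
model.bag$err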
The out-of-bag error estimate is 0.2222, which is decent. Next, let's estimate the test error of a single classification tree with leave-one-out cross-validation (LOOCV) and see how that error rate compares.
First we will write a function to perform the leave-one-out cross-validation:
# Function to perform LOOCV
LOOCV <- function(x){
  # Accuracy for each held-out observation
  acc <- rep(0, nrow(x))
  for(i in 1:nrow(x)){
    train <- x[-i, ]
    test  <- x[i, ]
    # Fit the tree on every row except the held-out one
    mod  <- rpart(label ~ ., data = train)
    pred <- predict(mod, newdata = test, type = "class")
    acc[i] <- ifelse(pred == test$label, 1, 0)
  }
  # Return test error
  return(1 - mean(acc))
}
Now we will use the function we just wrote to run the cross-validation and compute the average error rate:
# Error rate
LOOCV(dat)
## [1] 0.3333333
LOOCV estimates the single tree's mean test error rate at 0.3333, which is a bit higher than the out-of-bag estimate for the bagged model above.
Comparing the results above to those obtained in Homework 1:

The estimated test error rates for the methods above (the bagged tree and the single tree evaluated with LOOCV) are lower than those obtained in Homework 1 (Logistic Regression, Naive Bayes, and KNN). The Homework 1 classifiers appeared to overfit the training data, which is likely why they performed much more poorly on the test data.

The bagging method also appears to work better than the single tree evaluated with LOOCV. I suspect this is because the models fit in successive LOOCV iterations are highly correlated: each training set differs by only one observation. Because bagging resamples with replacement, the bootstrapped trees are more varied, and averaging their votes suffers less from that problem (see the sketch below for a more direct comparison).
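To make that comparison more direct, the same leave-one-out loop could be pointed at both learners. Below is a sketch of a generalized version of the LOOCV function above that accepts a model-fitting function (LOOCV_generic and fit_fun are names introduced here for illustration; it assumes dat, rpart(), and ipred::bagging() are available as loaded above):

# Sketch: LOOCV wrapper that takes a fitting function, so the same
# resampling scheme can score both the single tree and the bagged ensemble
LOOCV_generic <- function(x, fit_fun){
  acc <- rep(0, nrow(x))
  for(i in 1:nrow(x)){
    train <- x[-i, ]
    test  <- x[i, ]
    mod   <- fit_fun(train)                          # fit on n - 1 rows
    pred  <- predict(mod, newdata = test, type = "class")
    acc[i] <- as.numeric(pred == test$label)         # 1 if correct, 0 if not
  }
  1 - mean(acc)                                      # mean test error
}

# Single tree vs. bagged trees under the same leave-one-out scheme
LOOCV_generic(dat, function(d) rpart(label ~ ., data = d))
LOOCV_generic(dat, function(d) bagging(label ~ ., data = d))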