Get the Data

We start by reading in the data from homework 1:

# Read in our data
rawdata <- read_csv("hw1.csv", col_types = "dff")
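
For reference, the chunks in this section lean on a few packages. A minimal setup, assuming they are already installed:

# Packages used throughout this section
library(tidyverse)   # read_csv(), dplyr verbs, ggplot2
library(ipred)       # bagging()
library(rpart)       # rpart()
library(kableExtra)  # kbl(), kable_styling(), scroll_box()
# psych is called below via psych::, so it only needs to be installed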

Let’s plot the data to refresh our memory about it:

# Plot
rawdata %>% ggplot(aes(x=x, y=y, col=label)) + geom_point() +
  scale_color_manual(name="Label",values=c("BLUE"="blue","BLACK"="black"))

We have one continuous predictor (x) and one categorical predictor (y). To make our modeling easier, we will convert the categorical predictor y to dummy variables:

# Change the variable y to dummy variables
dat <- rawdata %>% select(label, x) %>%
  bind_cols( as.data.frame(psych::dummy.code(rawdata$y)))
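
As an aside, base R's model.matrix() can produce an equivalent one-hot encoding; a sketch (the column names pick up a "y" prefix, which is why we keep the psych version above):

# Equivalent one-hot encoding in base R; the -1 drops the intercept column
dummies <- model.matrix(~ y - 1, data = rawdata)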

dat %>% kbl() %>% kable_styling(bootstrap_options = "striped") %>%
  scroll_box(width = "600px", height = "200px")
label x a b c d e f
BLUE 5 1 0 0 0 0 0
BLACK 5 0 1 0 0 0 0
BLUE 5 0 0 1 0 0 0
BLACK 5 0 0 0 1 0 0
BLACK 5 0 0 0 0 1 0
BLACK 5 0 0 0 0 0 1
BLUE 19 1 0 0 0 0 0
BLUE 19 0 1 0 0 0 0
BLUE 19 0 0 1 0 0 0
BLUE 19 0 0 0 1 0 0
BLACK 19 0 0 0 0 1 0
BLUE 19 0 0 0 0 0 1
BLACK 35 1 0 0 0 0 0
BLACK 35 0 1 0 0 0 0
BLUE 35 0 0 1 0 0 0
BLACK 35 0 0 0 1 0 0
BLACK 35 0 0 0 0 1 0
BLACK 35 0 0 0 0 0 1
BLACK 51 1 0 0 0 0 0
BLACK 51 0 1 0 0 0 0
BLUE 51 0 0 1 0 0 0
BLACK 51 0 0 0 1 0 0
BLACK 51 0 0 0 0 1 0
BLACK 51 0 0 0 0 0 1
BLACK 55 1 0 0 0 0 0
BLACK 55 0 1 0 0 0 0
BLACK 55 0 0 1 0 0 0
BLACK 55 0 0 0 1 0 0
BLACK 55 0 0 0 0 1 0
BLACK 55 0 0 0 0 0 1
BLACK 63 1 0 0 0 0 0
BLUE 63 0 1 0 0 0 0
BLUE 63 0 0 1 0 0 0
BLUE 63 0 0 0 1 0 0
BLUE 63 0 0 0 0 1 0
BLUE 63 0 0 0 0 0 1

Now we move on to our first method…

Bagged Trees

# Create the bagged tree model
set.seed(735923)

model.bag <- bagging(label ~ ., data = dat, coob = TRUE)

model.bag
## 
## Bagging classification trees with 25 bootstrap replications 
## 
## Call: bagging.data.frame(formula = label ~ ., data = dat, coob = TRUE)
## 
## Out-of-bag estimate of misclassification error:  0.2222

The out-of-bag estimate of the misclassification error is 0.2222, which is decent.
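
By default, bagging() uses 25 bootstrap replications; if we wanted a potentially more stable out-of-bag estimate, the nbagg argument controls this (a sketch, not run here):

# Same model with more bootstrap replications (results may differ)
model.bag100 <- bagging(label ~ ., data = dat, coob = TRUE, nbagg = 100)

Next, let's try the LOOCV method and see how that error rate compares: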

LOOCV

First, we write a function to perform the leave-one-out cross-validation:

# Function to perform LOOCV

LOOCV <- function(x){
  # Store a 0/1 accuracy result for each held-out row
  acc <- rep(0, nrow(x))
  
  for(i in 1:nrow(x)){
    train <- x[-i, ]
    test  <- x[i, ]
    # Fit on the n-1 training rows (not the full data), then predict the held-out row
    mod  <- rpart(label ~ ., data = train)
    pred <- predict(mod, newdata = test, type = "class")
    
    acc[i] <- ifelse(pred == test$label, 1, 0)
  }
  
  # Return the mean test error
  return(1 - mean(acc))
}

Now we use the function we just wrote to run the cross-validation and compute the overall error rate:

# Error rate
LOOCV(dat)
## [1] 0.3333333

The LOOCV method yields a mean test error rate of 0.3333, somewhat higher than the out-of-bag estimate from the bagged model above.
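
To keep the two estimates side by side (values copied from the outputs above):

# Collect the two error estimates for comparison
errors <- tibble(
  method = c("Bagged trees (OOB)", "LOOCV with rpart"),
  error  = c(0.2222, 0.3333)
)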

Comparison to Homework 1

Now we compare the results above with those obtained in homework 1: