We start by reading in the data from homework 1:
# Read in our data
rawdata <- read_csv("hw1.csv", col_types = "dff")
Let’s plot the data to refresh our memory about it:
# Plot
rawdata %>% ggplot(aes(x = x, y = y, col = label)) + geom_point() +
  scale_color_manual(name = "Label", values = c("BLUE" = "blue", "BLACK" = "black"))
We have one continuous predictor (x) and one categorical predictor (y). To make our modeling easier, we will recode the categorical predictor as a set of dummy (0/1 indicator) variables:
# Change the variable y to dummy variables
dat <- rawdata %>% select(label, x) %>%
  bind_cols(as.data.frame(psych::dummy.code(rawdata$y)))
dat %>% kbl() %>% kable_styling(bootstrap_options = "striped") %>%
  scroll_box(width = "600px", height = "200px")
| label | x | a | b | c | d | e | f |
|---|---|---|---|---|---|---|---|
| BLUE | 5 | 1 | 0 | 0 | 0 | 0 | 0 |
| BLACK | 5 | 0 | 1 | 0 | 0 | 0 | 0 |
| BLUE | 5 | 0 | 0 | 1 | 0 | 0 | 0 |
| BLACK | 5 | 0 | 0 | 0 | 1 | 0 | 0 |
| BLACK | 5 | 0 | 0 | 0 | 0 | 1 | 0 |
| BLACK | 5 | 0 | 0 | 0 | 0 | 0 | 1 |
| BLUE | 19 | 1 | 0 | 0 | 0 | 0 | 0 |
| BLUE | 19 | 0 | 1 | 0 | 0 | 0 | 0 |
| BLUE | 19 | 0 | 0 | 1 | 0 | 0 | 0 |
| BLUE | 19 | 0 | 0 | 0 | 1 | 0 | 0 |
| BLACK | 19 | 0 | 0 | 0 | 0 | 1 | 0 |
| BLUE | 19 | 0 | 0 | 0 | 0 | 0 | 1 |
| BLACK | 35 | 1 | 0 | 0 | 0 | 0 | 0 |
| BLACK | 35 | 0 | 1 | 0 | 0 | 0 | 0 |
| BLUE | 35 | 0 | 0 | 1 | 0 | 0 | 0 |
| BLACK | 35 | 0 | 0 | 0 | 1 | 0 | 0 |
| BLACK | 35 | 0 | 0 | 0 | 0 | 1 | 0 |
| BLACK | 35 | 0 | 0 | 0 | 0 | 0 | 1 |
| BLACK | 51 | 1 | 0 | 0 | 0 | 0 | 0 |
| BLACK | 51 | 0 | 1 | 0 | 0 | 0 | 0 |
| BLUE | 51 | 0 | 0 | 1 | 0 | 0 | 0 |
| BLACK | 51 | 0 | 0 | 0 | 1 | 0 | 0 |
| BLACK | 51 | 0 | 0 | 0 | 0 | 1 | 0 |
| BLACK | 51 | 0 | 0 | 0 | 0 | 0 | 1 |
| BLACK | 55 | 1 | 0 | 0 | 0 | 0 | 0 |
| BLACK | 55 | 0 | 1 | 0 | 0 | 0 | 0 |
| BLACK | 55 | 0 | 0 | 1 | 0 | 0 | 0 |
| BLACK | 55 | 0 | 0 | 0 | 1 | 0 | 0 |
| BLACK | 55 | 0 | 0 | 0 | 0 | 1 | 0 |
| BLACK | 55 | 0 | 0 | 0 | 0 | 0 | 1 |
| BLACK | 63 | 1 | 0 | 0 | 0 | 0 | 0 |
| BLUE | 63 | 0 | 1 | 0 | 0 | 0 | 0 |
| BLUE | 63 | 0 | 0 | 1 | 0 | 0 | 0 |
| BLUE | 63 | 0 | 0 | 0 | 1 | 0 | 0 |
| BLUE | 63 | 0 | 0 | 0 | 0 | 1 | 0 |
| BLUE | 63 | 0 | 0 | 0 | 0 | 0 | 1 |
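As a quick cross-check of the dummy coding, the same 0/1 columns could also be produced with base R's model.matrix() (a sketch only; dummies and dat_alt are names introduced here for illustration, and it assumes y is a factor as read in above):

# Sketch: equivalent one-hot encoding with base R's model.matrix()
# (dropping the intercept with "- 1" gives every level of y its own column)
dummies <- model.matrix(~ y - 1, data = rawdata)
colnames(dummies) <- levels(rawdata$y)   # rename "ya".."yf" to "a".."f"
dat_alt <- rawdata %>% select(label, x) %>% bind_cols(as.data.frame(dummies))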
Now we move on to our first method, a bagged classification tree:
# Create the bagged tree model
set.seed(735923)
model.bag <- bagging(label ~ ., data = dat, coob=T)
model.bag
##
## Bagging classification trees with 25 bootstrap replications
##
## Call: bagging.data.frame(formula = label ~ ., data = dat, coob = T)
##
## Out-of-bag estimate of misclassification error: 0.2222
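Because each bootstrap sample leaves out roughly a third of the observations, every row can be scored by the trees that never saw it, which is where the out-of-bag estimate comes from. As a sketch (assuming ipred stores this value in the err component of the fitted object when coob = TRUE), the same number can be pulled out directly:

# Out-of-bag misclassification estimate stored on the fitted object
# (assumes ipred's classbagg object exposes it as $err when coob = TRUE)
model.bag$err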
The out-of-bag error estimate is 0.2222, which is decent. Next, let's estimate the test error of a single classification tree with leave-one-out cross-validation (LOOCV) and see how that error rate compares.
First we will write a function to perform the leave-one-out cross-validation:
# Function to perform LOOCV
LOOCV <- function(x){
  # Accuracy for each held-out observation
  acc <- rep(0, nrow(x))
  for(i in 1:nrow(x)){
    train <- x[-i, ]
    test  <- x[i, ]
    # Fit the tree on every row except the held-out one
    mod  <- rpart(label ~ ., data = train)
    pred <- predict(mod, newdata = test, type = "class")
    acc[i] <- ifelse(pred == test$label, 1, 0)
  }
  # Return test error
  return(1 - mean(acc))
}
Now we will use the function we just wrote to run the cross-validation and compute the average error rate:
# Error rate
LOOCV(dat)
## [1] 0.3333333
LOOCV estimates the single tree's mean test error rate at 0.3333, which is a bit higher than the out-of-bag estimate for the bagged model above.
Comparing the results above to those obtained in Homework 1:

The estimated test error rates for the methods above (the bagged tree and the single tree evaluated with LOOCV) are lower than those obtained in Homework 1 (Logistic Regression, Naive Bayes, and KNN). The Homework 1 classifiers appeared to overfit the training data, which is likely why they performed much more poorly on the test data.

The bagging method also appears to work better than the single tree evaluated with LOOCV. I suspect this is because the models fit in successive LOOCV iterations are highly correlated: each training set differs by only one observation. Because bagging resamples with replacement, the bootstrapped trees are more varied, and averaging their votes suffers less from that problem (see the sketch below for a more direct comparison).
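To make that comparison more direct, the same leave-one-out loop could be pointed at both learners. Below is a sketch of a generalized version of the LOOCV function above that accepts a model-fitting function (LOOCV_generic and fit_fun are names introduced here for illustration; it assumes dat, rpart(), and ipred::bagging() are available as loaded above):

# Sketch: LOOCV wrapper that takes a fitting function, so the same
# resampling scheme can score both the single tree and the bagged ensemble
LOOCV_generic <- function(x, fit_fun){
  acc <- rep(0, nrow(x))
  for(i in 1:nrow(x)){
    train <- x[-i, ]
    test  <- x[i, ]
    mod   <- fit_fun(train)                          # fit on n - 1 rows
    pred  <- predict(mod, newdata = test, type = "class")
    acc[i] <- as.numeric(pred == test$label)         # 1 if correct, 0 if not
  }
  1 - mean(acc)                                      # mean test error
}

# Single tree vs. bagged trees under the same leave-one-out scheme
LOOCV_generic(dat, function(d) rpart(label ~ ., data = d))
LOOCV_generic(dat, function(d) bagging(label ~ ., data = d))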