Classifiers create decision boundaries to discriminate between classes. Different classifiers are able to create different shapes of decision boundaries (e.g., some are strictly linear) and thus some classifiers may perform better for certain datasets. This week let’s visualize the decision boundaries found by several popular classification methods.
boundary Function for Decision Boundaries
The following function plots the training data and adds the decision boundary by evaluating the classifier at evenly spaced grid points. Note that a low resolution (used to make evaluation faster) will make the decision boundary look like it has small steps even if it is actually a straight line.
# input:
#   model: a fitted classification model
#   data: the training set (the first two columns are used as predictors)
#   class: name of the response variable column
boundary <- function(model, data, class = NULL, predict_type = "class",
                     resolution = 100, showgrid = TRUE, ...) {
  if(!is.null(class)) cl <- data[, class] else cl <- 1
  data <- data[, 1:2]
  k <- length(unique(cl))

  # plot the training points, colored and shaped by class
  plot(data, col = as.integer(cl) + 1L, pch = as.integer(cl) + 1L, ...)

  # make a grid of evenly spaced points covering the range of both predictors
  r <- sapply(data, range, na.rm = TRUE)
  xs <- seq(r[1, 1], r[2, 1], length.out = resolution)
  ys <- seq(r[1, 2], r[2, 2], length.out = resolution)
  g <- cbind(rep(xs, each = resolution), rep(ys, times = resolution))
  colnames(g) <- colnames(r)
  g <- as.data.frame(g)

  ### guess how to get class labels from predict
  ### (unfortunately not very consistent between models)
  p <- predict(model, g, type = predict_type)
  if(is.list(p)) p <- p$class
  p <- as.factor(p)

  # show the predicted class for every grid point
  if(showgrid) points(g, col = as.integer(p) + 1L, pch = ".")

  # draw the boundary as contour lines between class regions
  z <- matrix(as.integer(p), nrow = resolution, byrow = TRUE)
  contour(xs, ys, z, add = TRUE, drawlabels = FALSE,
          lwd = 2, levels = (1:(k - 1)) + .5)

  invisible(z)
}
Let’s use the iris dataset as an example. For easier visualization, we use only two dimensions of the iris dataset: Sepal.Length and Sepal.Width.
set.seed(1)
data(iris)
# only use two predictors
x <- iris[1:150, c("Sepal.Length", "Sepal.Width", "Species")]
plot(x[,1:2], col = x[,3])
For logistic regression, let’s focus on only two classes (setosa and versicolor, the first 100 rows) since logistic regression can only handle binary outcomes.
model1 <- glm(Species ~., data = x[1:100,], family=binomial(link='logit'))
class(model1) <- c("lr", class(model1))
# specify the cutoff point for prediction
predict.lr <- function(object, newdata, ...)
  predict.glm(object, newdata, type = "response") > .5
boundary(model1, x[1:100,], class = "Species", main = "Logistic Regression")
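As a quick illustration of the resolution note above, the same boundary can be redrawn on a finer grid; resolution = 200 is only an example value, and evaluation takes correspondingly longer since resolution^2 grid points are predicted.
# same model, finer 200 x 200 grid for a smoother-looking boundary
boundary(model1, x[1:100,], class = "Species",
         resolution = 200, main = "Logistic Regression (finer grid)")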
library(caret)
# K=1
model2 <- knn3(Species ~ ., data=x, k = 1)
boundary(model2, x, class = "Species", main = "kNN (k = 1)")
# K=10
model3 <- knn3(Species ~ ., data=x, k = 10)
boundary(model3, x, class = "Species", main = "kNN (k = 10)")
library(MASS)
model4 <- lda(Species ~ ., data=x)
boundary(model4, x, class = "Species", main = "LDA")
library(tree)
model5 <- tree(Species ~ ., data=x)
boundary(model5, x, class = "Species", main = "Decision Tree")
library(randomForest)
# Bagging is a random forest with m = p (all predictors considered at each split)
# set mtry = 2 since we have two predictors
model6 <- randomForest(Species ~ ., data=x, mtry = 2, importance = TRUE)
boundary(model6, x, class = "Species", main = "Bagging")
library(randomForest)
model7 <- randomForest(Species ~ ., data=x)
boundary(model7, x, class = "Species", main = "Random Forest")
library(gbm)
# distribution = "multinomial" makes this a classification model for the three Species classes
# n.trees = 1000 means fitting 1000 classification trees
# interaction.depth = 4 controls the maximum depth of each tree
model8 <- gbm(Species ~ ., data = x, distribution = "multinomial",
              n.trees = 1000, interaction.depth = 4)
class(model8) <- c("lr", class(model8))
# specify the prediction result: the class with the highest predicted probability
# (note that this overwrites the predict.lr method defined above for logistic regression)
predict.lr <- function(object, newdata, ...) {
  c("setosa", "versicolor", "virginica")[
    apply(predict.gbm(object, newdata, type = "response"), 1, which.max)]
}
boundary(model8, x, class = "Species", main = "Boosting")
library(e1071)
# Linear Kernel
model9 <- svm(Species ~ ., data=x, kernel="linear")
boundary(model9, x, class = "Species", main = "SVM (linear)")
# polynomial kernel
model10 <- svm(Species ~ ., data = x, kernel = "polynomial")
boundary(model10, x, class = "Species", main = "SVM (polynomial)")
Use the following code to simulate our dataset. There are two numerical variables, X1 and X2. The response variable Y has two classes: A and B. Make sure to specify set.seed(1).
library(ggplot2)
set.seed(1)
x <- rnorm(100,20,8)
y <- rnorm(100,20,8)
z <- as.factor(ifelse((x-20)^2+(y-20)^2<10^2, "A","B"))
data <- data.frame(X1=x, X2 = y, Y = z)
# two predictors: X1, X2
# Binary response: Y
ggplot(data) + geom_point(aes(x = X1, y = X2, color = Y))
Now you are going to fit a classifier to this data. Try the following models:
For each model,
Make any necessary adjustments to the data based on the model's assumptions.
Fit the classification model and tune its parameter(s) in your own way (write a short paragraph describing your tuning procedure; no need to report accuracy or any other error metric). One possible tuning procedure is sketched below.
Generate the decision boundary for the model. Write a short paragraph on the shape of the decision boundary.
In this study, the actual boundary is known: z <- as.factor(ifelse((x-20)^2+(y-20)^2<10^2, "A","B")) defines a circle of radius 10 centered at (20, 20). Compare the performance of each model against this actual boundary (one way to overlay it on a plot is sketched below). Discuss which model you think is best in this study.
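For the tuning step, here is a minimal sketch of one possible procedure (not the required one), assuming caret's cross-validation is used; the grid of k values and the object names (ctrl, knn_fit, model_knn) are only illustrative.
library(caret)
set.seed(1)
# 10-fold cross-validation; accuracy is caret's default metric for classification
ctrl <- trainControl(method = "cv", number = 10)
# try odd values of k from 1 to 25 (an illustrative grid)
knn_fit <- train(Y ~ ., data = data, method = "knn",
                 trControl = ctrl, tuneGrid = data.frame(k = seq(1, 25, 2)))
knn_fit$bestTune
# refit with the selected k and draw its decision boundary
model_knn <- knn3(Y ~ ., data = data, k = knn_fit$bestTune$k)
boundary(model_knn, data, class = "Y", main = "kNN (tuned k)")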
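For the comparison with the actual boundary, one simple option is to overlay the known circle of radius 10 centered at (20, 20) on a model's decision boundary plot. model_knn here is just the tuned model from the sketch above; any other fitted model can be substituted.
# draw a model's decision boundary, then add the true circular boundary
boundary(model_knn, data, class = "Y", main = "Fitted vs. true boundary")
theta <- seq(0, 2 * pi, length.out = 200)
lines(20 + 10 * cos(theta), 20 + 10 * sin(theta), lty = 2, lwd = 2)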