Section III: Data-Splitting

  1. The first task is to randomly split X.full.zip and Y.full.zip into a training set and a test set. The test set should be approximately 20% of the full dataset.
## Solution goes here ---------
zip3 <- read.table("~/Desktop/5241/Lab2/zip3.txt", header = FALSE, sep = ",", dec = ".")
zip5 <- read.table("~/Desktop/5241/Lab2/zip5.txt", header = FALSE, sep = ",", dec = ".")

# Combine the two digit datasets
X_full <- rbind(zip3, zip5)
# Response vector
Y_full <- c(rep("Three", nrow(zip3)), rep("Five", nrow(zip5)))
# The target variable is a factor
Y_full <- as.factor(Y_full)
# Check that the dimensions match
#print(dim(X_full))
#print(length(Y_full))
# Check for missing values
#sum(is.na(X_full))
#sum(is.na(Y_full))
# Make sure there are no other data problems
#str(X_full)
#str(Y_full)
# Look at the first few rows
#head(X_full)
#head(Y_full)
#install.packages('caret')
library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
set.seed(42)
splitIndex <- createDataPartition(Y_full, p = 0.8, list = FALSE)
X_train <- X_full[splitIndex, ]
X_test <- X_full[-splitIndex, ]
Y_train <- Y_full[splitIndex]
Y_test <- Y_full[-splitIndex]


print(dim(X_train))
## [1] 972 256
print(dim(X_test))
## [1] 242 256
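
The split above keeps the class balance similar in both parts. A minimal base-R sketch of the same idea (stratified sampling within each class) is shown here on made-up data; `X_toy`, `Y_toy`, and `train_idx` are hypothetical names, and this is not meant to reproduce the exact rows that `createDataPartition()` selects.

```r
# Hypothetical toy data standing in for X_full / Y_full (not the zip-code digits).
set.seed(1)
X_toy <- matrix(rnorm(100 * 4), nrow = 100, ncol = 4)
Y_toy <- factor(rep(c("Three", "Five"), times = c(60, 40)))

# Sample about 80% of the row indices within each class.
train_idx <- unlist(lapply(split(seq_along(Y_toy), Y_toy), function(idx) {
  idx[sample.int(length(idx), size = round(0.8 * length(idx)))]
}))

X_toy_train <- X_toy[train_idx, , drop = FALSE]
X_toy_test  <- X_toy[-train_idx, , drop = FALSE]
Y_toy_train <- Y_toy[train_idx]
Y_toy_test  <- Y_toy[-train_idx]

# Class proportions should be similar in both parts.
print(prop.table(table(Y_toy_train)))
print(prop.table(table(Y_toy_test)))
```

Sampling within each class, rather than from all rows at once, avoids a test set that happens to under-represent one digit.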

Section V: Modify KNN Function

  1. Your second task is to modify KNN.decision() so it generalizes to any binary classifier. Technically we will only consider numeric features, i.e., no categorical training data. To accomplish this task, your modified kNN-function must be able to: (i) train the model for \(p>0\) features, and (ii) classify several test cases at once. Also name your modified function something different than KNN.decision().
## Solution goes here ---------
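# Modified kNN function: handles p > 0 numeric features and several test cases at once
kNN_modified <- function(X_train, Y_train, X_test, K = 5) {
  # Coerce to numeric matrices; subtracting a one-row data frame from a
  # matrix does not recycle the test case across the training cases
  X_train <- as.matrix(X_train)
  X_test <- as.matrix(X_test)
  predictions <- sapply(seq_len(nrow(X_test)), function(i) {
    # Euclidean distance from test case i to every training case
    dists <- sqrt(colSums((t(X_train) - X_test[i, ])^2))
    neighbors <- order(dists)[1:K]
    neighb_labels <- Y_train[neighbors]
    # Majority vote among the K nearest neighbors
    return(names(which.max(table(neighb_labels))))
  })
  return(predictions)
}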

# Misclassification error rate: share of cases where the prediction differs from the truth
calculate_errors <- function(Y_true, Y_pred) {
  return(sum(Y_true != Y_pred) / length(Y_true))
}
print(calculate_errors)
## function(Y_true, Y_pred) {
##   return(sum(Y_true != Y_pred) / length(Y_true))
## }
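
Before running a kNN function on the 256-pixel data, it can help to check it on a case small enough to verify by eye. The sketch below is one possible way to do that; `knn_sketch`, `train_pts`, `train_lab`, and `test_pts` are hypothetical names introduced for illustration and are not part of the lab's code.

```r
# Hypothetical helper (not the lab's KNN.decision): majority-vote kNN
# with Euclidean distance on numeric matrices.
knn_sketch <- function(X_train, Y_train, X_test, K = 5) {
  X_train <- as.matrix(X_train)
  X_test  <- as.matrix(X_test)
  Y_train <- as.factor(Y_train)
  vapply(seq_len(nrow(X_test)), function(i) {
    # sweep() subtracts test row i from every training row.
    diffs <- sweep(X_train, 2, X_test[i, ])
    dists <- sqrt(rowSums(diffs^2))
    votes <- table(Y_train[order(dists)[seq_len(K)]])
    names(votes)[which.max(votes)]
  }, character(1))
}

# Two well-separated clusters in two dimensions.
train_pts <- rbind(c(0, 0), c(0, 1), c(1, 0), c(5, 5), c(5, 6), c(6, 5))
train_lab <- c("A", "A", "A", "B", "B", "B")
test_pts  <- rbind(c(0.5, 0.5), c(5.5, 5.5))

print(knn_sketch(train_pts, train_lab, test_pts, K = 3))
```

The first test point sits inside the "A" cluster and the second inside the "B" cluster, so any other answer would point to a problem in the distance or voting step.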
  1. For the third task, use your kNN function to compute the test error and training error based on the split data from Section III. Choose \(K=5\) to compute the test error and training error.
## Solution goes here ---------
K <- 5
Y_train_pred <- kNN_modified(X_train, Y_train, X_train, K)
Y_test_pred <- kNN_modified(X_train, Y_train, X_test, K)

train_error <- calculate_errors(Y_train, Y_train_pred)
test_error <- calculate_errors(Y_test, Y_test_pred)

print(paste("Training error for K =", K, ":", train_error))
## [1] "Training error for K = 5 : 0.457818930041152"
print(paste("Test error for K =", K, ":", test_error))
## [1] "Test error for K = 5 : 0.458677685950413"
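
A single error rate hides which class is being misclassified. As a hedged sketch, a confusion matrix from `table()` shows the breakdown, and the majority-class baseline gives a reference point for judging the error rate. The labels below are hypothetical and chosen only for illustration; they are not results from the lab data.

```r
# Hypothetical labels for illustration only; not results from the lab data.
y_true <- factor(c("Five", "Five", "Three", "Three", "Three", "Five"),
                 levels = c("Five", "Three"))
y_pred <- factor(c("Five", "Three", "Three", "Three", "Five", "Five"),
                 levels = c("Five", "Three"))

# Rows are the true classes, columns the predicted classes.
conf_mat <- table(Truth = y_true, Predicted = y_pred)
print(conf_mat)

# Off-diagonal cells are the misclassifications.
error_rate <- 1 - sum(diag(conf_mat)) / sum(conf_mat)
print(error_rate)

# Error of always predicting the most common class, as a reference point.
baseline_error <- 1 - max(prop.table(table(y_true)))
print(baseline_error)
```

An error rate close to the baseline suggests the classifier is doing little better than always guessing the larger class.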

Section VI: Tuning Parameter

  1. The final task requires students to compute the training error and test error for several odd values of \(k\). Plot both training and test error as a function of \(k\). Try choosing values of \(k\) at least equal to the vector \(1,3,5,7,9,11\).
## Solution goes here ---------
train_errors <- numeric()
test_errors <- numeric()
k_values <- c(1, 3, 5, 7, 9, 11)

for (k in k_values) {
  Y_train_pred <- kNN_modified(X_train, Y_train, X_train, K = k)
  Y_test_pred <- kNN_modified(X_train, Y_train, X_test, K = k)
  train_errors <- c(train_errors, calculate_errors(Y_train, Y_train_pred))
  test_errors <- c(test_errors, calculate_errors(Y_test, Y_test_pred))
}

# Long format: all training errors first, then all test errors
error_data <- data.frame(
  K = rep(k_values, times = 2),
  Error = c(train_errors, test_errors),
  Type = rep(c("Training", "Test"), each = length(k_values))
)
ggplot(error_data, aes(x = K, y = Error, color = Type)) +
  geom_line() +
  geom_point() +
  theme_minimal() +
  labs(title = "Training and Test Error Rates vs. K",
       x = "K Value",
       y = "Error Rate")
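
Once the error curves are computed, one simple way to read a tuning value off them is to take the k with the smallest test error. The sketch below uses hypothetical error rates purely to show the mechanics; `k_grid` and `test_err_ex` are illustrative names, and the real `k_values` and `test_errors` vectors would be substituted in practice.

```r
# Hypothetical error rates for illustration; substitute the vectors computed above.
k_grid      <- c(1, 3, 5, 7, 9, 11)
test_err_ex <- c(0.030, 0.025, 0.025, 0.035, 0.040, 0.045)

# which.min() returns the first minimum, so ties go to the smaller k.
best_k <- k_grid[which.min(test_err_ex)]
print(best_k)
```

Breaking ties toward the smaller k is only one convention; preferring the larger k gives a smoother decision boundary and is equally defensible.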