Consider the well-known zipcode data set from the machine learning and data mining literature, which is available from the book website: http://www-stat.stanford.edu/ElemStatLearn. You can also find it on Canvas: the training data set is the file “zip.train.csv” and the testing data set is “zip.test.csv”. In the zipcode data, the first column is the response (Y) and the other columns are the independent variables (Xi’s). A detailed description can be found at http://statweb.stanford.edu/~tibs/ElemStatLearn/datasets/zip.info.txt. Here we consider only the classification problem between 2’s and 7’s; that is, denote by “ziptrain27” the subset of the training data with Y = 2 or Y = 7.
ziptrain <- read.table(
file = paste("C:/Users/aryal/OneDrive - Georgia Institute of Technology/",
"Georgia_tech_courses/ISYE_7406/HW1/zip.train.csv",
sep = ""),
sep = ","
)
ziptrain27 <- subset(ziptrain, ziptrain[,1]==2 | ziptrain[,1]==7);
head(ziptrain27,2);
As the output of dim() below shows, this subset contains 1376 observations and 257 columns. The first column is the response variable, and the other 256 columns are the independent variables. Let’s check how many observations correspond to Y = 2 and how many to Y = 7, since only those values are used as training data.
dim(ziptrain27);
## [1] 1376 257
sum(ziptrain27[,1] == 2);
## [1] 731
sum(ziptrain27[,1] == 7);
## [1] 645
As mentioned previously, this data set only includes examples where Y = 2 or Y = 7: there are 731 cases with Y = 2 and 645 cases with Y = 7.
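The separate sum() calls can also be collapsed into a single table() call. A minimal sketch on a small made-up data frame (the name `toy` and its values are illustrative only and are not taken from the zipcode data):

```r
# Toy stand-in for the zipcode data: the first column V1 is the digit label.
toy <- data.frame(V1 = c(2, 7, 3, 2, 7, 2, 9),
                  V2 = c(-1, 0.5, 0.2, -0.3, 1, 0, -1))

# Keep only the rows whose label is 2 or 7.
toy27 <- subset(toy, V1 %in% c(2, 7))

# One call gives the count of every label that remains.
counts <- table(toy27$V1)
print(counts)
```

On the real data, the same call would be table(ziptrain27[,1]).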
head(summary(ziptrain27),5);
head(round(cor(ziptrain27),2),5);
This code shows summary statistics and the correlations between the variables. Only the code is shown here; its output is omitted to save space.
rowindex = 55; ## You can try other "rowindex" values to see other rows
ziptrain27[rowindex,1];
## [1] 2
Xval = t(matrix(data.matrix(ziptrain27[,-1])[rowindex,],byrow=TRUE,16,16)[16:1,]);
image(Xval,col=gray(0:1),axes=FALSE) ## Also try "col=gray(0:32/32)"
image(Xval,col=gray(0:32/32),axes=FALSE) ## 33 gray levels show more detail than the two-level plot above
# Check the dimensions of the subsetted data
dim(ziptrain27); # Expected output: 1376 257
# Count the number of instances where the class label is 2
sum(ziptrain27[, 1] == 2);
# Summary statistics for the dataset
summary(ziptrain27);
# Compute and round the correlation matrix for the features
cor_matrix <- round(cor(ziptrain27), 2);
print(cor_matrix);
## 3. Visualizing a Digit Image
# Set the row index to visualize
rowindex <- 5; # Change this value to visualize other rows
# Check the class label of the selected row
print(ziptrain27[rowindex, 1]);
# Extract the pixel values and reshape them into a 16x16 matrix
Xval <- t(matrix(data.matrix(ziptrain27[, -1])[rowindex, ], byrow = TRUE, 16, 16)[16:1, ]);
# Display the image of the digit using grayscale
image(Xval, col = gray(0:1), axes = FALSE); # Try "col = gray(0:32/32)" for more detail
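The reshaping step can be checked without plotting. The sketch below assumes the 256 pixel values are stored row by row starting from the top row of the picture, which is what the byrow = TRUE and 16:1 steps imply; the vector `pixels` is a hypothetical stand-in for one row of the data.

```r
# Hypothetical pixel vector: 256 values stored row by row, top row first.
pixels <- 1:256

# Same reshaping as in the report: fill by row, flip the rows, transpose.
M <- matrix(pixels, nrow = 16, ncol = 16, byrow = TRUE)  # M[i, j] = row i, column j of the picture
Xval <- t(M[16:1, ])

# image() draws Xval[x, y] at horizontal position x and vertical position y,
# with y increasing upward, so the top-left pixel must sit at [1, 16].
top_left    <- Xval[1, 16]   # pixel 1   (row 1,  column 1)
top_right   <- Xval[16, 16]  # pixel 16  (row 1,  column 16)
bottom_left <- Xval[1, 1]    # pixel 241 (row 16, column 1)
print(c(top_left, top_right, bottom_left))
```

Without the row flip and the transpose, image() would draw the digit rotated, because it treats the first matrix index as the horizontal axis.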
Build the classification rule by using the training data “ziptrain27” with the following methods: (i) linear regression; and (ii) the KNN with k = 1, 3, 5, 7, 9, 11, 13, and 15. Find the training errors of each choice.
Consider the provided testing data set, and derive the testing errors of each classification rule in (3).
Cross-Validation. The above steps are sufficient in many machine learning or data mining problems when both the training and testing data sets are very large. However, for relatively small data sets, one may want to go further and assess the robustness of each approach. One general approach is the Monte Carlo cross-validation algorithm, which splits the observed data points into training and testing subsets and repeats the above computation B times (say, B = 100). In the context of this homework, we can combine the n1 = 1376 training observations and the n2 = 345 testing observations into a larger data set. Then, for each loop b = 1, ..., B, we randomly select n1 = 1376 observations as a new training subset and use the remaining n2 = 345 observations as the new testing subset. Within each loop, we first build the different models from “the training data of that specific loop” and then evaluate their performance on “the corresponding testing data.” Therefore, for each model or method in part (3), we obtain B values of the testing error on B different testing subsets, denoted by TE_b for b = 1, 2, ..., B. The “average” performance of each model can then be summarized by the sample mean and sample variance of these B values:
\[ \overline{TE}^* = \frac{1}{B} \sum_{b=1}^{B} TE_b \quad \text{and} \quad \widehat{\text{Var}}(TE) = \frac{1}{B-1} \sum_{b=1}^{B} \left(TE_b - \overline{TE}^*\right)^2. \]
Compute and compare the “average” performance of each model or method mentioned in part (2). In particular, based on your results, write some paragraphs that briefly summarize what you discover in the cross-validation, including the “optimal” choice of the tuning parameter k in the KNN method and how confident you are in the usefulness of that choice in real-world applications.
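As a small check of the two summary formulas, the sketch below computes them by hand on four made-up testing errors (the values in `TE` are hypothetical, not results from the zipcode data) and compares them with R's built-in mean() and var().

```r
# Hypothetical testing errors from B = 4 Monte Carlo loops.
TE <- c(0.010, 0.020, 0.015, 0.035)
B <- length(TE)

# Sample mean and sample variance written out as in the formulas.
TE_bar <- sum(TE) / B
TE_var <- sum((TE - TE_bar)^2) / (B - 1)

# The built-in functions use the same definitions (var() divides by B - 1).
print(c(TE_bar, mean(TE)))
print(c(TE_var, var(TE)))
```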
mod1 <- lm( V1 ~ . , data= ziptrain27);
pred1.train <- predict.lm(mod1, ziptrain27[,-1]);
y1pred.train <- 2 + 5*(pred1.train >= 4.5);
mean( y1pred.train != ziptrain27[,1]);
## [1] 0.0007267442
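The rule 2 + 5*(pred >= 4.5) works because 4.5 is the midpoint of the two labels: fitted values below it are assigned to class 2 and the rest to class 7. A minimal sketch with made-up fitted values (`pred` and `truth` are hypothetical):

```r
# Hypothetical fitted values from a regression of the labels (2 or 7) on the pixels.
pred  <- c(1.8, 4.4, 4.5, 6.9, 7.3)
truth <- c(2, 2, 7, 7, 2)

# (pred >= 4.5) is FALSE/TRUE, i.e. 0/1, so the result is 2 + 0 = 2 or 2 + 5 = 7.
yhat <- 2 + 5 * (pred >= 4.5)
print(yhat)

# Misclassification rate: share of predicted labels that differ from the truth.
err <- mean(yhat != truth)
print(err)
```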
## KNN
library(class);
kk <- 1;
xnew <- ziptrain27[,-1];
ypred2.train <- knn(ziptrain27[,-1], xnew, ziptrain27[,1], k=kk);
mean( ypred2.train != ziptrain27[,1])
## [1] 0
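The KNN call above uses k = 1 only. One way to cover all eight values of k is a small helper that loops over them. The sketch below runs on simulated two-class data so that it is self-contained; the helper name `knn_errors` and the data frame `toy` are illustrative assumptions, not part of the original analysis.

```r
library(class)  # provides knn(); a recommended package shipped with standard R

set.seed(1)
# Toy two-class data standing in for ziptrain27: label in column 1, two features.
n <- 60
toy <- data.frame(V1 = rep(c(2, 7), each = n / 2),
                  V2 = c(rnorm(n / 2, -1), rnorm(n / 2, 1)),
                  V3 = c(rnorm(n / 2, -1), rnorm(n / 2, 1)))

# Hypothetical helper: misclassification rate of KNN for each k in `ks`,
# trained on `train` and evaluated on `newdata` (label in column 1 of both).
knn_errors <- function(train, newdata, ks) {
  sapply(ks, function(k) {
    pred <- knn(train[, -1], newdata[, -1], cl = train[, 1], k = k)
    mean(pred != newdata[, 1])
  })
}

ks <- c(1, 3, 5, 7, 9, 11, 13, 15)
train_err <- knn_errors(toy, toy, ks)  # training error: evaluate on the training set itself
names(train_err) <- paste0("KNN", ks)
print(train_err)
```

With k = 1 every training point is its own nearest neighbor, so the training error is 0 whenever no two points with different labels coincide. Passing a held-out set as `newdata` gives testing errors from the same helper.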
ziptest <- read.table(file="C:/Users/aryal/OneDrive - Georgia Institute of Technology/Georgia_tech_courses/ISYE_7406/HW1/zip.test.csv", sep = ",");
ziptest27 <- subset(ziptest, ziptest[,1]==2 | ziptest[,1]==7);
dim(ziptest27) ##345 257
## [1] 345 257
xnew2 <- ziptest27[,-1]; ## xnew2 is the X variables of the "testing" data
kk <- 1; ## below we use the training data "ziptrain27" to predict xnew2 via KNN
ypred2.test <- knn(ziptrain27[,-1], xnew2, ziptrain27[,1], k=kk);
mean( ypred2.test != ziptest27[,1]) ## Here "ziptest27[,1]" is the Y response of the "testing" data
## [1] 0.0173913
zip27full = rbind(ziptrain27, ziptest27) ### combine to a full data set
n1 = 1376; # training set sample size
n2= 345; # testing set sample size
n = dim(zip27full)[1]; ## the total sample size
set.seed(7406); ### set the seed for randomization
### Initialize the TE values for all models in all $B=100$ loops
B= 100; ### number of loops
TEALL = NULL; ### Final TE values
for (b in 1:B){
### randomly select n1 observations as a new training subset in each loop
flag <- sort(sample(1:n, n1));
zip27traintemp <- zip27full[flag,]; ## temp training set for CV
zip27testtemp <- zip27full[-flag,]; ## temp testing set for CV
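### fit each model on the temporary training set and compute its testing error
modtemp <- lm(V1 ~ ., data = zip27traintemp);
predtemp <- 2 + 5*(predict(modtemp, zip27testtemp[,-1]) >= 4.5);
te0 <- mean(predtemp != zip27testtemp[,1]); ## linear regression
teknn <- sapply(c(1,3,5,7,9,11,13,15), function(k)
mean(knn(zip27traintemp[,-1], zip27testtemp[,-1], zip27traintemp[,1], k=k) != zip27testtemp[,1])); ## KNN, one error per k
TEALL = rbind(TEALL, c(te0, teknn));
}
### Of course, you can also get the training errors if you want
dim(TEALL); ### This should be a B x 9 matrix
### if you want, you can change the column names of TEALL
colnames(TEALL) <- c("linearRegression", "KNN1", "KNN3", "KNN5", "KNN7", "KNN9", "KNN11", "KNN13", "KNN15");
## You can report the sample mean/variances of the testing errors so as to compare these models
apply(TEALL, 2, mean);
apply(TEALL, 2, var);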
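Once the matrix of testing errors is available, the “optimal” method can be read off from the column means. The sketch below uses a simulated B x 3 matrix so that it runs on its own; the numbers are random placeholders, not results from the zipcode data, and the standard error is only a rough guide because the Monte Carlo splits overlap.

```r
set.seed(2)
# Placeholder B x 3 matrix of testing errors (simulated, NOT results from the zipcode data).
B <- 100
TE <- cbind(linearRegression = rbinom(B, 345, 0.03) / 345,
            KNN1 = rbinom(B, 345, 0.015) / 345,
            KNN3 = rbinom(B, 345, 0.02) / 345)

te_mean <- colMeans(TE)          # sample mean of TE_b for each method
te_var  <- apply(TE, 2, var)     # sample variance of TE_b for each method
te_se   <- sqrt(te_var / B)      # rough standard error of each mean over the B loops

print(round(rbind(mean = te_mean, var = te_var, se = te_se), 5))

# Method with the smallest average testing error.
best <- names(which.min(te_mean))
print(best)
```

If the gap between the two best means is small relative to their standard errors, the choice between them should be treated as tentative.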