# Data downloaded from https://www.kaggle.com/datasets/zalando-research/fashionmnist
# Add file names
train_dat_file = "fashion-mnist_train.csv"
test_dat_file = "fashion-mnist_test.csv"
# Read data
fashion_mnist = read.csv(train_dat_file)
fashion_mnist_te = read.csv(test_dat_file)
# Keep only pullovers (2) and coats (4)
library(dplyr)
pull_and_coats = fashion_mnist %>% filter( label %in% c(2,4))
pull_and_coats_te = fashion_mnist_te %>% filter( label %in% c(2,4))
# Viewing function: draw row k of the pixel data as a 28x28 greyscale image
view_image = function(k, dat = pull_and_coats[,-1], col_map = grey.colors(256)){
  im = dat[k, ]
  # reshape the 784 pixel values into 28x28 and flip vertically so the image is upright
  image(matrix(as.numeric(im), ncol = 28)[, 28:1] / 256, col = col_map)
}
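As a quick sanity check (assuming the CSVs read in correctly), we can draw the first example of each class side by side:
# Show the first pullover (label 2) and the first coat (label 4)
par(mfrow = c(1, 2))
view_image(which(pull_and_coats$label == 2)[1])  # a pullover
view_image(which(pull_and_coats$label == 4)[1])  # a coat
par(mfrow = c(1, 1))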
# Binary response: coat (label 4) = 1, pullover (label 2) = 0
train_response = ifelse(pull_and_coats[,1] == 4, 1, 0)
train_feat = pull_and_coats[,-1]
test_response = ifelse(pull_and_coats_te[,1] == 4, 1, 0)
test_feat = pull_and_coats_te[,-1]
We will classify small images of clothes given as 28x28 grids of greyscale values.
The images come from the Fashion-MNIST dataset - https://www.kaggle.com/datasets/zalando-research/fashionmnist.
Fashion-MNIST is intended as a drop-in replacement for the classic MNIST dataset, often used as the "Hello, World" of machine learning programs for computer vision. The MNIST dataset contains images of handwritten digits (0, 1, 2, etc.) in a format identical to that of the articles of clothing.
Our goal is to build and evaluate classifiers that tell apart two classes: pullovers (label 2) and coats (label 4).
We will compare two methods, each from a different classifier family (e.g. classification trees, Bayesian classifiers, linear discriminants).
# Calculate the Principal Components
PCA_df = prcomp(train_feat, scale = TRUE)
PCA_features_importance = PCA_df$sdev^2 / sum(PCA_df$sdev^2)
# cumulative proportion of variance explained by the first n principal components
pca_var_exp = data.frame("n" = 1:length(cumsum(PCA_features_importance)), "Explained Variance" = cumsum(PCA_features_importance))
pca_var_exp_head_tail= data.frame(pca_var_exp[1:5,1:2],pca_var_exp[35:39,1:2])
knitr::kable(pca_var_exp_head_tail)
| n | Explained.Variance | n.1 | Explained.Variance.1 |
|---|---|---|---|
| 1 | 0.2843090 | 35 | 0.7959197 |
| 2 | 0.3915646 | 36 | 0.7993704 |
| 3 | 0.4593082 | 37 | 0.8027806 |
| 4 | 0.4992370 | 38 | 0.8061014 |
| 5 | 0.5289100 | 39 | 0.8093021 |
# plot (requires ggplot2)
library(ggplot2)
ggplot(data = pca_var_exp, aes(x = n, y = Explained.Variance)) +
  geom_line(size = 1, col = "deepskyblue3") + ggtitle("Explained Variance - the first 39 PCs reach ~80%") +
  # mark PC 39, where the cumulative explained variance passes 0.8
  geom_point(data = pca_var_exp[39, ], aes(x = n, y = Explained.Variance), size = 3, alpha = 0.3, shape = 21, fill = 'darkseagreen2') +
  geom_hline(yintercept = 0.8, col = "darkorchid4", size = 0.75) +
  scale_x_continuous(name = "PC", limits = c(0, 50), breaks = c(0, 10, 20, 30, 39, 50)) +
  scale_y_continuous(name = "Cumulative explained variance", limits = c(0.26, 1), breaks = seq(0, 1, 0.1)) +
  theme(panel.grid.minor = element_blank())
We decided to keep enough principal components to explain roughly 80% of the variance in the data.
Mathematically speaking, PCA performs a linear transformation that maps the original set of features to a new space spanned by the principal components; here we keep the first 39 of them. Reducing the dimension this way also tends to help the classifiers downstream:
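The cutoff can also be computed directly rather than read off the table:
# first PC at which the cumulative explained variance reaches 80%
which(cumsum(PCA_features_importance) >= 0.8)[1]  # 37 by the table above; we keep 39 for a small margin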
#train based on PCA reduction
train_data = as.data.frame(PCA_df$x) %>% select(PC1:PC39)
test_data = as.data.frame(predict(PCA_df, newdata = test_feat)) %>% select(PC1:PC39)
# attach the binary response (coat = 1, pullover = 0) as the label column
train_data$label = train_response ; test_data$label = test_response
# Decision trees
First classifier - Random Forest:
Random forest is a supervised machine learning algorithm that is widely used for classification and regression problems. It builds decision trees on different bootstrap samples of the data and takes their majority vote for classification (or their average for regression).
One of the useful properties of the random forest algorithm is that it handles both continuous variables (as in regression) and categorical variables (as in classification), and it generally performs well on classification problems.
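The forest's out-of-bag (OOB) error also gives a built-in estimate of generalization error without a holdout set. A minimal sketch of how one might inspect it (ntree = 500 is the randomForest package default, not a tuned value):
library(randomForest)
rf_oob = randomForest(as.factor(label) ~ ., data = train_data, ntree = 500)
plot(rf_oob)            # OOB error vs number of trees (overall and per class)
rf_oob$err.rate[500, ]  # final OOB error estimates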
set.seed(123)
library(randomForest)
randomForest_analysis = randomForest(as.factor(label) ~ ., data = train_data)
randomForest_analysis_pred= predict(randomForest_analysis, newdata = test_data)
label_RF_pred= table(as.factor(test_response),randomForest_analysis_pred )
#cat( "The RF model accuracy is", mean( randomForest_analysis_pred== test_response))
#Decision Boundary
Second classifier - Logistic Regression:
Logistic regression is a classification algorithm used when the target variable is categorical. It is most commonly applied when the data has a binary output, i.e. each observation belongs to one of two classes, coded 0 or 1.
#set the data
# Fit the model
regression_labels = glm(label ~ ., data = train_data, family = binomial)
# Make predictions
regression_bin = regression_labels %>% predict(test_data, type = "response")
regression_bin_pred = ifelse(regression_bin > 0.5, 1, 0)
#cat( "The Logistic Regression model accuracy is", mean(regression_bin_pred == test_response))
We now explain our decisions in specifying the model and its hyper-parameters, and in choosing the threshold. The first classification algorithm we chose is logistic regression, as learned in class. Since the problem is to predict whether a picture shows a pullover or a coat, we have a binary prediction problem, which makes logistic regression the most natural and immediate method to pick.
We also know from previous courses that the glm function struggles with very wide data frames (here 14,000 x 785 over the two sets combined). To make glm work better, we first perform PCA and check how many PCs we want to keep.
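To make the size reduction concrete (dimensions of the Fashion-MNIST subset used here):
# PCA shrinks the design matrix from 784 pixel columns to 39 PCs
dim(train_feat)   # 12000 x 784 (raw pixels, train)
dim(train_data)   # 12000 x 40  (39 PCs + the label column)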
Write your own function that calculates:
- the confusion matrix,
- the precision, and
- the recall, assuming that the class pullover (2) is positive.
Calculate the values for both classifiers on the training data and on the test data.
Discuss the results, comparing the methods in terms of test-set performance and magnitude of overfitting.
# Confusion matrix, accuracy, precision and recall computed by hand.
# Rows are the true labels, columns the predicted labels; label 4 (coded 1) is treated as positive.
manual_confusion_matrix = function(prediction, real){
  prediction = as.numeric(as.character(prediction))
  real = as.numeric(as.character(real))
  confusion_matrix = data.frame(matrix(0, 2, 3))
  colnames(confusion_matrix) = c("Label 2", "Label 4", "Error")
  row.names(confusion_matrix) = c("Label 2", "Label 4")
  for (i in 1:length(real)) {
    if (real[i] == prediction[i]) {
      if (real[i] == 1) confusion_matrix[2,2] = confusion_matrix[2,2] + 1
      else confusion_matrix[1,1] = confusion_matrix[1,1] + 1
    } else {
      if (real[i] == 1) confusion_matrix[2,1] = confusion_matrix[2,1] + 1
      else confusion_matrix[1,2] = confusion_matrix[1,2] + 1
    }
  }
  # per-class error rates
  confusion_matrix[1,3] = confusion_matrix[1,2] / sum(confusion_matrix[1,1:2])
  confusion_matrix[2,3] = confusion_matrix[2,1] / sum(confusion_matrix[2,1:2])
  Accuracy = mean(prediction == real)
  Recall = confusion_matrix[2,2] / sum(confusion_matrix[2,1:2])     # TP / (TP + FN)
  Precision = confusion_matrix[2,2] / sum(confusion_matrix[1:2,2])  # TP / (TP + FP)
  return(list("confusion matrix" = confusion_matrix,
              "Accuracy" = round(Accuracy, 4),
              "Precision" = round(Precision, 4),
              "Recall" = Recall))
}
glm_calculates = manual_confusion_matrix(regression_bin_pred, as.factor(test_response))
RF_calculates = manual_confusion_matrix(randomForest_analysis_pred, as.factor(test_response))
knitr::kable(glm_calculates$`confusion matrix`, caption = 'Logistic regression (GLM) confusion matrix')
| | Label 2 | Label 4 | Error |
|---|---|---|---|
| Label 2 | 834 | 166 | 0.166 |
| Label 4 | 99 | 901 | 0.099 |
knitr::kable(RF_calculates$`confusion matrix`, caption = 'Random Forest confusion matrix')
| | Label 2 | Label 4 | Error |
|---|---|---|---|
| Label 2 | 870 | 130 | 0.130 |
| Label 4 | 77 | 923 | 0.077 |
knitr::kable(data.frame("Accuracy" = c(glm_calculates$Accuracy,RF_calculates$Accuracy),
"Recall" = c(glm_calculates$Recall, RF_calculates$Recall), row.names = c("Logistic Regression ", "Random Forest")))
| Accuracy | Recall | |
|---|---|---|
| Logistic Regression | 0.8675 | 0.901 |
| Random Forest | 0.8965 | 0.923 |
A confusion matrix is used for evaluating classification problems; it applies to binary as well as multiclass classification. Here, recall is computed from the predicted labels of a model and the true labels.
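As a sanity check on the hand-rolled function, the same counts can be obtained from base R's table() (rows = actual, columns = predicted, matching the layout above):
# Cross-check the manual confusion matrix against base R
table(actual = test_response, predicted = regression_bin_pred)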
regression_bin_pred_logistic_tr = ifelse(predict(regression_labels, train_data, type = "response") > 0.5, 1, 0)
regression_bin_pred_logistic_te = ifelse(predict(regression_labels, test_data, type = "response") > 0.5, 1, 0)
glm_calculates_train = manual_confusion_matrix(regression_bin_pred_logistic_tr, as.factor(train_data$label))
glm_calculates_test = manual_confusion_matrix(regression_bin_pred_logistic_te, as.factor(test_data$label))
#################################################################################
pred_RF_tr = predict(randomForest_analysis, train_data, type = "response")
pred_RF_test = predict(randomForest_analysis, test_data, type = "response")
RF_calculates_train = manual_confusion_matrix(pred_RF_tr, as.factor(train_data$label))
RF_calculates_test = manual_confusion_matrix(pred_RF_test, as.factor(test_data$label))
knitr::kable(data.frame("Accuracy" = c(round(glm_calculates_train$Accuracy,4), round(glm_calculates_test$Accuracy,4),
round(RF_calculates_train$Accuracy,4),round(RF_calculates_test$Accuracy,4)),
"Recall" = c(round(glm_calculates_train$Recall,4), round(glm_calculates_test$Recall,4),
round(RF_calculates_train$Recall,4), round(RF_calculates_test$Recall,4)),
row.names = c("Logistic Regression train", "Logistic Regression test", "Random Forest train", "Random Forest test")))
| Accuracy | Recall | |
|---|---|---|
| Logistic Regression train | 0.8499 | 0.8705 |
| Logistic Regression test | 0.8675 | 0.9010 |
| Random Forest train | 1.0000 | 1.0000 |
| Random Forest test | 0.8970 | 0.9240 |
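Reading the table: logistic regression actually scores slightly lower on train than on test (0.8499 vs 0.8675), so it shows no sign of overfitting, while random forest fits the training set perfectly (1.0000) and drops to 0.8970 on test, a train-test gap of roughly 0.10, the signature of overfitting. The gaps can be computed directly from the objects above:
# train-test accuracy gap as a simple measure of overfitting
c(glm = glm_calculates_train$Accuracy - glm_calculates_test$Accuracy,
  RF = RF_calculates_train$Accuracy - RF_calculates_test$Accuracy)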
Write your own function that draws a receiver operating characteristic (ROC) curve.
Draw ROC curves on the test data for both classifiers.
Show the point of the original threshold. Briefly discuss the results.
# ROC curve points: sweep the decision threshold and record sensitivity and specificity.
# Each model is fitted once, outside the threshold loop.
my_ROC_function = function(model = "RF"){
  thresh = seq(0, 1, 0.1)
  Spec_Sens = matrix(nrow = length(thresh), ncol = 2)
  if (model == 'glm'){
    log_m = glm(label ~ ., data = train_data, family = binomial)
    prob = predict(log_m, test_data, type = "response")
  } else {
    rf_m = randomForest(as.factor(label) ~ ., data = train_data)
    prob = predict(rf_m, newdata = test_data, type = "prob")[, "1"]  # P(class 1 = coat)
  }
  for (i in 1:length(thresh)) {
    .class = ifelse(prob > thresh[i], 1, 0)
    cm = manual_confusion_matrix(.class, as.factor(test_data$label))$`confusion matrix`
    Spec_Sens[i, 1] = cm[2,2] / (cm[2,2] + cm[2,1])  # sensitivity = TP / (TP + FN)
    Spec_Sens[i, 2] = cm[1,1] / (cm[1,1] + cm[1,2])  # specificity = TN / (TN + FP)
  }
  df_Spec_Sens = as.data.frame(Spec_Sens)
  colnames(df_Spec_Sens) = c("Sensitivity", "Specificity")
  return(df_Spec_Sens)
}
glm_roc = my_ROC_function('glm')
RF_roc = my_ROC_function('RF')
ROC_data = rbind(glm_roc, RF_roc)
ROC_data$model = rep(c('glm', 'RF'), each = 11)
ROC_data$thresh = rep(seq(0, 1, 0.1), times = 2)
# Standard ROC axes: true positive rate vs false positive rate. The diagonal is the
# chance line, and the marked points correspond to the original 0.5 threshold.
ROC_plot <- ggplot(ROC_data, aes(x = 1 - Specificity, y = Sensitivity, color = model, group = model)) +
  geom_line(size = 1) + geom_rug() +
  geom_abline(slope = 1, intercept = 0, color = "seagreen4", size = 0.8) +
  xlab("False positive rate (1 - Specificity)") + ylab("True positive rate (Sensitivity)") +
  ggtitle("Receiver Operating Characteristic (ROC) curves") +
  geom_point(data = subset(ROC_data, thresh == 0.5), colour = "palevioletred4", size = 3, shape = 21, fill = "palevioletred")
ROC_plot + scale_color_manual(values = c("orange", "plum"))
ROC curves are typically used in binary classification to study the output of a classifier. To extend the ROC curve and ROC area to multi-label classification, the output has to be binarized: one ROC curve can be drawn per label, or a single curve can be drawn by treating each element of the label indicator matrix as a binary prediction (micro-averaging).
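Since the curves are built from a small grid of thresholds, an approximate area under the curve (AUC) can be obtained with the trapezoidal rule. This is a sketch over the 11 points computed above, not a replacement for a dedicated ROC package:
# Trapezoidal AUC from the (FPR, TPR) points
auc_trap = function(roc_df){
  fpr = 1 - roc_df$Specificity
  tpr = roc_df$Sensitivity
  o = order(fpr)
  sum(diff(fpr[o]) * (head(tpr[o], -1) + tail(tpr[o], -1)) / 2)
}
auc_trap(glm_roc)
auc_trap(RF_roc)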
For one of your classifiers, display four examples that were classified incorrectly for each class.
# Row names (indices into the original data) of training images that logistic regression got wrong
image_wrong_predictions_LR_train = train_data[which(regression_bin_pred_logistic_tr != train_data$label), ]
LR_train_image = row.names(image_wrong_predictions_LR_train)
# Misclassified coats: the label column (position 40) equals 1
list1 = LR_train_image[train_data[LR_train_image, 40] == 1]
sample1 = sample(1:length(list1), 4)
par(mfrow = c(2, 2))
for (i in 1:4) { view_image(list1[sample1[i]]) }
title("Examples of coats misclassified by Logistic Regression - train data", line = -15, outer = TRUE)
# Misclassified pullovers: the label column equals 0
list2 = LR_train_image[train_data[LR_train_image, 40] == 0]
sample2 = sample(1:length(list2), 4)
par(mfrow = c(2, 2))
for (i in 1:4) { view_image(list2[sample2[i]]) }
title("Examples of pullovers misclassified by Logistic Regression - train data", line = -15, outer = TRUE)
Can you see what made these examples hard for the classifier?
Can you see indications in the classifier of what made these examples hard?
We displayed four sampled pictures of coats and four sampled pictures of pullovers that logistic regression misclassified on the training data, in order to examine why the classifier failed on them.
We think these examples were hard to classify because of intra-class variation, scale variation, viewpoint variation, occlusion, illumination and background clutter; the grey levels can also be hard to tell apart.
Also, logistic regression is intended for binary (two-class) classification problems: it predicts the probability of an instance belonging to the positive class, which is then thresholded into a 0 or 1 classification.
Do you expect each of your fitted classifiers to work well on this image?
Our models, random forest and logistic regression, were trained on a set where the background is white and the garment is dark.
Thus, they treat the dark pixels as the input signal. The given picture should therefore receive a bad prediction: although the human eye sees the shape of a white cloth on a dark background, the model sees an unknown shape on a white background, which is close to neither class 2 nor class 4.
If all of the training data were changed to a black background with white outlines, the models would probably be even more accurate, because there would be fewer grey shades, and the ambiguous grey lines that we found confusing in Q4 would not be so confusing.
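A quick way to probe this intuition is to invert the greyscale of a training image so that background and garment swap, push the inverted image through the same PCA projection, and see what both fitted models predict. This is a sketch reusing the objects defined above; image 1 is just an arbitrary training example.
# Invert one image and re-classify it with both models
inverted = 255 - train_feat[1, ]   # pixel column names are preserved
view_image(1, dat = inverted)      # how the inverted image looks
inv_pcs = as.data.frame(predict(PCA_df, newdata = inverted))[, 1:39]
predict(randomForest_analysis, newdata = inv_pcs)                 # RF class vote
predict(regression_labels, newdata = inv_pcs, type = "response")  # GLM probability of class 1 (coat)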