11/26/2019

Introduction

In this presentation, we explore different approaches to solving the problem of face recognition using supervised learning techniques. As the presentation progresses, we distinguish between four alternative solutions to the problem, each of which builds on one of two well-established methods in the data science literature: Eigenfaces and Fisherfaces. For each of these alternatives, we describe both the structure and intention of the code implemented. Finally, we provide code for a classification system that was trained on all available data and parametrized according to results that we analyze later.

Note that this presentation is the continuation of a previous project. For more details, visit http://rpubs.com/dherrero12/543854.

Index

  • Introduction
  • Data cleaning
  • Data handling: Eigenfaces and Fisherfaces
  • Classification: KNN and SVM
  • Hyperparameter optimization
  • Optimization results
  • Thresholding
  • Thresholding results
  • Function declaration

Data cleaning

We are given a data set containing 480 images of dimension 120x165, each of which belongs to one of 80 distinct individuals. These data obviously cannot be used directly to perform any classification, so a certain degree of data cleaning is necessary.

library(bmp)

# Read one image to obtain the dimensions (120 x 165 x 3)
image <- read.bmp("M-005-01.bmp")
names <- list.files(pattern="bmp")

# One row per image, one column per pixel value (120*165*3 = 59400)
data  <- matrix(0, length(names), prod(dim(image)))
for (i in 1:length(names)) {
  im <- read.bmp(names[i])
  r  <- as.vector(im[,,1])   # red channel
  g  <- as.vector(im[,,2])   # green channel
  b  <- as.vector(im[,,3])   # blue channel
  
  data[i,] <- c(r, g, b)     # concatenate the three channels into one row
}

Data cleaning

Since all of the picture files are uniquely identified, we will use this fact to transform each face image into a 1 x 59400 vector, which is easier to work with. To do so, we split each image into its R, G, and B components and concatenate these into a single vector.
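Since the labels themselves come from the file names, a minimal sketch of how they could be recovered is shown below. The naming convention (gender-individual-picture, as in M-005-01.bmp) is assumed from the example file used above, so it should be checked against the actual data set.

# Assumed naming convention: <gender>-<individual>-<picture>.bmp, e.g. "M-005-01.bmp"
names  <- list.files(pattern="bmp")
labels <- as.numeric(sapply(strsplit(names, "-"), `[`, 2))
head(labels)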

Data preparation

The first reasonable step in implementing our face recognition system is to explore the different routes we could take. As mentioned in the introduction, the literature explores two main methods: Eigenfaces and Fisherfaces. In the following slides we summarize these two methods and their implementations in R code.

Eigenfaces

The eigenfaces approach is a PCA-based technique widely used in face recognition systems. Its main objective is to find a set of features within an image dataset that together characterize the variation among the images. In that way, each image can be represented as a linear combination of these features, the eigenfaces. Creating the system involves the following:

  1. Obtaining a suitable set of images: the training set.
  2. Calculating the eigenfaces from the training set, keeping those with the highest eigenvalues.
  3. Calculating the weights for each known individual by projecting their images onto the newly created eigenspace.

Eigenfaces

Below we describe each of the steps followed in the eigenfaces approach. Notice that these steps are already described in part one of this project, which can be found at http://rpubs.com/dherrero12/543854.

1. Center and scale the data.

scaled <- scale(data, center=TRUE, scale=TRUE)

2. Compute the variance-covariance matrix.

The dispersion matrix cannot be computed directly due to computational constraints. Instead, we calculate Sigma, which shares its non-zero eigenvalues with the original dispersion matrix.

Sigma <- scaled%*%t(scaled) / (nrow(scaled)-1)

Eigenfaces

3. Calculate the non-null eigenvalues of the covariance matrix.

eig   <- eigen(Sigma)
eigenvalues  <- eig$values
eigenvectors <- eig$vectors

4. Choose the number of principal components.

Note that the number of principal components used in the eigenface projection is chosen heuristically here, but it will later be chosen following an optimization argument.

cum.var  <- cumsum(eigenvalues) / sum(eigenvalues)
thres    <- min(which(cum.var > .95))

Eigenfaces

5. Compute the eigenfaces.

Since we need the eigenvectors of the original dispersion matrix, we multiply the transpose of the scaled data by the eigenvectors of Sigma and correct for length.

scaling    <- diag(eigenvalues[1:thres]^(-1/2)) / (sqrt(nrow(scaled)-1))
eigenfaces <- t(scaled)%*%eigenvectors[,1:thres]%*%scaling

Eigenfaces

6. Compute the new dataset.

All that is left to do is compute the new data set by projecting the original data onto the newly found eigenspace.

data.proj <- scaled%*%eigenfaces

Eigenfaces

Once the system has been created, the following steps are taken in order to classify a new face image:

  1. Calculating the weights of the new face by projecting it onto the previously created eigenspace (a minimal sketch follows this list).
  2. Determining whether the image corresponds to any of the individuals in the training set by means of a classifier.
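As a minimal sketch of step 1, assuming new.image is a hypothetical new face already vectorized into a 1 x 59400 matrix exactly like the training data, its eigenspace weights could be computed by reusing the centering and scaling attributes of the training set:

# Hypothetical new image (new.image is not defined in this report), assumed to be
# a 1 x 59400 matrix vectorized the same way as the training data
new.scaled  <- scale(new.image,
                     center=attr(scaled, "scaled:center"),
                     scale=attr(scaled, "scaled:scale"))
new.weights <- new.scaled%*%eigenfaces   # coordinates in the eigenspace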

Fisherfaces

Fisherfaces takes advantage of the fact that the data is labeled. By using label information, it builds a class-specific linear model which proves to be more reliable than a PCA-only solution, which does not account for class variation. Fisherfaces, and in turn FDA, projects the data in directions such that the ratio of between-class scatter to within-class scatter is maximized. The fisherfaces approach builds on top of the eigenfaces solution and then involves the following:

  1. Calculating the maximizing vectors for the given data; that is, the ones that maximize the ratio of between-class scatter to within-class scatter.
  2. Keeping those fishervectors with the highest eigenvalues.
  3. Calculating the weights for each known individual by projecting their images onto the newly created fisherspace.

Fisherfaces

Below we describe each of the steps followed in the fisherfaces approach, starting from the point at which we left off when describing the eigenfaces solution. Refer back, if necessary, to part one of this project.

1. Clean the data.

It is convenient to structure the data in a data frame where each row is identified by its corresponding label. This way, we can exploit the fact that FDA uses class-specific information to compute a more reliable model of the data.

labeled.proj <- data.frame(E=data.proj,
                           label=rep(1:80, each=6))

Fisherfaces

2. Compute the scatter matrices.

We first compute the global mean of the projected data and the mean of each of the 80 classes.

mean  <- colMeans(data.proj)   # global mean
means <- sapply(levels(as.factor(rep(1:80, each=6))), 
                FUN = function(class) {
                  colMeans(labeled.proj[labeled.proj$label==class, 
                                        1:ncol(data.proj)])
                })              # one column of class means per individual

Fisherfaces

Then, we find the between- and within-classes scatter matrices required by the Fisher criterion.

# Within-class scatter: sum of (n_i - 1) times each class covariance matrix
SW <- matrix(0, ncol(data.proj), ncol(data.proj))
for (i in 1:80) {
    SW <- SW + cov(labeled.proj[labeled.proj$label==i, 
                                1:ncol(data.proj)])*
      (table(labeled.proj$label)[1]-1)
}
# Between-class scatter: sum of n_i times the outer product of (class mean - global mean)
SB <- matrix(0, ncol(data.proj), ncol(data.proj))
for (i in 1:80) {
    SB <- SB + (table(labeled.proj$label)[1])*
      (means[,i]-mean)%*%t(means[,i]-mean)
}

Fisherfaces

3. Find the fisher projection.

Solving the Fisher criterion amounts to finding the directions of projection which maximize the separation between classes while minimizing the variance inside each class. For that, we only need to solve for the eigenvalues and eigenvectors of the matrix below.

# Regularize SW slightly so that it is invertible, then solve the Fisher eigenproblem
eig <- eigen(solve(SW+diag(400-1)*0.0001)%*%SB)
eigenvectors <- eig$vectors[,1:(80-1)]   # at most c-1 = 79 useful directions
eigenvalues  <- as.numeric(eig$values)

Notice that it was necessary to force the matrix to be non-singular by adding a small regularization term, as we are only interested in the non-zero eigenvalues of the Fisher projection, that is, the first 79 of them.

Fisherfaces

Also, after computing the eigenvalues in the eigenfaces step, we keep only the first 399 eigenvectors and omit the heuristic selection of a certain number of them. The code for this step is implemented "under the hood" so as to keep the presentation readable.

4. Choose the number of Fisher vectors.

Note that the number of vectors to use in the Fisher projection is chosen heuristically here, but it will later be chosen following an optimization argument.

cum.var <- cumsum(eigenvalues) / sum(eigenvalues)
lda.thres <- min(which(cum.var >= 0.95))

Fisherfaces

5. Compute the new dataset.

All that is left is to project our PCA-reduced data in the directions where the ratio of between-class scatter to within-class scatter is maximized, which we just found in the previous slides.

data.lda <- apply(data.proj%*%eigenvectors[,1:lda.thres], 2, as.numeric)

Fisherfaces

Once the system has been created, the following steps are taken in order to classify a new face image:

  1. Calculating the weights of the new face by projecting it into the previously created fisherspace.
  2. Determining whether the image corresponds to any of the individuals in the training set by means of a classifier.

Classification

Once we have explored the different methods with which we can prepare the given data, we are set to find and apply a classification algorithm so that our face recognition system can function. That is, the next logical step in our solution is exploring different classification methods to be used by our system when making recognition decisions. In this presentation, only two classifiers are presented: k-nearest neighbours (KNN) and support vector machines (SVM), both of which are well established among data science practitioners. It is important to note, however, that other valid approaches exist, which may or may not be based on KNN or SVM.

KNN

KNN is a supervised learning classifier, that is, one that relies on data labels to learn a function that produces a desired output. In particular, KNN assumes that similar data points lie in close proximity, and applies this assumption to deduce in which region of similarity new data lie. The algorithm follows a series of steps when deciding the identity of new data:

  1. Initializing a number k of neighbours.
  2. Calculating the distances between our labeled and unlabeled data.
  3. Ordering the resulting distances in ascending order, then identifying the k closest data points to the new data.
  4. Calculating the mode identity of said neighbours.

KNN

Below we define a function that implements every step required by a KNN classifier. Note that this function takes as arguments a partition of the data (train.proj and valid.proj, respectively) along with its corresponding labels, as well as values for k and the distance method used by the algorithm. The function then returns the accuracy of the algorithm, assuming that the valid.proj set acts as unlabeled data.

knn <- function(train.proj, valid.proj, type, k, train.labels, valid.labels) {
  temp <- rep(0, nrow(train.proj))   # distances from one validation image to every training image
  resp <- rep(0, nrow(valid.proj))   # predicted label for each validation image
  for (i in 1:nrow(valid.proj)) {
    for (j in 1:nrow(train.proj)) {
      temp[j] <- distance(valid.proj[i,], train.proj[j,], type)
    }
    ...

KNN

    ...
    k.neig  <- train.labels[order(temp)[1:k]]
    resp[i] <- names(which.max(table(k.neig)))   # mode of the k nearest labels
  }
  acc <- sum(resp == valid.labels)   # correct classifications on the validation set
  return(acc)
}

This is in fact the form in which we want our function defined, since it will only be used in the parameter optimization step when finding the maximizing values for k and the distance method.
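The distance() helper called inside knn() is provided in the attached R file rather than in these slides. As a hedged sketch, one possible implementation covering a Euclidean option and an angle option is shown below; writing the angle distance as the negative cosine similarity is an assumption made here, chosen so that smaller values mean more similar faces, which is consistent with the negative threshold value reported later.

# Sketch of a distance helper; the "angle" convention (negative cosine similarity)
# is an assumption, not necessarily the original implementation
distance <- function(x, y, type="euclidean") {
  if (type == "euclidean") {
    sqrt(sum((x - y)^2))
  } else if (type == "angle") {
    -sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))
  } else {
    stop("unknown distance type")
  }
}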

SVM

SVM takes labeled data points and outputs a hyperplane that best separates the classes, in the sense that it maximizes the margin between them, assuming that the data are linearly separable. When the data are not linearly separable, a kernel trick is applied which projects them into a higher dimension where separability may be achieved. Refer to the following link if you wish to get a better understanding of SVM's specifics: https://www.datacamp.com/community/tutorials/support-vector-machines-r.

The function svm() from the e1071 package serves our classification purposes, since it will only be used in the parameter optimization step when finding optimal values for gamma and cost, which control the behavior of the kernel applied to our data and the penalty of the separating hyperplane, respectively.

SVM

library(e1071)

svmachines <- function(train.proj, valid.proj, gmma, cst, train.labels, valid.labels) {
  labeled.train <- data.frame(E=train.proj,
                              label=as.factor(train.labels))
  # The validation data must carry the same column names as the training frame
  labeled.valid <- data.frame(E=valid.proj)
  
  classifier <- svm(label~., data=labeled.train, type='C-classification',
                    kernel="radial",
                    gamma=gmma, cost=cst,
                    probability=TRUE)
  classification <- predict(classifier, labeled.valid, probability=TRUE)
    
  acc <- sum(diag(table(classification, valid.labels)))
  return(acc)
}

Above we define a function that implements an SVM-based classifier. Note that this function takes as arguments a partition of the data (train.proj and valid.proj, respectively) along with its corresponding labels, as well as values for gamma and cost used by the algorithm.

SVM

The function then returns the accuracy of the algorithm, assuming that the valid.proj set acts as unlabeled data. Note that only one of the two classifiers presented in this report will be implemented in the thresholding step, which aims to add provisions to our system so that it can recognize impostor images.

Hyperparameter optimization

Any combination of data preparation algorithm and classifier could, in principle, be valid for our recognition purposes. That is, any of the four possible combinations might produce an acceptable level of accuracy when classifying our images.

  • Eigenfaces + KNN
  • Fisherfaces + KNN
  • Eigenfaces + SVM
  • Fisherfaces + SVM

However, we are interested in finding the best-performing combination. For that, we need to evaluate to some degree the performance of the four different approaches. Yet we face another problem: each of the combinations depends on parameters which directly affect performance.

Hyperparameter optimization

In other words, certain choices of these parameters might cause the system to underperform. Fortunately, we can tackle both problems at once by applying what is known among data scientists as hyperparameter optimization.

Choosing the parameter values of our classification system heuristically is not statistically convenient for several reasons, the first being that a heuristic choice does not translate across the different distance types, since these have different magnitudes. Also, if we train our system without testing it on held-out faces, we are unable to detect misclassifications of unseen images, which can hide a worse real accuracy (the overfitting problem).

Hyperparameter optimization

A much better approach for selecting method parameters is hyperparameter optimization, which has a robust statistical background. Several steps are taken:

  1. Split the data into training and validation sets, so as to avoid overfitting.
  2. Train the model.
  3. Test the model on a sequence of values for each of the parameters to be tuned, and select the combination which yields the highest classification accuracy.

Hyperparameter optimization

In order to optimize the different method parameters, we first need to decide on a cross-validation algorithm, so as to provide more statistical robustness to the selection process. K-fold cross-validation is the selected option in this presentation, since it generally gives the most accurate estimates of test error when compared against other validation methods. However, other approaches are also valid.

In the following slides, we explore a pseudo-implementation of hyperparameter optimization which can be used to build the different chunks of code for the four combinations. The actual code that we used to find the maximizing parameters for each of these combinations is provided in an R file attached to this report.

Hyperparameter optimization

1. Split the data into train and test.

The data is conveniently split into training and validation sets according to a certain index, which in practice is defined before each fold is run, but which is not explicitly specified here. It is common practice to allocate 80% of all available data to the training set, while placing the rest in the validation set.

  train.scaled <- scale(data[-idx,], center=TRUE, scale=TRUE)
  valid.scaled <- scale(data[idx,],
                        center=attr(train.scaled, "scaled:center"),
                        scale=attr(train.scaled, "scaled:scale"))
  train.labels <- rep(1:80, each=6)[-idx]
  valid.labels <- rep(1:80, each=6)[idx]

Hyperparameter optimization

2. Prepare the data.

The next step is to prepare our data. We can either follow an eigenfaces argument or a fisherfaces argument.
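As a sketch of the eigenfaces route inside one fold (the fisherfaces route then continues from train.proj exactly as described in the fisherfaces slides), the projection is computed from the training partition only and then applied to both partitions. Here v stands for the variance fraction currently being evaluated (var[i] in the blueprint's loop in the next step); this only restates the steps shown earlier, so treat it as illustrative rather than the attached implementation.

# Eigenfaces preparation inside one fold (sketch); v is the variance fraction
# currently being evaluated (var[i] in the blueprint's optimization loop)
Sigma      <- train.scaled%*%t(train.scaled) / (nrow(train.scaled)-1)
eig        <- eigen(Sigma)
cum.var    <- cumsum(eig$values) / sum(eig$values)
thres      <- min(which(cum.var >= v))
scaling    <- diag(eig$values[1:thres]^(-1/2)) / sqrt(nrow(train.scaled)-1)
eigenfaces <- t(train.scaled)%*%eig$vectors[,1:thres]%*%scaling
train.proj <- train.scaled%*%eigenfaces
valid.proj <- valid.scaled%*%eigenfaces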

3. Classify according to the different parameter combinations.

We first create a data structure able to hold the accuracy score for each parameter combination and then optimize by looping through all possible combinations for a given data split.

accuracy <- array(0, dim=c(length(var), length(par1), length(par2)),
                  dimnames=list(var, par1, par2))

Hyperparameter optimization

  # Requires the parallel, doParallel and foreach packages
  for (i in 1:length(var)) {
    <variance split>
    
    cl <- makeCluster(detectCores()-1)
    registerDoParallel(cl)
    # Rows of the resulting matrix follow par1, columns follow par2
    accuracy[i,,] <- foreach(par1=iter(par1), .combine=rbind) %:% 
                       foreach(par2=iter(par2), .combine=c) %dopar% {

                          classifier(train.proj, valid.proj, 
                                     par1, par2, 
                                     train.labels, valid.labels)
                       }
    stopCluster(cl)
  }

Note that, in practice, we have used parallel programming packages as shown above to achieve shorter computation times.

Hyperparameter optimization

Let us make a few remarks regarding the hyperparameter optimization of our model. First, there was only one parameter relating to the data preparation methods: the percentage of variance that we wanted to keep when selecting eigenvectors in either PCA or LDA. Notice that this parameter is shared by the two methods. Thus, we had to include the eigenvector selection algorithm (the variance split in the blueprint) inside the first optimization loop.

The remaining parameters (par1 and par2 in the blueprint) concern the classification method in use, and nothing else. We have already mentioned these optimization parameters in previous slides. However, we will point out an important distinction:

Hyperparameter optimization

  • KNN depends on the number of neighbours selected as well as the distance method employed in its implementation, and these should be taken as the values for par1 and par2 when applying the code blueprint.
  • SVM depends only on the values of gamma (kernel-related) and cost (hyperplane-related), and these should be taken as the values for par1 and par2 when applying the code blueprint.

In no way should these classifier-specific parameters be confused or mixed up, since they do not exist outside the scope of the classification algorithm they belong to.

Hyperparameter optimization

We are ready to start the optimization process. Let us define the work done by the optimization algorithm in one loop so as to understand fully how the best parameters are selected. Below is a blueprint of what the structure of the code should look like.

1. Choose number of folds.

In this case, we will build six folds, one for each of the six main gestures that every individual in the picture database displays. In each fold we validate on one of the gestures, so as to obtain a system capable of recognizing different facial expressions.

n.fold <- 6

Hyperparameter optimization

2. Define the parameters to optimize.

Below we show what our parameter choice for SVM optimization looks like. There is no easy way to guess which combination of parameters will yield the best accuracy. However, since the svm() function provided by the e1071 package guesses these parameters' values when they are not specified, one can get an approximation of the scale of the maximizing parameters.

gamma <- seq(0.005, 0.01, length.out=20); cost <- 1:5; 
variance <- seq(0.80, 1, 0.02)
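For the KNN-based combinations, a comparable grid might look like the following. The variable names and the exact distance labels are assumptions made for illustration; the labels must match whatever the distance() helper accepts.

# Possible KNN grid (assumed names; distance labels must match the distance() helper)
neighbours <- 1:10
methods    <- c("euclidean", "angle")
variance   <- seq(0.80, 1, 0.02)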

Hyperparameter optimization

3. Optimize for all combinations.

Appropriate data structures are prepared to store the accuracies obtained by the classifier in each of the six validation folds.

acc <- array(0, dim=c(length(variance), length(par1), length(par2), n.fold),
             dimnames=list(variance, par1, par2, seq(1,n.fold)))
for (i in 1:n.fold) {
  ind.valid <- seq(i, 480, 6)   # hold out the i-th gesture of every individual
  acc[,,,i] <- optimize(data=data, ind.valid=ind.valid, var=variance, 
                        par1=par1, par2=par2)
}

At the end of each fold, the corresponding accuracies are calculated and stored in the corresponding slots.
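One way to summarize the folds and read off the maximizing combination is to average the accuracies over the fourth dimension and locate the maximum; this is only a sketch, since the attached R file may aggregate the folds differently.

# Average accuracy over the folds and locate the best parameter combination
mean.acc <- apply(acc, c(1, 2, 3), mean)
best     <- which(mean.acc == max(mean.acc), arr.ind=TRUE)[1,]
variance[best[1]]; par1[best[2]]; par2[best[3]]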

Hyperparameter results

Let us plot our results so as to obtain further insight into which is the best combination of parameters for each pairing of data preparation method and classifier.

Hyperparameter results

Eigenfaces + KNN

Hyperparameter results

Fisherfaces + KNN

Hyperparameter results

Eigenfaces + SVM

Hyperparameter results

Fisherfaces + SVM

Hyperparameter results

As can be noted from the previous plots, the fisherfaces approach yields high and consistent accuracies, while the eigenfaces argument generally produces lower accuracy scores. Thus, the better projection is the one achieved by the Fisher criterion. Now the question is whether to select KNN or SVM as our system's classifier.

The results obtained by both classifiers are outstanding. However, it can be noted from the plots that KNN yields consistently higher scores than SVM, so that is the final combination that we will employ in the last step of our face recognition system's creation: thresholding. For later reference, the combination chosen in the rest of the code was:

  • Number of neighbours: 1
  • Distance method: angle

Thresholding

All that is left to do before defining the final face recognition system is to provide statistical provisions to our model so that it is capable of identifying which images belong to impostors and which belong to labeled individuals. For that, we must apply a thresholding technique which yields a value used by the classifier to detect disparities between individuals recognized by the system and impostors.

Upon recognition, many biometric systems score the "impostor level" of each image by comparing it to all other images known to the system. If this score surpasses some threshold level, then the individual to whom that image belongs is labeled as an impostor. We will follow a similar argument: an image will not be recognized by the system if the minimum distance between it and the ones contained in the system's database is above some threshold value.

Thresholding

Our choice of thresholding technique is based on the concept of equalizing the error rates of classification. That is, we are going to select, from a set of candidate values, the threshold which minimizes the difference between the false acceptance rate (FAR), the ratio of impostor images falsely recognized by the system, and the false rejection rate (FRR), the ratio of genuine images falsely labeled as impostors by our system. This method is closely related to hyperparameter optimization, but it presents some notable differences:

  1. The method employs 40-fold cross-validation, allocating to the validation set of each fold the images belonging to one male and one female.
  2. The classifier no longer returns an accuracy score, but rather the FAR and FRR ratios for each of the 40 folds, and it is now able to classify some of the images as impostors, as sketched below.
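To make these two ratios concrete, here is a minimal sketch of how FAR and FRR could be computed from the classifier's responses, assuming resp holds the labels predicted for a validation fold (with 0 meaning impostor) and that impostor images are marked with label 0 in valid.labels, as in the split shown later.

# Sketch: FAR and FRR from predicted labels (resp) and true labels,
# with impostor images marked as 0 in valid.labels
impostor <- valid.labels == 0
FAR <- sum(resp[impostor]  != 0) / sum(impostor)    # impostors falsely accepted
FRR <- sum(resp[!impostor] == 0) / sum(!impostor)   # genuine faces falsely rejected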

Thresholding

Next we provide a blueprint for a valid thresholding function. You can refer to the code attached to this file if you wish to learn about our own implementation of the thresholding step. Note that the majority of the steps were explained in previous slides.

1. Split the data into train and test.

train.scaled <- scale(data[-ind.valid,], center=TRUE, scale=TRUE)
valid.scaled <- scale(data[ind.valid,],
                      center=attr(train.scaled, "scaled:center"),
                      scale=attr(train.scaled, "scaled:scale"))
  
train.labels <- rep(1:80, each=6)[-ind.valid]
valid.labels <- rep(1:80, each=6)[ind.valid]
valid.labels[1:12] <- 0   # the first 12 validation images are treated as impostors

Thresholding

2. Prepare the data.

The next step is to prepare our data. We apply a fisherfaces argument up to the point where the algorithm selects the eigenvectors according to the variance parameter.

3. Classify according to the best combination.

We first create a data structure able to hold the FAR and FRR scores for each threshold value and then optimize by looping through these values, classifying according to the maximizing parameters found in the hyperparameter optimization step.

Thresholding

# Candidate threshold values; knn() here is the extended classifier that returns
# a data frame with the FAR and FRR ratios for a given threshold
sequence <- round(seq(min, max, length.out=200), 4)
  
cl <- makeCluster(detectCores()-1)
registerDoParallel(cl)
  
acceptance <- foreach(thr=iter(sequence), .combine=rbind) %dopar% {
    ROC <- knn(train.lda, valid.lda, method, neighbour, 
               train.labels, valid.labels, thr=thr)
    ROC$threshold <- thr
    ROC   # return the FAR/FRR row together with its threshold
}
stopCluster(cl)

Note that the implementation of impostor detection lives within the classifier itself. Also, the min and max values define the range of threshold values for which the rates are computed in each of the folds.
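How min and max are chosen is left to the attached code. One plausible choice, under the angle distance convention assumed earlier, is to bound the search by the spread of nearest-neighbour distances observed among the training projections; the sketch below is an assumption, not the original implementation.

# Possible (assumed) way to bound the threshold search range: the spread of
# nearest-neighbour angle distances among the training projections
nn.dist <- sapply(1:nrow(train.lda), function(i) {
  d <- sapply((1:nrow(train.lda))[-i],
              function(j) distance(train.lda[i,], train.lda[j,], "angle"))
  min(d)
})
min <- min(nn.dist)   # these names intentionally match the blueprint above,
max <- max(nn.dist)   # even though they shadow the base min()/max() functions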

Thresholding results

After running the 40-fold optimization procedure, we notice fair separability between the recognition distances for impostors and those for labeled individuals. Thus, we can choose a distance threshold for which this separability is maximized. In other words, we choose the threshold value corresponding to the equal error rate point. By plotting the results, we can see that such an equilibrium is achieved when the threshold takes the value -0.9116.
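A sketch of how the equal error rate point could be read off the aggregated results, assuming acceptance holds the FAR, FRR and threshold columns produced in the previous step (averaged over the 40 folds), is:

# Equal error rate: the threshold where FAR and FRR are closest to each other
eer.index     <- which.min(abs(acceptance$FAR - acceptance$FRR))
eer.threshold <- acceptance$threshold[eer.index]
eer.threshold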

Thresholding results

Function definition

Finally, we design our final classifier by choosing the optimal parameters we just found; that is, the optimal values for the threshold, the variance, the number of neighbours and the distance method. For that, it is best to compute the projection of all our data, so as to train our classifier system on as much data as possible. We just have to prepare the whole data set following the Fisher approach and feed these data, along with the other required parameters, into our final function. The function will then project new input data into this fisherspace and classify new input images according to whether or not the individual represented by each image belongs to the system's database. The function returns the corresponding 1 to 80 label if the new image belongs to any of the face classes from the training set, or 0 if this is not the case.

Function definition

We apply our fisherfaces approach to all the data, so that we can pass it to the optimal classifier. Note that the code for most of this process is implemented "under the hood", since its inclusion would introduce redundancy.

write.csv(x=attr(scaled, "scaled:center"), file="LDA_KNN_mean.csv")
write.csv(x=attr(scaled, "scaled:scale"), file="LDA_KNN_scale.csv")
write.csv(x=eigenfaces, file="LDA_KNN_eigenfaces.csv")
write.csv(x=eigenvectors, file="LDA_KNN_fisherfaces.csv")
write.csv(x=train.lda, file="LDA_KNN_train.csv")

Function definition

The functionality of the function in the next slide, representing our final system, is explained in previous sections of this report. The function itself, along with the appropriate parameters, is provided in a separate R file. Please be aware that one needs to run the distance function blueprint found in the same document in order for the classifier to produce results.

Function definition

recognize <- function(test, train, mean, scale, eigenfaces, fisherfaces) {
  
  # Project the new image(s) onto the eigenspace and then onto the fisherspace
  test.proj <- scale(test, mean, scale)%*%eigenfaces
  test.lda  <- test.proj%*%fisherfaces
  train.labels <- rep(1:80, each=6)
  
  temp <- rep(0, nrow(train))
  resp <- rep(0, nrow(test.lda))
  for (i in 1:nrow(test.lda)) {
    for (j in 1:nrow(train)) {
      temp[j] <- distance(test.lda[i,], train[j,], "angle")
    }
    k.neig  <- train.labels[order(temp)[1]]   # label of the nearest neighbour (k = 1)
    if (temp[order(temp)][1] > -0.9116) {
      resp[i] <- 0                            # too far from every known face: impostor
    } else {
      resp[i] <- as.numeric(names(which.max(table(k.neig))))
    }
  }
  return(resp)
}
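As a usage sketch, a new image could be classified by loading the stored parameters and vectorizing the picture exactly as in the data cleaning step. The file names follow the write.csv calls above, but the way the CSVs are read back and the test file name are assumptions made for illustration.

# Usage sketch: load the stored parameters (written above) and classify one image
library(bmp)
mean.vec    <- read.csv("LDA_KNN_mean.csv")[,2]
scale.vec   <- read.csv("LDA_KNN_scale.csv")[,2]
eigenfaces  <- as.matrix(read.csv("LDA_KNN_eigenfaces.csv")[,-1])
fisherfaces <- as.matrix(read.csv("LDA_KNN_fisherfaces.csv")[,-1])
train.lda   <- as.matrix(read.csv("LDA_KNN_train.csv")[,-1])

im   <- read.bmp("new-face.bmp")   # hypothetical test image
test <- matrix(c(as.vector(im[,,1]), as.vector(im[,,2]), as.vector(im[,,3])),
               nrow=1)
recognize(test, train.lda, mean.vec, scale.vec, eigenfaces, fisherfaces)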