Introduction

In this report, we explain in thorough detail the process we followed to create a face recognition system based on an eigenfaces approach. We explore the challenges we faced in the creation of the system, as well as the different solutions we implemented to tackle these problems. At the end of the report, we provide the code for a classifier trained on the available data that serves our face recognition objective.

Eigenfaces

The eigenfaces approach is a PCA-based technique widely used in face recognition systems. Its main objective is to find a set of features within an image dataset that together characterize the variation among the images. In that way, each image can be represented as a linear combination of these features, the eigenfaces. The approach to the system’s creation involves the following:

  1. Obtain a suitable set of images: the training set.
  2. Calculate the eigenfaces from the training set, keeping those with the highest eigenvalues.
  3. Calculate the weights for each known individual by projecting their images onto the newly created eigenspace.

Once the system has been created, the following steps are taken in order to classify a new face image:

  1. Calculate the weights of the new face by projecting it onto the previously created eigenspace (sketched below).
  2. Determine whether the image corresponds to any of the individuals in the training set.
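As a brief sketch of the algebra involved (the notation is ours, and the per-pixel scaling applied later in the code is omitted for clarity): writing $\bar{x}$ for the mean face and $u_1, \dots, u_K$ for the eigenfaces, the projection in step 1 computes the weights

$$ w_k = u_k^{\top}(x - \bar{x}), \qquad k = 1, \dots, K, $$

and the original image can then be approximated by the linear combination

$$ x \approx \bar{x} + \sum_{k=1}^{K} w_k\, u_k. $$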

Preparing the dataset

We were given a data set containing 480 images of dimension 120x165, each of which belonged to one of 80 distinct individuals. This data could not be directly used to perform PCA in its raw image form, and thus a certain degree of data preparation was necessary.

library(bmp)
library(OpenImageR)
image <- read.bmp("M-005-01.bmp")
imageShow(image)
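As a quick sanity check (an addition of ours), the number of values per image matches the vector length used below:

dim(image)        # the two spatial dimensions plus the 3 colour channels
prod(dim(image))  # 120 * 165 * 3 = 59400 values per flattened image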

Since all of the picture files are uniquely identified, we will use this fact to transform each of the face images into a 1 x 59400 vector, which is easier to work with. For that, we split each image into its R, G, and B components and concatenate these into a single vector.

names <- list.files(pattern = "bmp")

# One row per image, one column per pixel value (R, then G, then B).
data <- matrix(0, length(names), prod(dim(image))) 
for (i in 1:length(names)) {
  im <- read.bmp(names[i])
  r  <- as.vector(im[,,1])
  g  <- as.vector(im[,,2])
  b  <- as.vector(im[,,3])
  
  data[i,] <- c(r, g, b)
}

faces <- data.frame(labels = factor(rep(1:80, each = 6)),
                    x = data)

At the end we have a new dataset of 480 observations and (120 x 165 x 3) + 1 = 59401 variables, since we have also included the label indicating the individual to which each image corresponds. This label variable, however, will be of no use when computing the eigenfaces from the training set.

In order to better calculate some of the hyperparameters that will later be used for classification, we will leave some of the individuals’ faces out of the training set, so that they can later be classified as unknown by our face recognition system.

indices <- c(13:18, 79:84, 271:276, 307:312, 421:426, 475:480)
unseen <- faces[indices,]
faces <- faces[-indices,]
labels <- faces$labels
faces$labels <- NULL

Note that the choice of which individuals to exclude from the training set was arbitrary.

Computing eigenfaces

In order to compute the eigenfaces, we proceed as if we were applying PCA to any other dataset. The eigenvectors that we obtain from this analysis are what we refer to as eigenfaces since, when plotted, they have a face-like appearance.

1. Center and scale the data.

The center attribute returned by the scale function is known as the mean face. We want to keep this factor, as well as the scale factor, since they will later be used when projecting new data onto the existing eigenspace.

scaled <- scale(faces, center = TRUE, scale = TRUE)
mean.face <- attr(scaled, "scaled:center")
std.face  <- attr(scaled, "scaled:scale")

2. Calculate the covariance matrix.

The dispersion matrix cannot be computed directly, since its dimension (59,400 x 59,400) makes it computationally intractable. Instead, we calculate Sigma_, a much smaller matrix that contains the same non-null eigenvalues as the original dispersion matrix.
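The reasoning behind this trick is as follows (notation is ours): with $X$ the $n \times p$ matrix of scaled images, the full dispersion matrix $\Sigma = \frac{1}{n-1} X^{\top} X$ has dimension $59{,}400 \times 59{,}400$, whereas $\Sigma' = \frac{1}{n-1} X X^{\top}$ is only $n \times n$. If $v$ is an eigenvector of $\Sigma'$ with eigenvalue $\lambda$, then

$$ \frac{1}{n-1} X X^{\top} v = \lambda v \;\Longrightarrow\; \frac{1}{n-1} X^{\top} X \,(X^{\top} v) = \lambda \,(X^{\top} v), $$

so $X^{\top} v$ is an eigenvector of $\Sigma$ with the same eigenvalue; this is exactly the transformation applied in step 5 below.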

Sigma_ <- scaled%*%t(scaled) / (nrow(scaled)-1)

3. Calculate the non-null eigenvalues of the covariance matrix.

eig          <- eigen(Sigma_)
eigenvalues  <- eig$values
eigenvectors <- eig$vectors

4. Choose the number of principal components.

Note that the number of principal components to use in the eigenface projection is chosen heuristically.

prop.var <- eigenvalues / sum(eigenvalues)
cum.var  <- cumsum(eigenvalues) / sum(eigenvalues)
thres    <- min(which(cum.var > .95))

We can visualize the proportion of variance explained by each of the components by means of both a scree plot and a cumulative scree plot.

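A minimal sketch of these two plots (the original report loaded ggpubr/ggplot2 for this purpose; base graphics are used here to keep the example self-contained):

plot(prop.var, type = "b", xlab = "Principal component",
     ylab = "Proportion of variance explained", main = "Scree plot")
plot(cum.var, type = "b", xlab = "Principal component",
     ylab = "Cumulative proportion of variance", main = "Cumulative scree plot")
abline(h = 0.95, lty = 2)   # the 95% cut-off used to choose thres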

5. Compute the eigenvectors.

Since we need the eigenvectors of the original variance-covariance matrix, we need to apply the appropriate transformation to those obtained from Sigma_. For that, we multiply the latter by the transpose of the scaled data and correct for unit length using the eigenvalues obtained above.
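In symbols (our notation, with $v_k$ and $\lambda_k$ the eigenvectors and eigenvalues of Sigma_), the $k$-th eigenface is

$$ u_k = \frac{X^{\top} v_k}{\sqrt{(n-1)\,\lambda_k}}, $$

which has unit length since $\lVert X^{\top} v_k \rVert^2 = (n-1)\lambda_k$; the code below implements exactly this product.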

scaling    <- diag(eigenvalues[1:thres]^(-1/2)) / (sqrt(nrow(scaled)-1))
eigenfaces <- t(scaled)%*%eigenvectors[,1:thres]%*%scaling

By plotting the resulting eigenvectors, we can better understand the meaning of the term eigenfaces in the existing literature. In this case, only the second eigenface was plotted.

eigenface <- array(eigenfaces[,2], dim(image))
imageShow(eigenface)

6. Compute the new dataset.

All that is left to do is to compute the new dataset by projecting the original data onto the newly found eigenspace.

data.new <- data.frame(labels = labels,
                       x = t(t(eigenfaces)%*%t(scaled)))

Classification

The classification of a new image consists of identifying it as belonging to an individual from the training set or not. The process that the system follows to achieve this classification is, in principle, very intuitive. The procedure can be described as a set of clear steps.

1. Image normalization.

The image is projected onto the eigenspace by a very simple operation. The resulting vector contains the weight that each of the eigenfaces has on the original image.

projection <- data.frame(labels = unseen$labels,
                         x = t(t(eigenfaces)%*%t(scale(unseen[,c(2:ncol(unseen))], 
                                      center = mean.face, 
                                      scale = std.face))))

2. Distance computation.

This vector is then used to find to which of the individuals, if any, the face belongs. For this, we compute the distance between the class average of the existing faces (the average of the projected weights of the faces belonging to the same individual) and the projected weights of the new face.

Let us first create a distance function so as to provide different ways of computing the distance between two observations in the eigenspace. Below, we include three different distance measures: Euclidean (type=1), Mahalanobis (type=2), and Manhattan (type=3).

distance <- function(x1, x2, type = 1) {
  x1 <- matrix(x1, length(x1), 1)
  x2 <- matrix(x2, length(x2), 1)
  
  if (type == 1) {
    # Euclidean distance.
    sqrt(t(x1 - x2)%*%(x1 - x2))
  
  } else if (type == 2) {
    # Simplified Mahalanobis: divide each weight by the square root of its
    # eigenvalue (taken from the enclosing environment).
    m <- x1/sqrt(eigenvalues[1:length(x1)])
    n <- x2/sqrt(eigenvalues[1:length(x1)])
    sqrt(t(m - n)%*%(m - n))
  
  } else if (type == 3) {
    # Manhattan distance.
    sum(abs(x1 - x2))
  }
}

Note that computing the Mahalanobis distance in the eigenspace only requires correcting each of the weights by its corresponding eigenvalue, so as to allow variations along all eigenspace axes to be treated as equally significant.
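In terms of the projected weights, the simplified Mahalanobis distance used here is (our notation)

$$ d_M(w, w') = \sqrt{\sum_{k} \frac{(w_k - w'_k)^2}{\lambda_k}}, $$

which is what the type = 2 branch computes by dividing each weight by $\sqrt{\lambda_k}$ before taking the Euclidean distance.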

We are now ready to compute the error matrix containing the distance from the new data image to each of the face classes.

classes <- aggregate(data.new, list(data.new$labels), mean)   # class (individual) averages
  
class.lb <- classes[,1]
proj.lb  <- projection$labels
classes[,c(1,2)] <- NULL   # drop the grouping and label columns
projection[,1] <- NULL
classes <- as.matrix(classes)

error <- matrix(0, nrow(classes), nrow(projection))
for (j in 1:nrow(projection)) {
  for (i in 1:nrow(classes)) {
    error[i,j] <- distance(unlist(projection[j,]), classes[i,], 2)
  }
}

colnames(error) <- proj.lb
rownames(error) <- class.lb
head(error)
##          3        3        3        3        3        3       14       14
## 1 12.32978 12.79012 12.78186 13.48874 13.26093 13.17320 10.54692 11.30028
## 2 12.62754 13.06006 13.34214 13.54774 13.29694 14.00910 10.84751 12.20698
## 4 13.34726 13.76854 13.40843 14.34450 14.04755 14.04779 11.74519 12.42161
## 5 12.79415 13.14961 13.01197 13.78932 13.31677 13.26919 10.12553 10.91052
## 6 12.57068 12.31662 13.34935 12.99044 12.58876 13.48521 10.78980 11.63154
## 7 12.46074 12.71084 13.54322 13.16715 12.91295 13.54993 10.66501 11.46684
##         14       14       14       14       46       46       46        46
## 1 13.33813 10.68707 10.50566 13.34293 11.94451 11.87341 12.38611 10.775322
## 2 13.01547 11.31687 11.40723 13.00838 11.99501 12.44213 12.85150 11.308628
## 4 13.74558 11.61098 11.64108 13.68999 11.18267 11.99901 12.39643 10.135640
## 5 13.17377 10.57594 10.73333 12.84472 12.04778 12.17512 12.70096 11.490831
## 6 13.51650 10.89274 11.23297 12.87099 11.11287 11.81226 12.33629  9.988747
## 7 13.22606 11.12094 10.85304 12.79411 11.87069 12.15101 12.32984 11.055561
##         46       46       52       52       52       52       52       52
## 1 11.65472 12.20008 13.72116 15.84325 13.84699 13.67225 14.95402 13.92800
## 2 12.08618 12.31084 14.00145 15.75846 14.07381 13.97253 14.41967 14.26680
## 4 11.36945 11.46983 14.05301 15.79123 13.04178 13.98535 14.70859 13.58580
## 5 12.19034 12.50445 14.28985 16.17474 14.10764 14.22521 14.80528 14.16896
## 6 11.07662 11.70158 14.03926 15.57586 14.11859 13.71727 14.09711 14.13774
## 7 11.49501 11.87331 14.10627 15.84102 14.01907 14.19328 14.40134 14.01778
##         71       71       71       71       71       71       80       80
## 1 12.79305 13.49534 14.74705 11.71046 12.37689 14.55939 11.61802 13.06050
## 2 13.26141 14.26949 14.94869 12.65002 13.30678 14.60595 11.97982 12.42806
## 4 13.89742 14.38578 15.66771 12.69221 13.77950 15.52129 12.47229 13.05420
## 5 13.72369 14.36272 15.40687 13.03139 13.81389 15.37246 12.29748 12.83598
## 6 12.32133 13.08722 14.64378 11.32938 12.77221 14.18697 12.31141 12.48997
## 7 13.18406 13.86023 14.90655 12.58188 13.42590 14.84158 11.98158 12.68885
##         80       80       80       80
## 1 13.56278 12.65781 12.30486 13.82319
## 2 13.26751 12.76633 12.25820 13.54963
## 4 13.37528 13.21670 13.05134 13.45016
## 5 13.99109 12.26516 11.38041 13.02040
## 6 13.03412 12.46758 12.06282 13.30521
## 7 13.87849 12.32319 12.04388 13.34234

Note that the Mahalanobis distance was used in this case.

3. Classification.

All that is left to do is to classify the new image as belonging to some class or not. For that, we heuristically choose a threshold: a new image is classified as belonging to some class if the minimum error (the minimum distance between the new image and any class) is below the threshold, and classified as unknown otherwise.
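In symbols (our notation), writing $\Omega$ for the projected weights of the new image, $\Omega_c$ for the average weights of class $c$, and $\theta$ for the threshold, the rule is

$$ \text{class}(\Omega) = \begin{cases} \arg\min_c d(\Omega, \Omega_c) & \text{if } \min_c d(\Omega, \Omega_c) \le \theta, \\ 0 \;(\text{unknown}) & \text{otherwise.} \end{cases} $$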

threshold <- 8

accuracy <- 0
prediction  <- vector("numeric", ncol(error))
observation <- vector("numeric", ncol(error))
# All of these images are unseen, so the correct label is always "0".
for (i in 1:ncol(error)) {
  pred <- ""
  if (min(error[,i]) > threshold) {
    pred <- "0"
  } else {
    pred <- names(which(min(error[,i]) == error[,i]))
  }
  prediction[i] <- pred
  observation[i] <- "0"
  if (pred == "0") {
    accuracy <- accuracy + 1
  }
}

rbind(prediction, observation)
##             [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12]
## prediction  "0"  "0"  "0"  "0"  "0"  "0"  "0"  "0"  "0"  "0"   "0"   "0"  
## observation "0"  "0"  "0"  "0"  "0"  "0"  "0"  "0"  "0"  "0"   "0"   "0"  
##             [,13] [,14] [,15] [,16] [,17] [,18] [,19] [,20] [,21] [,22]
## prediction  "0"   "0"   "0"   "51"  "0"   "0"   "0"   "0"   "0"   "0"  
## observation "0"   "0"   "0"   "0"   "0"   "0"   "0"   "0"   "0"   "0"  
##             [,23] [,24] [,25] [,26] [,27] [,28] [,29] [,30] [,31] [,32]
## prediction  "0"   "0"   "0"   "0"   "0"   "0"   "0"   "0"   "0"   "0"  
## observation "0"   "0"   "0"   "0"   "0"   "0"   "0"   "0"   "0"   "0"  
##             [,33] [,34] [,35] [,36]
## prediction  "0"   "0"   "0"   "0"  
## observation "0"   "0"   "0"   "0"
accuracy/ncol(error)
## [1] 0.9722222

Note that the label 0 has been used for all never-before-seen observations in this case.

Hyperparameter optimization

Choosing the value of the threshold heuristically is not statistically sound for several reasons, the first of which is that it does not translate across the different distance types, since they have different magnitudes. Also, if we tune our system without testing it on already seen faces, we are unable to detect false negatives, which could result in worse model accuracy.

A much better approach to selecting the best value for our threshold is hyperparameter optimization, which has a robust statistical background. Several steps are taken:

  1. Split the data into train and test sets, so as to avoid overfitting.
  2. Train the model.
  3. Test the model on a sequence of values for each of the parameters to be found, and select the one that yields the highest classification accuracy.

1. Function definition.

We first need to define each of the different functions (PCA, classification, forecast) so as to make them easier to call when carrying out hyperparameter optimization. Note that we already implemented the distance function in a previous section.

pca

Some changes were made to the PCA algorithm when defining this function, mostly related to the interplay between PCA and data classification. When classifying a new face image, we first need to scale it in terms of the resulting eigenspace. For that, we need a way to return the mean and scale factors used in PCA, as well as the eigenfaces and eigenvalues. Using the attr function, we can achieve this: the factors are returned as attributes along with the new dataset.

pca <- function(data) {

  labels <- data$labels
  data$labels <- NULL

  # Center and scale the data, keeping both factors for later projections.
  scaled <- scale(data, center = TRUE, scale = TRUE)
  mean.face <- attr(scaled, "scaled:center")
  std.face  <- attr(scaled, "scaled:scale")

  # Reduced covariance matrix (n x n) with the same non-null eigenvalues.
  Sigma_ <- scaled%*%t(scaled) / (nrow(scaled)-1)

  eig          <- eigen(Sigma_)
  eigenvalues  <- eig$values
  eigenvectors <- eig$vectors

  # Keep enough components to explain 95% of the variance.
  prop.var <- eigenvalues / sum(eigenvalues)
  cum.var  <- cumsum(eigenvalues) / sum(eigenvalues)

  thres <- min(which(cum.var > .95))

  # Transform the reduced eigenvectors into unit-length eigenfaces.
  scaling    <- diag(eigenvalues[1:thres]^(-1/2)) / (sqrt(nrow(scaled)-1))
  eigenfaces <- t(scaled)%*%eigenvectors[,1:thres]%*%scaling

  # Project the training data and attach everything needed for classification.
  data.new <- data.frame(labels = labels,
                         x = t(t(eigenfaces)%*%t(scaled)))
  attr(data.new, "mean")  <- mean.face
  attr(data.new, "scale") <- std.face
  attr(data.new, "eigenfaces")  <- eigenfaces
  attr(data.new, "eigenvalues") <- eigenvalues
  return(data.new)
}

classify

The function is defined with no important changes; its functionality was already explained in previous sections. However, we need to include provisions for optimizing the threshold hyperparameter. To achieve this, it is enough to build a sequence of candidate values between the minimum and maximum of the per-image least distances to the class averages. The threshold in the sequence achieving the greatest accuracy is then selected for our system and returned by the function, along with the estimated accuracy.

classify <- function(train, test, type) {
  
  # Class (individual) averages in the eigenspace.
  classes <- aggregate(train, list(train$labels), mean)
  
  class.lb <- classes[,1]
  test.lb  <- test$labels
  classes[,c(1,2)] <- NULL
  test[,1] <- NULL
  classes <- as.matrix(classes)

  # Distance from every test image (columns) to every class (rows).
  error <- matrix(0, nrow(classes), nrow(test))
  for (j in 1:nrow(test)) {
    for (i in 1:nrow(classes)) {
      error[i,j] <- distance(unlist(test[j,]), classes[i,], type)
    }
  }
  
  colnames(error) <- test.lb
  rownames(error) <- class.lb
  
  # Candidate thresholds span the range of per-image minimum distances.
  min.err <- min(apply(error, 2, min))
  max.err <- max(apply(error, 2, min))
  
  threshold <- seq(min.err, max.err, by = (max.err - min.err)/200)
  acc.thres <- vector("numeric", length(threshold))
  for (j in 1:length(threshold)) { 
    accuracy <- 0
    
    for (i in 1:ncol(error)) {
      pred <- ""
      if (min(error[,i]) > threshold[j]) {
        pred <- "0"
      } else {
        pred <- names(which(min(error[,i]) == error[,i]))
      }
      if (pred == colnames(error)[i]) {
        accuracy <- accuracy + 1
      }
    }
    
    acc.thres[j] <- accuracy/ncol(error)
  }
  
  # Return the accuracy and threshold of the best candidate.
  aux <- which(max(acc.thres) == acc.thres)[1]
  return(c(acc.thres[aux], threshold[aux]))
}

forecast

This newly created function serves as an auxiliary method for hyperparameter optimization over the three different distances: Euclidean, Mahalanobis, and Manhattan. It also structures the classification results in a friendlier way.

forecast <- function(type) {
  parameters <- data.frame(train = classify(data.new, data.new, type),
                           test = classify(data.new, projection, type))
  
  return(parameters)
}

2. Optimization.

In order to optimize the distance and threshold parameters, we first need to decide on a cross-validation algorithm, so as to provide more statistical robustness to the selection process. We decided on k-fold cross-validation, since it generally gives the most accurate estimates of test error when compared against other validation methods.

We are then set to begin our hyperparameter optimization. Let us start by establishing the number of repetitions and the number of folds in each of them. Note that these parameters are chosen heuristically. We also have to restore the values of faces, since they were modified by our earlier exploration of the face recognition problem.

n.rep  <- 5
n.fold <- 6

We then prepare the data structures in which the parameters selected in each of the repetitions are to be stored for later analysis. Two matrices with the appropriate dimensions (number of distance methods x number of repetitions) serve to store accuracy. Identical structures are used to store threshold values.

train.acc <- data.frame(rep = matrix(0, nrow = 3, ncol = n.rep),
                        distance = 1:3)
test.acc  <- data.frame(rep = matrix(0, nrow = 3, ncol = n.rep),
                        distance = 1:3)

train.thres <- data.frame(rep = matrix(0, nrow = 3, ncol = n.rep),
                        distance = 1:3)
test.thres  <- data.frame(rep = matrix(0, nrow = 3, ncol = n.rep),
                        distance = 1:3)

We are ready to start the optimization process. Let us describe the work done by the optimization algorithm in one loop, so as to fully understand how the best parameter is selected. Below is a detailed, step-by-step explanation of the optimization procedure.

For each repetition:

  1. Appropriate data structures are prepared to store the selections made by the classifier in each of the cross-validation folds.
  2. A set of faces is selected as unseen and is left out of the system’s training, in order to simulate what would happen if the model were fed never-before-seen pictures. All labels of this set are set to 0.
  3. The image data is randomly reordered so as to add randomness to the optimization procedure.

For each fold:

  1. The remaining image data is split into train and test sets, and the unseen set is included in the latter.
  2. PCA is performed on the training set, thus obtaining all parameters needed for classification: eigenfaces, eigenvalues, mean face and scale factor.
  3. The test dataset is projected onto the training eigenspace, after applying the scaling corrections to it.
  4. The new data is classified with the aim of obtaining the optimal parameters, which are stored appropriately.

At the end of each repetition, the parameters obtained in the folds are averaged and stored in the corresponding repetition/distance slot.

for (j in 1:n.rep) {
  train.fold <- matrix(0, nrow = 3, ncol = n.fold)
  test.fold  <- matrix(0, nrow = 3, ncol = n.fold)
  
  train.thres.f <- matrix(0, nrow = 3, ncol = n.fold)
  test.thres.f  <- matrix(0, nrow = 3, ncol = n.fold)
  
  faces <- data.frame(labels = factor(rep(1:80, each = 6)),
                    x = data)
  
  # Randomly pick 6 individuals (each occupying 6 consecutive rows out of 480)
  # to act as the unseen set, and relabel them as unknown ("0").
  random  <- sample(seq(1, 475, 6), 6)
  indices <- c((random[1]):(random[1]+5), (random[2]):(random[2]+5), (random[3]):(random[3]+5),
               (random[4]):(random[4]+5), (random[5]):(random[5]+5), (random[6]):(random[6]+5))
  unseen <- faces[indices,]
  unseen$labels <- rep("0", length(indices))
  
  faces <- faces[-indices,]
  faces <- faces[sample(1:nrow(faces)),]
  
  for (i in 1:n.fold)  {
    ind.test <- ((i-1)*nrow(faces)/n.fold+1):(nrow(faces)/n.fold*i)
    train    <- faces[-ind.test,]
    test     <- rbind(faces[ind.test,], unseen)
    
    # PCA on the training fold; retrieve the stored factors and eigen-structure.
    data.new <- pca(train)
    mean  <- attr(data.new, "mean")
    scale <- attr(data.new, "scale")
    eigenfaces  <- attr(data.new, "eigenfaces")
    eigenvalues <- attr(data.new, "eigenvalues")

    projection  <- data.frame(labels = test$labels,
                              x = t(t(eigenfaces)%*%t(scale(test[,c(2:ncol(test))], center = mean, scale = scale))))
    
    # forecast() returns (accuracy, threshold) for train and test, per distance type.
    temp <- sapply(1:3, forecast)
    temp <- unlist(temp)
    train.fold[,i]    <- temp[c(1,5,9)]
    train.thres.f[,i] <- temp[c(2,6,10)]
    test.fold[,i]     <- temp[c(3,7,11)]
    test.thres.f[,i]  <- temp[c(4,8,12)]
  }

  # Average over folds for this repetition.
  train.acc[,j]   <- apply(train.fold, 1, mean)
  train.thres[,j] <- apply(train.thres.f, 1, mean)
  test.acc[,j]    <- apply(test.fold, 1, mean)
  test.thres[,j] <- apply(test.thres.f, 1, mean)
}

3. Parameter selection.

Let us inspect our results so as to obtain further insight into the best combination of parameters. The output below shows the selected threshold values for each distance (rows) and repetition (columns).

##        rep.1      rep.2      rep.3      rep.4       rep.5 distance
## 1  164.58814  158.07840  159.42913  157.58935  159.284105        1
## 2   11.02203   10.45183   10.34278   10.30604    9.735511        2
## 3 1334.80163 1239.02661 1308.84421 1265.58574 1238.354216        3

As can be noted from the results above, the best parameter combination uses the Manhattan distance with a threshold value of 1308.844 (third repetition). Thus, these are the parameters that we are going to feed to our final function.
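A possible way to extract such a combination programmatically, picking the repetition/distance pair with the highest test accuracy (a sketch of ours; the report selected the combination by inspecting the results directly):

acc.mat <- as.matrix(test.acc[, 1:n.rep])      # test accuracies: distances x repetitions
best    <- which(acc.mat == max(acc.mat), arr.ind = TRUE)[1,]
best.distance  <- test.acc$distance[best["row"]]
best.threshold <- test.thres[best["row"], best["col"]]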

Function definition

All that is left to do is to design our final classifier with the parameters we just found. For that, it is best to compute the eigenspace projection of all our data, so as to train our classifier on as much data as possible. We just have to call the function pca on the whole dataset and feed the result, as well as the mean face, scale factor, eigenvalues and eigenfaces, to our final function. The function will then project new input data onto this eigenspace and classify it according to whether or not it is a face the system has already seen. The function will return the corresponding 1 to 80 label if the new image belongs to any of the face classes from the training set, or a 0 if this is not the case.

1. Feeding entire dataset.

faces <- data.frame(labels = factor(rep(1:80, each = 6)),
                    x = data)

data.new <- pca(faces)
mean  <- attr(data.new, "mean")
scale <- attr(data.new, "scale")
eigenfaces  <- attr(data.new, "eigenfaces")
eigenvalues <- attr(data.new, "eigenvalues")

2. Function declaration.

The functionality of this function has been explained in previous sections of this report. The function itself, along with the appropriate parameters, is provided in a separate R file. Please be aware that the distance function defined in the same document needs to be run in order for the classifier to produce results.

recognize <- function(images, train, mean.face, std.face, eigenfaces, threshold, type) {
  
  # Class (individual) averages of the projected training data.
  classes  <- aggregate(train, list(train$labels), mean)
  class.lb <- classes[,1]
  classes[,c(1,2)] <- NULL
  classes <- as.matrix(classes)
  
  # Project the new images onto the eigenspace, using the stored scaling factors.
  test <- t(t(eigenfaces)%*%t(scale(images, center = mean.face, scale = std.face)))

  error <- matrix(0, nrow(classes), nrow(test))
  for (j in 1:nrow(test)) {
    for (i in 1:nrow(classes)) {
      error[i,j] <- distance(unlist(test[j,]), classes[i,], type)
    }
  }
  rownames(error) <- class.lb
  
  # Assign each image to its closest class, or to "0" (unknown) when the
  # minimum distance exceeds the threshold.
  predictions <- vector("character", ncol(error))
  for (i in 1:ncol(error)) {
    if (min(error[,i]) > threshold) {
      predictions[i] <- "0"
    } else {
      predictions[i] <- names(which(min(error[,i]) == error[,i]))
    }
  }
  
  return(predictions)
}
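As a hypothetical usage sketch (the file names below are placeholders; the threshold and distance type are the ones selected during hyperparameter optimization), new images are flattened exactly as the training data was and then passed to the classifier:

# Hypothetical new images to classify; file names are placeholders.
new.files <- c("new-face-1.bmp", "new-face-2.bmp")

# Flatten each image into a 1 x 59400 RGB vector, as done for the training set.
images <- t(sapply(new.files, function(f) {
  im <- read.bmp(f)
  c(as.vector(im[,,1]), as.vector(im[,,2]), as.vector(im[,,3]))
}))

# Classify with the Manhattan distance (type = 3) and the optimized threshold.
recognize(images, data.new, mean, scale, eigenfaces,
          threshold = 1308.844, type = 3)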