Facial Keypoints Detection

Final Project of ADA

Jun Zhao, Renjie Chen, Rongdong Wu, Xinyu Li, Liangquan Zhou

Introduction

Datasets: Facial Keypoints Detection, Kaggle (https://www.kaggle.com/c/facial-keypoints-detection)

The goal is to locate specific keypoints on face images.

The training set has 7049 examples of face images with corresponding keypoint locations. We use this data to train our models.

In total, we have 7049 rows, each with 31 columns. The first 30 columns are the keypoint coordinates, read by R as numeric values. The last column is a string representation of the image. The following image is one of the photos in the training set.

Data Preparation

We split the data into the keypoint locations, which are the target variables, and the image pixel strings.

Then we transformed the strings into matrices; each matrix can be visualized as a photo with 96×96 pixels.

We can also plot the keypoints on the training images using the keypoint location variables.
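As an illustration, here is a minimal R sketch of these steps, assuming the Kaggle file is named training.csv and the image column is called Image, as in the competition data (plotting orientation may need adjusting):

```r
# Read the training data: 30 numeric keypoint columns plus one image string.
train <- read.csv("training.csv", stringsAsFactors = FALSE)

keypoints   <- train[, 1:30]   # target variables
img_strings <- train$Image     # space-separated pixel values

# Turn one pixel string into a 96 x 96 matrix of grayscale values (0-255).
to_matrix <- function(s) {
  matrix(as.integer(strsplit(s, " ")[[1]]), nrow = 96, ncol = 96)
}

# Visualize the first photo with its keypoints; image() draws the matrix
# in its own orientation, so the photo may need to be flipped/transposed.
m <- to_matrix(img_strings[1])
image(m, col = gray((0:255) / 255), axes = FALSE)
points(unlist(keypoints[1, seq(1, 30, 2)]) / 96,
       1 - unlist(keypoints[1, seq(2, 30, 2)]) / 96, col = "red", pch = 19)
```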

Example

Active Shape Model

Basic idea of Active Shape Model

  • The ASM model is trained from manually drawn contours in training images. The ASM model finds the main variations in the training data using Principal Component Analysis (PCA), which enables the model to automatically recognize if a contour is a possible/good object contour.

  • Also, the ASM model contains matrices describing the texture along the lines perpendicular to each control point; these are used to correct the point positions in the search step.

Figure: ASM search progress (initial shape; after 5, 15, and 20 iterations).

Basic idea of Active Shape Model

  • After creating the ASM model, an initial contour is deformed by finding the best texture match for the control points. This is an iterative process, in which the movement of the control points is limited by what the ASM model recognizes from the training data as a "normal" object contour.

  • Two parts:

    • 1. Training
    • 2. Searching.

Training

For an image \(i\) with \(m\) keypoints, denote

\[ \mathbf{X_i} = (x_1^i, y_1^i, x_2^i, y_2^i,...,x_m^i, y_m^i)^T, \quad i = 1,2,...,n. \]

where \((x_j^i, y_j^i)\) represents the \(i^{th}\) image's \(j^{th}\) keypoint. \(n\) is the training sample size.

1. Align the training set.

Alignment puts all the training shapes in a common coordinate frame so that they are comparable to each other.

  • Minimize: \(D = \sum |\mathbf{X_i} - \bar{\mathbf{X}}|^2\)

For two shapes \(X_1, X_2\):

\[ \mathbf{X_1} = (x_{11}, y_{11}, x_{12}, y_{12},...,x_{1m}, y_{1m})^T \]

\[ \mathbf{X_2} = (x_{21}, y_{21}, x_{22}, y_{22},...,x_{2m}, y_{2m})^T \]

Training

Align \(X_2\) to \(X_1\) by translation, scaling and rotation, to

  • Minimize: \[ \begin{equation} E = [\mathbf{X_1} - M[\mathbf{s},\mathbf{\theta}](\mathbf{X_2}) - \mathbf{t}]^T \mathbf{W} [\mathbf{X_1} - M[\mathbf{s},\mathbf{\theta}](\mathbf{X_2}) - \mathbf{t}] \label{(1)} \end{equation} \]

  • \(M[\mathbf{s},\mathbf{\theta}](\mathbf{X})\): scales \(\mathbf{X}\) by \(s\) and rotates it by \(\theta\) (written out after this list).

  • \(\mathbf{t} = (t_x,t_y)\): translation.

  • \(\mathbf{W} = diag(w_1,...,w_m)\): a diagonal weight matrix that gives larger weights to stable keypoints.

    • A stable keypoint is one that only moves a little relative to the other keypoints.
    • \(R_{kl}^i\): the distance between the \(k^{th}\) point and the \(l^{th}\) point in the \(i^{th}\) shape.
    • \(V_{R_{kl}}\): the variance of \(R_{kl}^i\) over the training set. Then \(w_k = (\sum\limits_{l=1}^m V_{R_{kl}})^{-1}\).
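For concreteness, in the standard ASM formulation the scale-and-rotation operator acts on each point \((x_{jk}, y_{jk})\) of a shape as:

\[ M[s,\theta]\begin{pmatrix} x_{jk} \\ y_{jk} \end{pmatrix} = \begin{pmatrix} (s\cos\theta)\, x_{jk} - (s\sin\theta)\, y_{jk} \\ (s\sin\theta)\, x_{jk} + (s\cos\theta)\, y_{jk} \end{pmatrix} \]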

Then we can use a simple iterative approach to align the training set to the mean shape \(\bar{\mathbf{X}}\).

Training

2. Modelling Shape Variation: PCA.

1. The mean of the shapes: \(\bar{\mathbf{X}} = \frac{1}{n} \sum\limits_{i =1}^{n}\mathbf{X_i}\)

2. The covariance of the shapes: \(\mathbf{S}=\frac{1}{n-1} \sum\limits_{i=1}^{n}(\mathbf{X_i} - \bar{\mathbf{X}})(\mathbf{X_i} - \bar{\mathbf{X}})^T\)

3. Compute the first \(t\) eigenvalues of \(\mathbf{S}\), \(\lambda_1 \geq ... \geq \lambda_t\), such that \[ \sum\limits_{i=1}^t \lambda_i \geq 0.95\, V_T \] where \(V_T = \sum\limits_i \lambda_i\) is the total variance.

Training

Then any shape in the training set can be represented as:

\[ \begin{equation} \mathbf{X_i} \approx \bar{\mathbf{X}} + \mathbf{P} \mathbf{b} \label{2} \end{equation} \]

  • \(\mathbf{P} = (\mathbf{p_1},...,\mathbf{p_t})\): the matrix whose columns are the first \(t\) eigenvectors of \(\mathbf{S}\).

  • \(\mathbf{b} = \mathbf{P}^T (\mathbf{X_i}-\bar{\mathbf{X}})\): a vector of \(t\) shape parameters

    • defines the parameters of a deformable model.
    • By varying the elements of \(\mathbf{b}\) we can vary the shape \(\mathbf{X}\).
    • Limits \(-3\sqrt{\lambda_i}< b_i < 3\sqrt{\lambda_i}\) are applied to ensure that the generated shape stays similar to those in the original training set (see the sketch after this list).
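A minimal R sketch of this shape model, assuming `shapes` is a hypothetical n × 2m matrix holding one aligned shape vector per row:

```r
# shapes: n x 2m matrix of aligned training shapes (assumed available).
xbar <- colMeans(shapes)            # mean shape
S    <- cov(shapes)                 # shape covariance matrix

eig    <- eigen(S, symmetric = TRUE)
lambda <- eig$values                # eigenvalues, in decreasing order

# Keep the first t modes that explain 95% of the total variance V_T.
t_modes <- which(cumsum(lambda) / sum(lambda) >= 0.95)[1]
P <- eig$vectors[, 1:t_modes, drop = FALSE]

# Project a training shape onto the model and reconstruct it.
b   <- t(P) %*% (shapes[1, ] - xbar)
lim <- 3 * sqrt(lambda[1:t_modes])
b   <- pmin(pmax(b, -lim), lim)     # clamp each b_i to +/- 3 sqrt(lambda_i)
x_recon <- xbar + P %*% b
```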

Training

3. Modelling local structure

  • For a given point, sample along a profile \(k\) pixels either side of the model point in the \(i^{th}\) training image.

  • We have \(2k + 1\) samples which can be put in a vector \(\mathbf{g_i}\). Normalize the sample to reduce the effects of global intensity changes \[ \mathbf{g_i} \rightarrow \frac{1}{\sum_j |g_{ij}|}\mathbf{g_i} \]

  • Repeat for each training image to get a set of normalised samples {\(\mathbf{g_i}\)} for the given model point. Compute mean \(\mathbf{\bar{g}}\) and covariance \(\mathbf{S}_g\).

We use the Mahalanobis distance to measure the quality of fit of a new sample \(\mathbf{g}_s\):

\[ \begin{equation}\label{3} f(\mathbf{g}_s) = (\mathbf{g}_s - \mathbf{\bar{g}})^T \mathbf{S}^{-1}_g (\mathbf{g}_s - \mathbf{\bar{g}}) \end{equation} \]
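In R this fit function is essentially a one-liner; base R's `mahalanobis()` computes the same quadratic form:

```r
# Quality of fit of a new normalized profile sample g_s (lower is better).
profile_fit <- function(gs, gbar, Sg) {
  d <- gs - gbar
  as.numeric(t(d) %*% solve(Sg) %*% d)
}

# Equivalent: mahalanobis(gs, center = gbar, cov = Sg)
```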

Searching

  • Use the mean shape \(\mathbf{\bar{X}}\) as a rough starting approximation.

  • The search works as follows. By choosing a set of shape parameters \(\mathbf{b}\) for the model we define the shape of the object in an object-centred coordinate frame, and we can create an instance \(\mathbf{X}\) of the model in the image frame by specifying the position, orientation and scale. An iterative approach to improving the fit of the instance \(\mathbf{X}\) to an image proceeds as follows:

    • 1. Examine a region of the image around each point \(\mathbf{X_i}\) to find the best nearby match \(\mathbf{X'_i}\) for that point.
    • 2. Update the parameters \((\mathbf{t}, \mathbf{s}, \mathbf{\theta}, \mathbf{b})\) to best fit the newly found points \(\mathbf{X'}\).
    • 3. Repeat until convergence (see the loop sketch below).
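A high-level sketch of this loop; `find_best_matches()` (the profile search minimizing \(f(\mathbf{g}_s)\)) and `fit_pose_and_shape()` (the constrained parameter update) are hypothetical helpers:

```r
# Iterative ASM search: start from the mean shape, refine until convergence.
X <- xbar
for (iter in 1:max_iter) {
  # 1. For each model point, search along its profile for the position whose
  #    texture best matches the training statistics.
  X_target <- find_best_matches(X, img, gbar_list, Sg_list)

  # 2. Update pose (t, s, theta) and shape parameters b to best fit the
  #    target points, clamping each b_i to +/- 3 sqrt(lambda_i).
  X_new <- fit_pose_and_shape(X_target, xbar, P, lambda)

  # 3. Stop when the points barely move.
  if (max(abs(X_new - X)) < tol) break
  X <- X_new
}
```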

Result

The ASM model can roughly capture the contour in the image, but the shape is not stable (RMSE > 6). This may be caused by the insufficient number of points (15): more points could capture the contour more precisely, so the small number of points likely limits the precision of ASM.

Figure: ASM search on a test image (initial shape; after 5 iterations).

rpart Decision Tree

Recursive partitioning creates a decision tree that classifies members of the population based on several variables.

Process

  • We use the training photos to train models and predict keypoint locations for photos that have none
  • We first need to extract the eigenvalues of each photo as features for the rpart decision tree
  • We build a correspondence between each photo and a 1*10000 feature vector
  • The decision tree classifies the photos into groups by their eigenvalues
  • We compute the average keypoint locations within each group of the training set and then assign those locations to the corresponding groups of the test set (see the sketch below)
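A hedged sketch of this step with the rpart package; the feature extraction is elided, and `features` / `features_test` are hypothetical data frames with one feature vector per photo:

```r
library(rpart)

# Grow one regression tree per target coordinate; the tree's leaves are the
# "groups" of similar photos.
df  <- cbind(features, y = train$left_eye_center_x)
fit <- rpart(y ~ ., data = df, method = "anova")

# For an "anova" tree, the prediction for a test photo is the mean target
# value of the training photos in the same leaf, i.e. the group average.
pred <- predict(fit, newdata = features_test)
```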

Example

Deviation

The left eye center predicted by our decision tree looks close to the truth, if not exact.

The next step is calculating the RMSE for each keypoint. In the example above, the RMSEs of the x and y coordinates of the left eye center are 3.617515 and 3.140662 respectively, which is acceptable.
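Here, the RMSE of a coordinate over the \(N\) evaluation images is the usual:

\[ \mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(y_i - \hat{y}_i)^2} \]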

After testing the other keypoints, we find that the decision-tree classification method performs well on the keypoint detection task. The tables below show the RMSEs of all keypoint coordinates predicted by the rpart decision tree.

Results

Keypoint (x-coordinate)        RMSE
left_eye_center_x              3.617515
right_eye_center_x             3.45202
left_eye_inner_corner_x        4.322909
left_eye_outer_corner_x        5.082524
right_eye_inner_corner_x       3.631528
right_eye_outer_corner_x       4.063313
left_eyebrow_inner_end_x       5.069073
left_eyebrow_outer_end_x       6.275277
right_eyebrow_inner_end_x      4.47991
right_eyebrow_outer_end_x      5.032551
nose_tip_x                     6.650454
mouth_left_corner_x            5.721183
mouth_right_corner_x           5.638065
mouth_center_top_lip_x         5.105302
mouth_center_bottom_lip_x      5.377807

Results

Keypoint (y-coordinate)        RMSE
left_eye_center_y              3.140662
right_eye_center_y             4.027928
left_eye_inner_corner_y        3.249608
left_eye_outer_corner_y        3.625195
right_eye_inner_corner_y       3.165341
right_eye_outer_corner_y       3.776114
left_eyebrow_inner_end_y       3.695398
left_eyebrow_outer_end_y       4.885833
right_eyebrow_inner_end_y      4.166424
right_eyebrow_outer_end_y      5.030088
nose_tip_y                     5.641408
mouth_left_corner_y            5.286611
mouth_right_corner_y           5.252617
mouth_center_top_lip_y         7.289094
mouth_center_bottom_lip_y      6.19728

Summary

The deviation is relatively small when predicting the eye centers and corners, because the eye area is rich in detail and its colors vary considerably, which makes these points easy to locate.

In contrast, the errors are larger for the mouth centers, as the colors around the mouth are very similar and hard to distinguish.

Neural Network

First Attempt

At first we tried a simple "vanilla" neural net with just one hidden layer:

\[ \begin{align} Z_m&=\sigma(\alpha_{0m}+\alpha_m^TX), \quad m=1,2,...,M \nonumber\\ T_k&=\beta_{0k}+\beta_k^TZ, \quad k=1,...,K \nonumber\\ f_k(X)&=g_k(T), \quad k=1,...,K \nonumber \end{align} \]

This model came from the book The Elements of Statistical Learning. The \(\sigma(v)\) here is called an activation function, and is usually chosen to be the sigmoid function:

\[ \mathrm{sigmoid}(v) = \frac{1}{1+e^{-v}} \]
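A minimal R sketch of the corresponding forward pass; the weights here are random placeholders, and for regression the output function \(g_k\) is the identity:

```r
sigmoid <- function(v) 1 / (1 + exp(-v))

# Forward pass: p inputs -> M hidden units -> K outputs.
forward <- function(x, alpha0, alpha, beta0, beta) {
  Z  <- sigmoid(alpha0 + alpha %*% x)  # hidden layer, M x 1
  Tk <- beta0 + beta %*% Z             # output layer, K x 1
  Tk                                   # g_k is the identity for regression
}

# Placeholder dimensions: 96 * 96 = 9216 pixels in, 30 coordinates out.
p <- 9216; M <- 100; K <- 30
alpha0 <- rnorm(M); alpha <- matrix(rnorm(M * p, sd = 0.01), M, p)
beta0  <- rnorm(K); beta  <- matrix(rnorm(K * M, sd = 0.01), K, M)
yhat   <- forward(runif(p), alpha0, alpha, beta0, beta)
```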

Training Algorithm

When training the model, we usually want to minimize some cost function. In this project, since submissions are scored on the root mean squared error (RMSE), we use the sum of squared errors as our cost function.

\[ R(\theta)=\sum_{k=1}^{K}\sum_{i=1}^{N}(y_{ik}-f_k(x_i))^2 \]

According to the book, the global minimizer of this function is usually not a good solution, because it "is likely to be an overfit solution." Also, as the model becomes more complex, there is usually no way to find the minimizer analytically, so an iterative algorithm is used to find a solution.

Stochastic Gradient Descent

Because our model is a composite function, the gradient can be computed easily using the chain rule, and we used this to search for the minimum.

Here we used an algorithm called stochastic gradient descent (SGD) to train our model. According to Wikipedia, SGD is suitable for minimizing "an objective function that is written as a sum of differentiable functions." The procedure is somewhat similar to stepwise regression: we iterate through the parameter space to find the argmin.
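Schematically, one epoch of SGD looks like this; `grad_R` stands for a hypothetical function returning the gradient of the single-example squared error, computed via the chain rule:

```r
# theta: all network weights flattened into one vector; eta: learning rate.
sgd_epoch <- function(theta, X, Y, eta, grad_R) {
  for (i in sample(nrow(X))) {          # visit the examples in random order
    g <- grad_R(theta, X[i, ], Y[i, ])  # gradient of (y_i - f(x_i))^2
    theta <- theta - eta * g            # step against the gradient
  }
  theta
}
```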

Results

The result, as we expected, wasn't good: we got an RMSE of 4.03645, whereas a model using the historical average of the targets has an RMSE of approximately 3.9.

Learning Rate

We then made several attempts to improve the model. First, we tried adjusting the learning rate:

\[ \mathbf{\beta}_{t+1}=\mathbf{\beta}_t+\Delta \mathbf{\beta}_t \]

where:

\[ \Delta \mathbf{\beta}_t=-\eta g_t \]

\(g_t\) is the gradient and \(\eta\) is the learning rate that controls how large of a step to take.

Learning Rate

Learning Rate   Train Objective   Valid Objective
0.1             38245.73          186372.25
0.01            197.96            138.23
0.001           196.21            137.64
0.0001          194.87            134.20
0.00001         197.37            135.38

Number of Hidden Units

Given that, in general, more hidden units perform better than fewer, we increased the number of hidden units to 5000. It took a long time to train the model, but the result was still no better.

RMSE: 4.04086

Other Activation Functions

Aside from the sigmoid, we can also use the tanh function as the activation:

\[ \tanh(x)=\frac{1-e^{-2x}}{1+e^{-2x}} \]

RMSE: 7.10821

What's Wrong

By this point we had noticed that this "vanilla" neural network is not really suitable for the task; we suspected that the model failed to establish a connection between the features and the responses. When we visualized the keypoints, we saw that each keypoint's position is similar from image to image, so a model that simply predicts the average can already do fairly well, which confirmed our suspicion.

We believe the reason is the disconnect between the features and the responses. For each record, the feature vector has length 9216, which is just a flattened 96×96 grayscale image, whereas the responses are 30 values between 0 and 96, each representing a position on the \(x\) or \(y\) axis. The network has trouble connecting the two.

Experiment

To verify this, we ran an experiment in which we replaced all the images with random values between 0 and 255 and fitted the model again. We got an RMSE similar to the previous result, indicating that the model extracts little information from the features.
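The check itself is straightforward; a sketch, with `train_images` a hypothetical n × 9216 pixel matrix:

```r
# Replace every image with uniform random pixels in [0, 255]; if the refitted
# model reaches a similar RMSE, it was learning little from the real pixels.
n <- nrow(train_images)
random_images <- matrix(runif(n * 9216, min = 0, max = 255), nrow = n)
```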

Quick Fix

We tried to fix the model's failure to learn by inserting a layer in the middle containing 15 or 30 hidden units, in the hope of giving the model a hint that we are trying to predict coordinates.

As it turned out, inserting a hidden layer with 15 units helped a lot: the model finally started to learn, and we finally beat the historical-average model with an RMSE of 3.936697.

Unsupervised Pretraining

A painful but promising way to improve performance is to use a pretraining method; here we used a denoising autoencoder (DAE).

Autoencoders are usually used for dimension reduction, but here we hope the DAE can provide a compressed set of features that our model will understand.
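A compact sketch of the DAE idea (a rough single-example update with a sigmoid encoder and linear decoder, not our exact implementation; constant factors are folded into the learning rate):

```r
# One SGD step of a denoising autoencoder.
# W: h x p encoder weights, b_enc: h encoder bias,
# V: p x h decoder weights, b_dec: p decoder bias.
dae_step <- function(x, W, b_enc, V, b_dec, eta, corrupt = 0.3) {
  x_tilde <- x
  x_tilde[runif(length(x)) < corrupt] <- 0          # masking noise

  h     <- 1 / (1 + exp(-(W %*% x_tilde + b_enc)))  # compressed features
  x_hat <- V %*% h + b_dec                          # reconstruction
  e     <- x_hat - x                                # reconstruction error

  # Backpropagate the squared reconstruction error.
  grad_h <- (t(V) %*% e) * h * (1 - h)
  list(W     = W - eta * grad_h %*% t(x_tilde),
       b_enc = b_enc - eta * grad_h,
       V     = V - eta * e %*% t(h),
       b_dec = b_dec - eta * e)
}
```

After training, the encoder output \(h\) can replace the raw pixels as the input features for the keypoint network.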

Results

The results seem decent, with an RMSE of 3.81368, the lowest we've achieved so far.

Results

Figure: learning curves.