Using a Multivariate Gaussian, Mahalanobis Distance and the F1 Measure to Choose the Right Probability Threshold from the Validation Set to Detect Outliers

In this article, a simple multivariate Gaussian distribution will be used to find the outliers in an image.

  1. We shall use the following apples and oranges image for the outlier detection.

  1. The color channels R, G and B will form the variables for this image data, as shown in the following figure.

(Figure: the apples and oranges image with its R, G, B channels as the variables)

  1. First, we fit a 3-dimensional Gaussian distribution to the image data, using MLE estimates for the parameters of the Gaussian distribution. The pdf of the multivariate Gaussian has the following form (we need to estimate the mean and the covariance matrix).

\(p(x) = \frac{1}{(2\pi)^{n/2}\,|\Sigma|^{1/2}}\exp\left(-\tfrac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)\right)\)
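
As a concrete reference, here is a minimal R sketch of this MLE fit, assuming the image is read with the `png` package as an RGB array (the file name `apples_oranges.png` is hypothetical); since the covariance matrix printed below is diagonal, only per-channel variances are estimated here.

```r
library(png)

# read the image as an H x W x 3 array of R, G, B intensities in [0, 1]
img <- readPNG("apples_oranges.png")   # hypothetical file name

# flatten to an (H*W) x 3 matrix: one row per pixel, columns r, g, b
X <- cbind(r = as.vector(img[, , 1]),
           g = as.vector(img[, , 2]),
           b = as.vector(img[, , 3]))

# MLE of the mean: per-channel sample means
mu <- colMeans(X)

# MLE of a diagonal covariance: per-channel variances, dividing by n (not n - 1)
n     <- nrow(X)
sigma <- diag(colSums(sweep(X, 2, mu)^2) / n)

print("MLE estimate for mean");              print(mu)
print("MLE estimate for covariance matrix"); print(sigma)
```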

  1. After estimating the distribution, the probability that each data point (pixel) comes from the distribution is computed. The (discretized) probability values are overlaid (as alpha values) on the image itself to visualize the data points with low probabilities (low alpha values).
## [1] "MLE estimate for mean"
##         r         g         b 
## 0.5693976 0.4987922 0.1681461
## [1] "MLE estimate for covariance matrix"
##            [,1]       [,2]       [,3]
## [1,] 0.05288263 0.00000000 0.00000000
## [2,] 0.00000000 0.04169364 0.00000000
## [3,] 0.00000000 0.00000000 0.02009812
## [1] "Visualizing Gaussian fit"

  1. The following animation shows the outliers detected in the image as the probability threshold is varied.

(Animation: outliers detected at different probability thresholds)
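
A sketch of the thresholding step itself, reusing `img` and `p` from the sketches above; the threshold values swept in the animation are not given in the text, so those below are purely illustrative.

```r
library(png)

# paint pixels whose density falls below the threshold eps black (the outliers)
mark_outliers <- function(img, p, eps) {
  h <- dim(img)[1]; w <- dim(img)[2]
  mask <- matrix(p < eps, h, w)           # TRUE where the pixel is an outlier
  out  <- img
  for (k in 1:3) out[, , k][mask] <- 0
  out
}

# one frame of the animation per (illustrative) threshold value
for (eps in c(0.01, 0.05, 0.1, 0.2, 0.5)) {
  writePNG(mark_outliers(img, p, eps), sprintf("outliers_eps_%.2f.png", eps))
}
```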

  1. Next, a threshold on the Mahalanobis distance \(d=\sqrt{(x-\mu)^T\Sigma^{-1}(x-\mu)}\) is used to mark the outlier points in the image, as shown in the following animation.

(Animation: outliers marked using the Mahalanobis distance threshold)
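
The same marking can be driven by the Mahalanobis distance instead of the raw density, using base R's `mahalanobis()` (which returns the squared distance); the distance thresholds below are again illustrative.

```r
library(png)

# Mahalanobis distance of every pixel from the fitted Gaussian
d <- sqrt(mahalanobis(X, center = mu, cov = sigma))

# pixels farther away than the distance threshold t are marked as outliers
h <- dim(img)[1]; w <- dim(img)[2]
for (t in 1:4) {                          # illustrative distance thresholds
  mask <- matrix(d > t, h, w)
  out  <- img
  for (k in 1:3) out[, , k][mask] <- 0
  writePNG(out, sprintf("outliers_mahalanobis_d%d.png", t))
}
```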

  1. Finally, the image dataset is divided into training and validation datasets.

  2. The following two white cut-out portions of the image are used as the validation dataset: the first one (the points from the orange) is given label 1 (since we want the orange to be detected as an outlier) and the second one label 0, as shown below. The rest of the image is used as the training dataset, from which the parameters of the multivariate Gaussian fit are estimated; a code sketch of this split follows.
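
A sketch of how this split could be set up; the rectangle coordinates of the two cut-out regions below are purely hypothetical (the actual regions are the white patches shown in the figure), and `img` and `X` are reused from the sketches above.

```r
# hypothetical rectangle coordinates (rows, columns) of the two cut-out regions
orange_rows <- 100:160; orange_cols <- 200:260   # patch on the orange -> label 1
apple_rows  <- 300:360; apple_cols  <- 120:180   # patch elsewhere     -> label 0

h <- dim(img)[1]; w <- dim(img)[2]

# label matrix: 1 for the orange patch, 0 for the other patch, NA for training pixels
lab <- matrix(NA_real_, h, w)
lab[orange_rows, orange_cols] <- 1
lab[apple_rows,  apple_cols]  <- 0

val_idx <- !is.na(as.vector(lab))      # column-major order, matching the rows of X
X_val   <- X[val_idx, ];  y_val <- as.vector(lab)[val_idx]
X_train <- X[!val_idx, ]               # the rest of the image is the training set

# refit the (diagonal) Gaussian on the training pixels only
mu_train    <- colMeans(X_train)
sigma_train <- diag(colSums(sweep(X_train, 2, mu_train)^2) / nrow(X_train))
```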

  1. Now the probability of each data point in the validation dataset is computed, and the validation dataset is used to find the probability threshold that gives the best F1-measure. This threshold is then used to find the outliers in the entire image: pixels whose probability under the fitted Gaussian is less than the threshold are marked in black as outliers. The following figures show the results; a sketch of the threshold search follows the output below.
## [1] "MLE estimate for mean from the training dataset"
##         r         g         b 
## 0.5409833 0.4860903 0.1600407
## [1] "MLE estimate for covariance matrix from the training dataset"
##            [,1]       [,2]       [,3]
## [1,] 0.04874885 0.00000000 0.00000000
## [2,] 0.00000000 0.04163396 0.00000000
## [3,] 0.00000000 0.00000000 0.01906682

## [1] "Best epsilon found using cross-validation: 2.030149e-01"
## [1] "Best F1 on Cross Validation Set:  0.774317"