K-Means Clustering Investigation

Introduction

Clustering is one of the important component of unsupervised machine learning. The primary object of this research was to comprehend the logic and implementation of K-Means algorithm, clustering histograms through dataset and image segmentation or clustering on time series data. The modeling on handwritten digits from built in datasets were tested and implemented on mnist training dataset. A final matrix of 25 random digits was printed.

DataSet

Data was first tested on Based R built-in handwritten digits then model was implemented on mnist training dataset with 60,000 handwritten digits examples.

Training Set source: http://yann.lecun.com/exdb/mnist/

Models and Techniques

The wasserstein distance metric between probability distributions, also known as Earth Mover’s Distance (EMD), is a form of optimal transport distance representing the “minimal amount” of work to transport something.

Conceptually, EMD can be viewed as an analogy between transferring weights or boxes. One can view each probability distribution as a histogram where the height bar is associated with a big and a corresponding weight and coordinate in multidimensional vector space. Consequently, measuring the distance between grayscale images is equivalent to measuring the distance between two histogram heights where the cost of transforming one histogram into another is calculated via histogram weights and coordinates.
Transforming a first histogram into a second one involves moving weights from the “bins” of one histogram into the “bins” of the second. In order to minimize the total “distance traveled” or the total amount of weight transferred, an optimization problem must be defined whose discrete formulation leads to the transportation theory and Wasserstein distance formula.

As described above, this newly defined measure of distances is referred to as Earth Mover’s Distance in the field of image processing, Word Mover’s Distance (WMD) in the field of text processing, and Wasserstein Distance in the field of Optimization Theory.

The K-means clustering algorithm leverages the Wasserstein metrics by using a sparse discrete point set to cluster denser or continuous distributional data with respect to the Wasserstein distance between the original data and the sparse representation. This is equivalent to finding the Wasserstein barycenter of a single distribution.

Code and Result

Below are the packages that we will be using:

library(Barycenter)

## Warning: package 'Barycenter' was built under R version 4.0.5

## Loading required package: Rcpp

## Warning: package 'Rcpp' was built under R version 4.0.5

library(Rcpp)
library(HistDAWass)

## Warning: package 'HistDAWass' was built under R version 4.0.5

Here is a sample of visualizing 5 different handwritten images of the number “8”

image(eight[[1]])

image(eight[[2]])

image(eight[[3]])

image(eight[[4]])

image(eight[[5]])

Now we will compute the barycenter of these images and visualize the barycenter using Barycenter::WaBarycenter() function

barycenter <- WaBarycenter(eight)

##    user  system elapsed 
##   69.98    0.11   70.24

image(barycenter)

Below is the EMD clustering algorithm using Wasserstein distances on the MNIST training data set:

to.read = file("C:\\Users\\liulu\\Downloads\\train-images.idx3-ubyte", 'rb')

# create lists of histogram distributions
img_mnist = vector("list",25)
for (i in 1:25) {
  m = matrix(readBin(to.read,integer(), size=1, n=28*28, endian="big"),28,28)
  img_mnist[[i]] = data2hist(m, type = "regular")
  image(m[,28:1])
}

to.read = file("C:\\Users\\liulu\\Downloads\\train-images.idx3-ubyte", 'rb')
readBin(to.read, integer(), n=4, endian="big")

## [1]  2051 60000    28    28

par(mfrow=c(5,5))
par(mar=c(0,0,0,0))
for(i in 1:25){m = matrix(readBin(to.read,integer(), size=1, n=28*28, endian="big"),28,28);image(m[,28:1])}

# combine separate lists into a matrix of histogram objects
mymat = new("MatH", nrows=25, ncols=1, ListOfDist=img_mnist, names.rows=c(1:25), names.cols="density")
WH_adaptive.kmeans(mymat, k=3)

## $IDX
##  [1] 3 3 3 1 2 3 1 3 1 2 2 1 3 3 1 2 3 3 2 2 2 2 2 1 2
## 
## $cardinality
## IDX
## Cl.1 Cl.2 Cl.3 
##    6   10    9 
## 
## $proto
## a matrix of distributions 
##  1  variables  3  rows 
##  each distibution in the cell is represented by the mean and the standard deviation 
##                 X1             
## I1 [m= -0.21688  ,s= 17.214 ] 
## I2 [m= -0.031778  ,s= 22.594 ]
## I3  [m= -1.4529  ,s= 26.202 ] 
## 
## $weights
##               Cl.1 Cl.2 Cl.3
## density(Cen.)    1    1    1
## density(Var.)    1    1    1
## 
## $Crit
## [1] 341.9296
## 
## $TOTSSQ
## [1] 1041.913
## 
## $BSQ
## [1] 699.9838
## 
## $WSQ
## [1] 341.9296
## 
## $quality
## [1] 0.6718253

Future Goal

In real life, scientists cluster time series for different types of COVID time series data. On the other hand, image segmentation groups foreground and background by similarities. We were not able to implement image segmentation step nor EMD-clustering for time series data due to time constraints. Our future goal is using public COVID time series data we found online to find major grouping patterns and prototype trends based on predetermined conditions.