The best-known algorithm for grouping observations into clusters is the k-means algorithm. We will see that this algorithm is just a variant of the EM algorithm. Given \(n\) objects, characterized by \(p\) variables, we want to partition them into \(n_c\) clusters \(\{ C_1, C_2, \ldots, C_{n_c} \}\) such that each cluster \(C_k\) has \(n_{(k)}\) members and each observation belongs to exactly one cluster. The mean vector (center, prototype), \(\mathbf{v}_k\), of cluster \(C_k\) is defined as the centroid of the cluster, and its components can be calculated by:
\[ \mathbf{v}_k (\in \mathbb{R}^p) = \left( \frac{1}{n_{(k)}} \sum_{i=1}^{n_{(k)}} x_{i1}^{(k)}, \ldots, \frac{1}{n_{(k)}} \sum_{i=1}^{n_{(k)}} x_{ip}^{(k)} \right) \]
where \(n_{(k)}\) is the number of observations in cluster \(C_k\) and \(x_i^{(k)}\) is the \(i\)-th observation belonging to cluster \(C_k\). For each of the clusters \(C_1, \ldots, C_{n_c}\), the corresponding mean vector is computed, giving \(\mathbf{V} = \{ \mathbf{v}_1, \ldots, \mathbf{v}_{n_c} \}\).
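As a small numerical illustration of this formula (the toy matrix Xk below is made up for illustration, not taken from the data used later), the centroid is simply the vector of column means of the cluster members:
## three observations of cluster k with p = 2 variables (illustrative values)
Xk <- matrix(c(1, 2, 3,
               4, 5, 6), ncol = 2)
colMeans(Xk)  # centroid v_k = (2, 5)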
We also have to specify the number of clusters in the final solution. Starting from the initial positions of the \(n_c\) cluster centroids, the algorithm uses the data to iteratively rearrange the centroids and reallocate points to the closest centroid. The process consists of the steps listed below.
k-means clustering optimizes the following objective function:
\[ J(\mathbf{X}, \mathbf{V}, \mathbf{U}) = \sum_{k=1}^{n_c} \sum_{i=1}^{n} u_{ik} d^2(\mathbf{x}_i, \mathbf{v}_k) \]
where \(\mathbf{X} = \{\mathbf{x}_1, \ldots, \mathbf{x}_n\}\) denotes the observations, \(\mathbf{V} = \{\mathbf{v}_1, \ldots, \mathbf{v}_{n_c}\}\) the cluster centers, \(\mathbf{U} = (u_{ik})\) the \(n \times n_c\) matrix of cluster memberships with \(u_{ik} = 1\) if observation \(\mathbf{x}_i\) belongs to cluster \(C_k\) and \(u_{ik} = 0\) otherwise, and \(d\) a distance measure. The algorithm then proceeds as follows:
Fix \(n_c\), \(2 \leq n_c < n\), and choose a termination tolerance \(\delta > 0\), for example 0.001.
Initialize \(\mathbf{U}^{(0)}\) randomly.
REPEAT for \(r = 1, 2, \ldots\)
E-step: Calculate the cluster centers:
\[ \mathbf{v}_k = \frac{1}{\sum_{i=1}^n u_{ik}} \left( \sum_{i=1}^{n} u_{ik} x_{i1}, \ldots, \sum_{i=1}^{n} u_{ik} x_{ip} \right), \quad \text{for } k = 1, \ldots, n_c \]
M-step: Update \(\mathbf{U}^{(r)}\): reallocate the cluster memberships by updating the values \(u_{ij}^{(r)}\):
\[ u_{ij}^{(r)} = \begin{cases} 1 & \text{if } d(\mathbf{x}_i, \mathbf{v}_j^{(r)}) = \min\limits_{1 \leq l \leq n_c} d(\mathbf{x}_i, \mathbf{v}_l^{(r)}) \\ 0 & \text{otherwise} \end{cases} \]
UNTIL \(\| \mathbf{U}^{(r)} - \mathbf{U}^{(r-1)} \| < \delta\), that is, until the cluster memberships no longer change.
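For concreteness, the objective function can be evaluated directly for a given hard partition. The following is only a small sketch (the name J_obj and its distfun argument are our own choices, not from the text); it can be combined, for example, with the Manhattan distance function defined below:
## sketch: evaluate J(X, V, U) for a hard partition U
## distfun must return the n x n_c matrix of distances d(x_i, v_k)
J_obj <- function(X, V, U, distfun){
  sum(U * distfun(X, V)^2)
}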
For the clustering algorithm we need a distance function. Here we use the Manhattan distance:
\[ d_{\text{Manhattan}}(\mathbf{x}_i, \mathbf{v}_k) = \sum_{j=1}^{p} |x_{ij} - v_{kj}| \]
distMan <- function(x, centers){
  if(is.data.frame(x)) x <- as.matrix(x)
  if(is.data.frame(centers)) centers <- as.matrix(centers)
  d <- matrix(0, nrow = nrow(x), ncol = nrow(centers))
  ## Manhattan distance from each observation to each cluster center
  for(k in 1:nrow(centers)){
    d[, k] <- colSums(abs(t(x) - centers[k, ]))
  }
  return(d)
}
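As a quick sanity check on toy values (chosen only for illustration), the Manhattan distance from the origin to the point (3, 4) should be 7:
xt <- matrix(c(0, 0,
               3, 4), ncol = 2, byrow = TRUE)  # two observations
ct <- matrix(c(0, 0), ncol = 2)                # one center at the origin
distMan(xt, ct)  # distances 0 and |3| + |4| = 7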
We also need a function to compute the cluster centers. We could use the arithmetic mean, but we can just as well use the median, as in the following example:
means <- function(x, cluster){
  cen <- matrix(NA, nrow = max(cluster), ncol = ncol(x))
  ## cluster centers (column-wise medians) for each cluster
  for(n in 1:max(cluster)){
    cen[n, ] <- apply(x[cluster == n, , drop = FALSE], 2, median)
  }
  return(cen)
}
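For example, applied to a toy matrix with two clusters of three observations each (again, illustrative values only):
## column-wise medians per cluster; row n corresponds to cluster n
means(matrix(1:12, ncol = 2), cluster = c(1, 1, 1, 2, 2, 2))
## cluster 1: (2, 8), cluster 2: (5, 11)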
We now write a function for the k-means algorithm that implements the formulas described above. To be able to show some plots that illustrate the EM approach in detail, we run the iterations in a for loop:
my_kmeans <- function(x, k, clmeans = means, distance = distMan,
                      iter = 99999, seed = NULL){
  if(!is.null(seed)) set.seed(seed)
  cent <- newcent <- x[sample(1:nrow(x), size = k), ]
  oldclust <- 0
  j <- 0
  for(i in 1:iter){ # better: while()
    j <- j + 1
    cent <- newcent
    ## M-step: assign each observation to the closest center
    dist <- distance(x, cent)
    clust <- max.col(-dist)
    ## E-step: recompute the cluster centers
    newcent <- clmeans(x, clust)
    if(all(clust == oldclust)) break
    oldclust <- clust
  }
  res <- list(centers = cent,
              cluster = factor(clust),
              iterations = j)
  return(res)
}
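As a quick check of my_kmeans before turning to real data, we can run it on simulated data with two well-separated groups (this toy example is not the data set analyzed below):
set.seed(1)
xs <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
            matrix(rnorm(100, mean = 5), ncol = 2))
res <- my_kmeans(xs, k = 2, seed = 1)
res$iterations      # number of iterations until the assignments stabilized
table(res$cluster)  # cluster sizes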
As we can see, for clustering the E-step is the fit step (recomputing the cluster centers) and the M-step is the assignment step (reallocating the observations to the nearest center).
Iterating the E- and M-steps repeatedly improves the solution step by step. This means that the objective function \( J(\mathbf{X}, \mathbf{V}, \mathbf{U}) \) becomes smaller in each iteration. The algorithm stops as soon as the cluster assignments no longer change.
Let's get some data. We want to keep things simple and display the clustering result visually, so we use the two-dimensional data set shown in Figure 9.1.
# Load data and packages
library(flexclust)
library(ggplot2)
# Load the data set
data(Nclus, package = "flexclust")
x <- data.frame(Nclus)
# Apply k-means clustering
set.seed(123) # for reproducible results
k <- 3 # number of clusters, can be adjusted
km <- kmeans(x, centers = k)
# Add the cluster assignments to the data
x$cluster <- km$cluster
# Visualize the clustering result
ggplot(x, aes(x = X1, y = X2, color = factor(cluster))) +
  geom_point(size = 3) +
  labs(title = "K-means cluster visualization",
       x = "X1",
       y = "X2",
       color = "cluster") +
  theme_minimal() +
  scale_color_brewer(palette = "Set1") # clear, distinct color palette
In the following we plot the results after iterations 1 and 2, and after convergence. Instead of our simplified implementation of the k-means algorithm, we use the default k-means implementation of R. Several variants of k-means exist; here we chose the algorithm of “MacQueen”, but only in order to explore the algorithm (the default method, “Hartigan-Wong”, converges too quickly to show the individual steps of the algorithm). Note that the k-means algorithm starts with randomly chosen cluster centers, so we have to set the random seed to ensure the same starting centers in each call to kmeans:
set.seed(123456)
cl1 <- kmeans(x, centers = 4, iter.max = 1, algorithm = "MacQueen")
## Warning: did not converge in 1 iteration
set.seed(123456)
cl2 <- kmeans(x, centers = 4, iter.max = 2, algorithm = "MacQueen")
## Warning: did not converge in 2 iterations
set.seed(123456)
cl3 <- kmeans(x, centers = 4, iter.max = 3, algorithm = "MacQueen")
## Warning: did not converge in 3 iterations
set.seed(123456)
cl4 <- kmeans(x, centers = 4, algorithm = "MacQueen")
cl1$centers
## X1 X2 cluster
## 1 1.2453209 1.8406442 1.506329
## 2 -1.0333581 5.5847846 1.798913
## 3 0.1149815 -0.7912543 1.965517
## 4 6.0215732 2.4964653 1.855895
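The plots themselves can be produced, for example, with base R graphics; the following is only a minimal sketch, not the exact code used for the figures:
## observations colored by assigned cluster, centers marked with crosses,
## for the solutions after 1, 2, and 3 iterations and after convergence
par(mfrow = c(2, 2))
for(cl in list(cl1, cl2, cl3, cl4)){
  plot(x$X1, x$X2, col = cl$cluster, pch = 19, xlab = "X1", ylab = "X2")
  points(cl$centers[, c("X1", "X2")], pch = 4, cex = 3, lwd = 3)
}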
Note that k-means only takes the cluster centers into account and works with a distance function to calculate the distances from the observations to the cluster centers. Another (often more favorable) approach is to also take the shape of the clusters into account. This is implemented in the model-based clustering framework (Fraley and Raftery 2002). The model-based procedures mostly give better clustering results (Templ, Filzmoser, and Reimann 2008), but they are computationally more complex since in each E-step the covariance matrix of each cluster must additionally be estimated.
The EM algorithm is extensively used for the imputation of missing values. Implementations can be found in van Buuren and Groothuis-Oudshoorn (2011), Schafer (1997), Templ, Alfons, and Filzmoser (2011), Raghunathan et al. (2001), and Gelman and Hill (2011). In the following we want to show how an EM algorithm generally works for this kind of problem. First we take a data set to impute; we again select the sleep data:
library("MASS")
library("robustbase")
library("VIM")
## Loading required package: colorspace
## Loading required package: grid
## VIM is ready to use.
## Suggestions and bug-reports can be submitted at: https://github.com/statistikat/VIM/issues
##
## Attaching package: 'VIM'
## The following object is masked from 'package:datasets':
##
## sleep
data("sleep")
str(sleep)
## 'data.frame': 62 obs. of 10 variables:
## $ BodyWgt : num 6654 1 3.38 0.92 2547 ...
## $ BrainWgt: num 5712 6.6 44.5 5.7 4603 ...
## $ NonD : num NA 6.3 NA NA 2.1 9.1 15.8 5.2 10.9 8.3 ...
## $ Dream : num NA 2 NA NA 1.8 0.7 3.9 1 3.6 1.4 ...
## $ Sleep : num 3.3 8.3 12.5 16.5 3.9 9.8 19.7 6.2 14.5 9.7 ...
## $ Span : num 38.6 4.5 14 NA 69 27 19 30.4 28 50 ...
## $ Gest : num 645 42 60 25 624 180 35 392 63 230 ...
## $ Pred : int 3 3 1 5 3 4 1 4 1 1 ...
## $ Exp : int 5 1 1 2 5 4 1 5 2 1 ...
## $ Danger : int 3 3 1 3 4 4 1 4 1 1 ...
Missing values are present in some of the variables, for example in the variable Sleep:
apply(sleep, 2, function(x) any(is.na(x)))
## BodyWgt BrainWgt NonD Dream Sleep Span Gest Pred
## FALSE FALSE TRUE TRUE TRUE TRUE TRUE FALSE
## Exp Danger
## FALSE FALSE
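To also see how many values are missing per variable, we can simply count them:
## number of missing values per variable
colSums(is.na(sleep))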
How can we impute the missing values in the variable Sleep? This might be done by performing regression fits. Ideally, we would already initialize the missing values with a well-performing imputation algorithm, for example, a k-nearest neighbor imputation approach (Templ, Alfons, and Filzmoser 2011). However, to see the progress of the EM algorithm, we start with a very bad initialization of the missing values in the variable Sleep, simply using a large number. Note that we also need the index of the missing values:
## index of missing values
ind <- data.frame(is.na(sleep))
## initialization
sleep <- kNN(sleep)
## Time difference of 0.04911399 secs
## overwrite missing initialization with bad choice
sleep$Sleep[ind$Sleep] <- 2240 # bad initialization
## initialized missing values in variable sleep
sleep$Sleep[ind$Sleep]
## [1] 2240 2240 2240 2240
We then fit a regression model for this variable. The model results (the regression coefficients) are then used to predict the missing values:
## E-step (1)
lm1 <- lm(Sleep ~ log(BodyWgt) + log(BrainWgt) + NonD + Danger, data =
sleep)
## M-step (1)
sleep$Sleep[ind$Sleep] <- predict(lm1)[ind$Sleep]
## print of updated missing values
sleep$Sleep[ind$Sleep]
## [1] 469.5127 559.9771 408.6845 229.0985
We may continue with the second iteration:
## E-step (2)
lm1 <- lm(Sleep ~ log(BodyWgt) + log(BrainWgt) + NonD + Danger, data =
sleep)
## M-step (2)
sleep$Sleep[ind$Sleep] <- predict(lm1)[ind$Sleep]
## print of updated missing values
sleep$Sleep[ind$Sleep]
## [1] 101.9265 121.6146 90.1618 48.7181
We see that the values still change a lot. Let's iterate until the values change by less than a very small threshold. We can write a small function for the imputation of the variable Sleep:
EMregSleep <- function(method = lm, eps = 0.001, init = "bad"){
  ## index of missing values
  ind <- is.na(sleep)
  colnames(ind) <- colnames(sleep)
  indsleep <- ind[, "Sleep"]
  ## initialization
  if(init == "bad"){
    sleep <- kNN(sleep, imp_var = FALSE)
    sleep$Sleep[indsleep] <- 2240 # bad initialization
  }
  if(init == "worst"){
    sleep[ind] <- 2240 # worst initialization
  }
  iteration <- 0
  criteria <- 99999
  while(criteria > eps){
    iteration <- iteration + 1
    prev_sol <- sleep$Sleep[indsleep]
    ## E-step
    lm1 <- method(Sleep ~ log(BodyWgt) + log(BrainWgt) + NonD + Danger,
                  data = sleep)
    ## M-step
    sleep$Sleep[indsleep] <- predict(lm1)[indsleep]
    criteria <- sqrt(sum((prev_sol - sleep$Sleep[indsleep])^2))
  }
  res <- list(imputed = sleep,
              iteration = iteration,
              lastmodel = lm1)
  return(res)
}
Again we load the data set sleep, impute it and look at the imputed values in the variable Sleep:
data("sleep")
sleepImp <- EMregSleep()
## Time difference of 0.179677 secs
missVals <- sleepImp$imputed$Sleep[ind$Sleep]
missVals
## [1] 3.845778 13.122764 3.658173 16.975766
sleepImp$iteration
## [1] 11
However, we imputed with expected values, which leads to a decrease in variance since we did not account for the uncertainty and distribution of the missing values (compare the variance estimation with missing values in Chapter 8, Applications of Resampling Methods and Monte Carlo Tests). To account for this variance, we sample residuals (compare the bootstrapping residuals regression approach in Chapter 8, Applications of Resampling Methods and Monte Carlo Tests):
missVals + sample(residuals(sleepImp$lastmodel), length(missVals))
## 7 45 54 36
## 4.122294 13.199528 3.844590 17.827567
Previously we saw that we needed 11 iterations. In addition, the OLS regression model might be influenced by outliers, so it is better to replace it with a robust method, which already gives good results after the first iteration (not shown here). We get slightly different results that are usually more trustworthy than those of non-robust methods (Templ, Kowarik, and Filzmoser 2011). The OLS results may become corrupted, especially when there are outliers in the predictors as well. However, we see that even with the worst initialization (which also puts huge outliers into the predictors) the results look fine (although we prefer the robust method anyhow):
data("sleep")
## OLS regression
lm_ols <- EMregSleep(method = lm, init = "worst")
## M-estimation
lm_rlm <- EMregSleep(method = rlm, init = "worst", eps= 0.01)
lm_ols$imputed$Sleep[ind[, "Sleep"]]
## [1] 4.239191 8.169014 4.368256 13.775087
lm_rlm$imputed$Sleep[ind[, "Sleep"]]
## [1] 3.766792 7.788943 3.925772 13.700029
From these numbers we can see that the OLS results are highly influenced by the outliers. Compared to the previous estimates (using the bad initialization, not the worst one), the imputed values are too high. This is not as extreme when using an M-estimator, but compared to the implementation in the function irmi (see the following example) we underestimate the second and fourth values.
We have discussed how to impute one variable, but in general we want to impute all variables in a data set. The data set may consist not only of continuous variables but of a mix of continuous, semi-continuous, categorical, binary, and/or count variables. The robust EM-based imputation accounts for this (and more, such as specifying a model for each variable) and is implemented in the function irmi (Templ, Kowarik, and Filzmoser 2011) in the R package VIM (Templ, Alfons, and Filzmoser 2011):
data("sleep")
sleepImp <- irmi(sleep)
## Time difference of 0.03798294 secs
sleepImp[ind[, "Sleep"], "Sleep"]
## [1] 3.748899 10.089591 3.156300 17.085060
We see that this is very close to the initial solution where we used a better initialization of the missing values. This is an indication that irmi works well. We may also use another method, such as mice (like irmi, it is usually used for multiple imputation):
library("mice")
##
## Attaching package: 'mice'
## The following objects are masked from 'package:flexclust':
##
## bwplot, densityplot
## The following object is masked from 'package:stats':
##
## filter
## The following objects are masked from 'package:base':
##
## cbind, rbind
## Loading required package: Rcpp
## mice 2.25 2015-11-09
data("sleep")
em_mice <- mice(sleep, m = 1)
##
## iter imp variable
## 1 1 NonD Dream Sleep Span Gest
## 2 1 NonD Dream Sleep Span Gest
## 3 1 NonD Dream Sleep Span Gest
## 4 1 NonD Dream Sleep Span Gest
## 5 1 NonD Dream Sleep Span Gest
## Warning: Number of logged events: 2
em_mice$imp$Sleep
## 1
## 21 12.8
## 31 9.8
## 41 2.6
## 62 11.2
Now let's check what happens when huge outliers (the worst initialization from before) are present in the predictors:
sleep[is.na(sleep)] <- 2240
sleep$Sleep[ind[, "Sleep"]] <- NA
em_mice <- mice(sleep, m = 1)
##
## iter imp variable
## 1 1 Sleep
## 2 1 Sleep
## 3 1 Sleep
## 4 1 Sleep
## 5 1 Sleep
## Warning: Number of logged events: 5
em_mice$imp$Sleep
## 1
## 21 3.9
## 31 2.6
## 41 3.1
## 62 13.8
We see that we get a completely different result as soon as outliers are present in the data set. This is also true for other implementations of the EM algorithm that are only suitable when the data set is approximately multivariate normal. As soon as this assumption is violated (as is typically the case in practice), irmi might be a good choice. Note that many other parameters can be specified in irmi (not discussed here):
args(irmi)
## function (x, eps = 5, maxit = 100, mixed = NULL, mixed.constant = NULL,
## count = NULL, step = FALSE, robust = FALSE, takeAll = TRUE,
## noise = TRUE, noise.factor = 1, force = FALSE, robMethod = "MM",
## force.mixed = TRUE, mi = 1, addMixedFactors = FALSE, trace = FALSE,
## init.method = "kNN", modelFormulas = NULL, multinom.method = "multinom",
## imp_var = TRUE, imp_suffix = "imp")
## NULL
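For example, the robust variant discussed above can be requested with the robust argument (only a sketch; which settings are appropriate depends on the application, see the package help for details):
## robust (M-estimation based) EM imputation of the sleep data
data("sleep", package = "VIM")
sleepImpRob <- irmi(sleep, robust = TRUE)
sleepImpRob[ind[, "Sleep"], "Sleep"]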
The EM algorithm is a computational approach to finding maximum likelihood estimates. Basically, the EM algorithm consists of two steps: the E-step, which estimates the expected values of the unobserved quantities given the current parameters, and the M-step, which maximizes the likelihood and updates the parameters. The algorithm usually converges quickly and is applied in many areas. In this chapter we saw its application in two areas: clustering and the imputation of missing values.
Clustering is an NP-hard problem; loosely speaking, we cannot find the exact closed-form solution in a reasonable amount of time. The EM algorithm is therefore necessary to iteratively find a good solution. In clustering, the EM algorithm is used in the k-means clustering algorithm, but also (not shown in this chapter) in model-based clustering and for mixture models in general. Missing values occur frequently in data sets in practice. Data scientists probably spend a large part of their time on data pre-processing, and thus they also have to impute missing values. We saw that the EM algorithm is a central tool for this task.