The best-known algorithm for grouping observations into clusters is the k-means algorithm. We will see that this algorithm is just a variant of the EM algorithm. Given \(n\) objects, characterized by \(p\) variables, we want to partition them into \(n_c\) clusters \(\{ C_1, C_2, \ldots, C_{n_c} \}\) such that each cluster \(C_k\) has \(n_{(k)}\) members and each observation belongs to exactly one cluster. The mean vector (center, prototype), \(\mathbf{v}_k\), of cluster \(C_k\) is defined as the centroid of the cluster, and its components can be calculated by:
\[ \mathbf{v}_k (\in \mathbb{R}^p) = \left( \frac{1}{n_{(k)}} \sum_{i=1}^{n_{(k)}} x_{i1}^{(k)}, \ldots, \frac{1}{n_{(k)}} \sum_{i=1}^{n_{(k)}} x_{ip}^{(k)} \right) \]
where \(n_{(k)}\) is the number of observations in cluster \(C_k\) and \(x_i^{(k)}\) is the \(i\)-th observation belonging to cluster \(C_k\). For each of the clusters \(C_1, \ldots, C_{n_c}\), the corresponding mean vector is computed, giving \(\mathbf{V} = \{ \mathbf{v}_1, \ldots, \mathbf{v}_{n_c} \}\).
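As a small numerical illustration of this formula (the toy matrix Xk below is made up for illustration, not taken from the data used later), the centroid is simply the vector of column means of the cluster members:
## three observations of cluster k with p = 2 variables (illustrative values)
Xk <- matrix(c(1, 2, 3,
               4, 5, 6), ncol = 2)
colMeans(Xk)  # centroid v_k = (2, 5)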
We also have to specify the number of clusters in the final solution. Starting from the initial positions of the \(n_c\) cluster centroids, the algorithm uses the data to iteratively rearrange the centroids and reallocate points to the closest centroid. The process consists of the steps listed below.
k-means clustering optimizes the following objective function:
\[ J(\mathbf{X}, \mathbf{V}, \mathbf{U}) = \sum_{k=1}^{n_c} \sum_{i=1}^{n} u_{ik} d^2(\mathbf{x}_i, \mathbf{v}_k) \]
where \(\mathbf{X} = \{\mathbf{x}_1, \ldots, \mathbf{x}_n\}\) denotes the observations, \(\mathbf{V} = \{\mathbf{v}_1, \ldots, \mathbf{v}_{n_c}\}\) the cluster centers, \(\mathbf{U} = (u_{ik})\) the \(n \times n_c\) matrix of cluster memberships with \(u_{ik} = 1\) if observation \(\mathbf{x}_i\) belongs to cluster \(C_k\) and \(u_{ik} = 0\) otherwise, and \(d\) a distance measure. The algorithm then proceeds as follows:
Fix \(n_c\), \(2 \leq n_c < n\), and choose a termination tolerance \(\delta > 0\), for example 0.001.
Initialize \(\mathbf{U}^{(0)}\) randomly.
REPEAT for \(r = 1, 2, \ldots\)
E-step: Calculate the cluster centers:
\[ \mathbf{v}_k = \frac{1}{\sum_{i=1}^n u_{ik}} \left( \sum_{i=1}^{n} u_{ik} x_{i1}, \ldots, \sum_{i=1}^{n} u_{ik} x_{ip} \right), \quad \text{for } k = 1, \ldots, n_c \]
M-step: Update \(\mathbf{U}^{(r)}\): reallocate the cluster memberships by updating the values \(u_{ij}^{(r)}\):
\[ u_{ij}^{(r)} = \begin{cases} 1 & \text{if } d(\mathbf{x}_i, \mathbf{v}_j^{(r)}) = \min\limits_{1 \leq l \leq n_c} d(\mathbf{x}_i, \mathbf{v}_l^{(r)}) \\ 0 & \text{otherwise} \end{cases} \]
UNTIL \(\| \mathbf{U}^{(r)} - \mathbf{U}^{(r-1)} \| < \delta\), that is, until the cluster memberships no longer change.
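For concreteness, the objective function can be evaluated directly for a given hard partition. The following is only a small sketch (the name J_obj and its distfun argument are our own choices, not from the text); it can be combined, for example, with the Manhattan distance function defined below:
## sketch: evaluate J(X, V, U) for a hard partition U
## distfun must return the n x n_c matrix of distances d(x_i, v_k)
J_obj <- function(X, V, U, distfun){
  sum(U * distfun(X, V)^2)
}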
For the clustering algorithm we need a distance function. Here we use the Manhattan distance:
\[ d_{\text{Manhattan}}(\mathbf{x}_i, \mathbf{v}_k) = \sum_{j=1}^{p} |x_{ij} - v_{kj}| \]
distMan <- function(x, centers){
  if(is.data.frame(x)) x <- as.matrix(x)
  if(is.data.frame(centers)) centers <- as.matrix(centers)
  d <- matrix(0, nrow = nrow(x), ncol = nrow(centers))
  ## Manhattan distance from each observation to each cluster center
  for(k in 1:nrow(centers)){
    d[, k] <- colSums(abs(t(x) - centers[k, ]))
  }
  return(d)
}
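As a quick sanity check on toy values (chosen only for illustration), the Manhattan distance from the origin to the point (3, 4) should be 7:
xt <- matrix(c(0, 0,
               3, 4), ncol = 2, byrow = TRUE)  # two observations
ct <- matrix(c(0, 0), ncol = 2)                # one center at the origin
distMan(xt, ct)  # distances 0 and |3| + |4| = 7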
We also need a function to compute the cluster centers. We could use the arithmetic mean, but we can just as well use the median, as in the following example:
means <- function(x, cluster){
  cen <- matrix(NA, nrow = max(cluster), ncol = ncol(x))
  ## cluster centers (column-wise medians) for each cluster
  for(n in 1:max(cluster)){
    cen[n, ] <- apply(x[cluster == n, , drop = FALSE], 2, median)
  }
  return(cen)
}
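For example, applied to a toy matrix with two clusters of three observations each (again, illustrative values only):
## column-wise medians per cluster; row n corresponds to cluster n
means(matrix(1:12, ncol = 2), cluster = c(1, 1, 1, 2, 2, 2))
## cluster 1: (2, 8), cluster 2: (5, 11)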
We now write a function for the k-means algorithm that implements the formulas described above. To be able to show some plots that illustrate the EM approach in detail, we run the iterations in a for loop:
my_kmeans <- function(x, k, clmeans = means, distance = distMan,
                      iter = 99999, seed = NULL){
  if(!is.null(seed)) set.seed(seed)
  cent <- newcent <- x[sample(1:nrow(x), size = k), ]
  oldclust <- 0
  j <- 0
  for(i in 1:iter){ # better: while()
    j <- j + 1
    cent <- newcent
    ## M-step: assign each observation to the closest center
    dist <- distance(x, cent)
    clust <- max.col(-dist)
    ## E-step: recompute the cluster centers
    newcent <- clmeans(x, clust)
    if(all(clust == oldclust)) break
    oldclust <- clust
  }
  res <- list(centers = cent,
              cluster = factor(clust),
              iterations = j)
  return(res)
}
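As a quick check of my_kmeans before turning to real data, we can run it on simulated data with two well-separated groups (this toy example is not the data set analyzed below):
set.seed(1)
xs <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
            matrix(rnorm(100, mean = 5), ncol = 2))
res <- my_kmeans(xs, k = 2, seed = 1)
res$iterations      # number of iterations until the assignments stabilized
table(res$cluster)  # cluster sizes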
As we can see, for clustering the E-step is the fit step (recomputing the cluster centers) and the M-step is the assignment step (reallocating the observations to the nearest center).
Iterating the E- and M-steps repeatedly improves the solution step by step. This means that the objective function \( J(\mathbf{X}, \mathbf{V}, \mathbf{U}) \) becomes smaller in each iteration. The algorithm stops as soon as the cluster assignments no longer change.
Let's get some data. We want to keep things simple and display the clustering result visually, so we use the two-dimensional data set shown in Figure 9.1.
# Load data and packages
library(flexclust)
library(ggplot2)
# Load the data set
data(Nclus, package = "flexclust")
x <- data.frame(Nclus)
# Apply k-means clustering
set.seed(123) # for reproducible results
k <- 3 # number of clusters, can be adjusted
km <- kmeans(x, centers = k)
# Add the cluster assignments to the data
x$cluster <- km$cluster
# Visualize the clustering result
ggplot(x, aes(x = X1, y = X2, color = factor(cluster))) +
  geom_point(size = 3) +
  labs(title = "K-means cluster visualization",
       x = "X1",
       y = "X2",
       color = "cluster") +
  theme_minimal() +
  scale_color_brewer(palette = "Set1") # clear, distinct color palette
In the following we plot the results after iterations 1 and 2, and after convergence. Instead of our simplified implementation of the k-means algorithm, we use the default k-means implementation of R. Several variants of k-means exist; here we chose the algorithm of “MacQueen”, but only in order to explore the algorithm (the default method, “Hartigan-Wong”, converges too quickly to show the individual steps of the algorithm). Note that the k-means algorithm starts with randomly chosen cluster centers, so we have to set the random seed to ensure the same starting centers in each call to kmeans:
set.seed(123456)
cl1 <- kmeans(x, centers = 4, iter.max = 1, algorithm = "MacQueen")
## Warning: did not converge in 1 iteration
set.seed(123456)
cl2 <- kmeans(x, centers = 4, iter.max = 2, algorithm = "MacQueen")
## Warning: did not converge in 2 iterations
set.seed(123456)
cl3 <- kmeans(x, centers = 4, iter.max = 3, algorithm = "MacQueen")
## Warning: did not converge in 3 iterations
set.seed(123456)
cl4 <- kmeans(x, centers = 4, algorithm = "MacQueen")
cl1$centers
## X1 X2 cluster
## 1 1.2453209 1.8406442 1.506329
## 2 -1.0333581 5.5847846 1.798913
## 3 0.1149815 -0.7912543 1.965517
## 4 6.0215732 2.4964653 1.855895
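The plots themselves can be produced, for example, with base R graphics; the following is only a minimal sketch, not the exact code used for the figures:
## observations colored by assigned cluster, centers marked with crosses,
## for the solutions after 1, 2, and 3 iterations and after convergence
par(mfrow = c(2, 2))
for(cl in list(cl1, cl2, cl3, cl4)){
  plot(x$X1, x$X2, col = cl$cluster, pch = 19, xlab = "X1", ylab = "X2")
  points(cl$centers[, c("X1", "X2")], pch = 4, cex = 3, lwd = 3)
}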
Note that k-means only takes the cluster centers into account and works with a distance function to calculate the distances from the observations to the cluster centers. Another (often more favorable) approach is to also take the shape of the clusters into account. This is implemented in the model-based clustering framework (Fraley and Raftery 2002). The model-based procedures mostly give better clustering results (Templ, Filzmoser, and Reimann 2008), but they are computationally more complex since in each E-step the covariance matrix of each cluster must additionally be estimated.
The EM algorithm is extensively used for the imputation of missing values. Implementations can be found in van Buuren and Groothuis-Oudshoorn (2011), Schafer (1997), Templ, Alfons, and Filzmoser (2011), Raghunathan et al. (2001), and Gelman and Hill (2011). In the following we want to show how an EM algorithm generally works for this kind of problem. First we take a data set to impute; we again select the sleep data:
library("MASS")
library("robustbase")
library("VIM")
## Loading required package: colorspace
## Loading required package: grid
## VIM is ready to use.
## Suggestions and bug-reports can be submitted at: https://github.com/statistikat/VIM/issues
##
## Attaching package: 'VIM'
## The following object is masked from 'package:datasets':
##
## sleep
data("sleep")
str(sleep)
## 'data.frame': 62 obs. of 10 variables:
## $ BodyWgt : num 6654 1 3.38 0.92 2547 ...
## $ BrainWgt: num 5712 6.6 44.5 5.7 4603 ...
## $ NonD : num NA 6.3 NA NA 2.1 9.1 15.8 5.2 10.9 8.3 ...
## $ Dream : num NA 2 NA NA 1.8 0.7 3.9 1 3.6 1.4 ...
## $ Sleep : num 3.3 8.3 12.5 16.5 3.9 9.8 19.7 6.2 14.5 9.7 ...
## $ Span : num 38.6 4.5 14 NA 69 27 19 30.4 28 50 ...
## $ Gest : num 645 42 60 25 624 180 35 392 63 230 ...
## $ Pred : int 3 3 1 5 3 4 1 4 1 1 ...
## $ Exp : int 5 1 1 2 5 4 1 5 2 1 ...
## $ Danger : int 3 3 1 3 4 4 1 4 1 1 ...
Missing values are present in some of the variables, for example in the variable Sleep:
apply(sleep, 2, function(x) any(is.na(x)))
## BodyWgt BrainWgt NonD Dream Sleep Span Gest Pred
## FALSE FALSE TRUE TRUE TRUE TRUE TRUE FALSE
## Exp Danger
## FALSE FALSE
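To also see how many values are missing per variable, we can simply count them:
## number of missing values per variable
colSums(is.na(sleep))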
How can we impute the missing values in the variable Sleep? This might be done by performing regression fits. Ideally, we would already initialize the missing values with a well-performing imputation algorithm, for example, a k-nearest neighbor imputation approach (Templ, Alfons, and Filzmoser 2011). However, to see the progress of the EM algorithm, we start with a very bad initialization of the missing values in the variable Sleep, simply using a large number. Note that we also need the index of the missing values:
## index of missing values
ind <- data.frame(is.na(sleep))
## initialization
sleep <- kNN(sleep)
## Time difference of 0.04911399 secs
## overwrite missing initialization with bad choice
sleep$Sleep[ind$Sleep] <- 2240 # bad initialization
## initialized missing values in variable sleep
sleep$Sleep[ind$Sleep]
## [1] 2240 2240 2240 2240
We then fit a regression model for this variable. The model results (the regression coefficients) are then used to predict the missing values:
## E-step (1)
lm1 <- lm(Sleep ~ log(BodyWgt) + log(BrainWgt) + NonD + Danger, data =
sleep)
## M-step (1)
sleep$Sleep[ind$Sleep] <- predict(lm1)[ind$Sleep]
## print of updated missing values
sleep$Sleep[ind$Sleep]
## [1] 469.5127 559.9771 408.6845 229.0985
We may continue with the second iteration:
## E-step (2)
lm1 <- lm(Sleep ~ log(BodyWgt) + log(BrainWgt) + NonD + Danger, data =
sleep)
## M-step (2)
sleep$Sleep[ind$Sleep] <- predict(lm1)[ind$Sleep]
## print of updated missing values
sleep$Sleep[ind$Sleep]
## [1] 101.9265 121.6146 90.1618 48.7181
We see that the values still change a lot. Let's iterate until the values change by less than a very small threshold. We can write a small function for the imputation of the variable Sleep:
EMregSleep <- function(method = lm, eps = 0.001, init = "bad"){
  ## index of missing values
  ind <- is.na(sleep)
  colnames(ind) <- colnames(sleep)
  indsleep <- ind[, "Sleep"]
  ## initialization
  if(init == "bad"){
    sleep <- kNN(sleep, imp_var = FALSE)
    sleep$Sleep[indsleep] <- 2240 # bad initialization
  }
  if(init == "worst"){
    sleep[ind] <- 2240 # worst initialization
  }
  iteration <- 0
  criteria <- 99999
  while(criteria > eps){
    iteration <- iteration + 1
    prev_sol <- sleep$Sleep[indsleep]
    ## E-step
    lm1 <- method(Sleep ~ log(BodyWgt) + log(BrainWgt) + NonD + Danger,
                  data = sleep)
    ## M-step
    sleep$Sleep[indsleep] <- predict(lm1)[indsleep]
    criteria <- sqrt(sum((prev_sol - sleep$Sleep[indsleep])^2))
  }
  res <- list(imputed = sleep,
              iteration = iteration,
              lastmodel = lm1)
  return(res)
}
Again we load the data set sleep, impute it and look at the imputed values in the variable Sleep:
data("sleep")
sleepImp <- EMregSleep()
## Time difference of 0.179677 secs
missVals <- sleepImp$imputed$Sleep[ind$Sleep]
missVals
## [1] 3.845778 13.122764 3.658173 16.975766
sleepImp$iteration
## [1] 11
However, we imputed with expected values, which leads to a decrease in variance since we did not account for the uncertainty and distribution of the missing values (compare the variance estimation with missing values in Chapter 8, Applications of Resampling Methods and Monte Carlo Tests). To account for this variance, we sample residuals (compare the bootstrapping residuals regression approach in Chapter 8, Applications of Resampling Methods and Monte Carlo Tests):
missVals + sample(residuals(sleepImp$lastmodel), length(missVals))
## 7 45 54 36
## 4.122294 13.199528 3.844590 17.827567
Previously we saw that we needed 11 iterations. In addition, the OLS regression model might be influenced by outliers, so it is better to replace it with a robust method, which already gives good results after the first iteration (not shown here). We get slightly different results that are usually more trustworthy than those of non-robust methods (Templ, Kowarik, and Filzmoser 2011). The OLS results may become corrupted, especially when there are outliers in the predictors as well. However, we see that even with the worst initialization (which also puts huge outliers into the predictors) the results look fine (although we prefer the robust method anyhow):
data("sleep")
## OLS regression
lm_ols <- EMregSleep(method = lm, init = "worst")
## M-estimation
lm_rlm <- EMregSleep(method = rlm, init = "worst", eps= 0.01)
lm_ols$imputed$Sleep[ind[, "Sleep"]]
## [1] 4.239191 8.169014 4.368256 13.775087
lm_rlm$imputed$Sleep[ind[, "Sleep"]]
## [1] 3.766792 7.788943 3.925772 13.700029
From these numbers we can see that the OLS results are highly influenced by the outliers. Compared to the previous estimates (using the bad initialization, not the worst one), the imputed values are too high. This is not as extreme when using an M-estimator, but compared to the implementation in the function irmi (see the following example) we underestimate the second and fourth values.
We have discussed how to impute one variable, but in general we want to impute all variables in a data set. The data set may consist not only of continuous variables but of a mix of continuous, semi-continuous, categorical, binary, and/or count variables. The robust EM-based imputation accounts for this (and more, such as specifying a model for each variable) and is implemented in the function irmi (Templ, Kowarik, and Filzmoser 2011) in the R package VIM (Templ, Alfons, and Filzmoser 2011):
data("sleep")
sleepImp <- irmi(sleep)
## Time difference of 0.03798294 secs
sleepImp[ind[, "Sleep"], "Sleep"]
## [1] 3.748899 10.089591 3.156300 17.085060
We see that this is very close to the initial solution where we used a better initialization of the missing values. This is an indication that irmi works well. We may also use another method, such as mice (like irmi, it is usually used for multiple imputation):
library("mice")
##
## Attaching package: 'mice'
## The following objects are masked from 'package:flexclust':
##
## bwplot, densityplot
## The following object is masked from 'package:stats':
##
## filter
## The following objects are masked from 'package:base':
##
## cbind, rbind
## Loading required package: Rcpp
## mice 2.25 2015-11-09
data("sleep")
em_mice <- mice(sleep, m = 1)
##
## iter imp variable
## 1 1 NonD Dream Sleep Span Gest
## 2 1 NonD Dream Sleep Span Gest
## 3 1 NonD Dream Sleep Span Gest
## 4 1 NonD Dream Sleep Span Gest
## 5 1 NonD Dream Sleep Span Gest
## Warning: Number of logged events: 2
em_mice$imp$Sleep
## 1
## 21 12.8
## 31 9.8
## 41 2.6
## 62 11.2
Now let's check what happens when huge outliers (the worst initialization from before) are present in the predictors:
sleep[is.na(sleep)] <- 2240
sleep$Sleep[ind[, "Sleep"]] <- NA
em_mice <- mice(sleep, m = 1)
##
## iter imp variable
## 1 1 Sleep
## 2 1 Sleep
## 3 1 Sleep
## 4 1 Sleep
## 5 1 Sleep
## Warning: Number of logged events: 5
em_mice$imp$Sleep
## 1
## 21 3.9
## 31 2.6
## 41 3.1
## 62 13.8
We see that we get a completely different result as soon as outliers are present in the data set. This is also true for other implementations of the EM algorithm that are only suitable when the data set is approximately multivariate normal. As soon as this assumption is violated (as is typically the case in practice), irmi might be a good choice. Note that many other parameters can be specified in irmi (not discussed here):
args(irmi)
## function (x, eps = 5, maxit = 100, mixed = NULL, mixed.constant = NULL,
## count = NULL, step = FALSE, robust = FALSE, takeAll = TRUE,
## noise = TRUE, noise.factor = 1, force = FALSE, robMethod = "MM",
## force.mixed = TRUE, mi = 1, addMixedFactors = FALSE, trace = FALSE,
## init.method = "kNN", modelFormulas = NULL, multinom.method = "multinom",
## imp_var = TRUE, imp_suffix = "imp")
## NULL
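For example, the robust variant discussed above can be requested with the robust argument (only a sketch; which settings are appropriate depends on the application, see the package help for details):
## robust (M-estimation based) EM imputation of the sleep data
data("sleep", package = "VIM")
sleepImpRob <- irmi(sleep, robust = TRUE)
sleepImpRob[ind[, "Sleep"], "Sleep"]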
The EM algorithm is a computational approach to finding maximum likelihood estimates. Basically, the EM algorithm consists of two steps: the E-step, which estimates the expected values of the unobserved quantities given the current parameters, and the M-step, which maximizes the likelihood and updates the parameters. The algorithm usually converges quickly and is applied in many areas. In this chapter we saw its application in two areas: clustering and the imputation of missing values.
Clustering is an NP-hard problem; loosely speaking, we cannot find the exact closed-form solution in a reasonable amount of time. The EM algorithm is therefore necessary to iteratively find a good solution. In clustering, the EM algorithm is used in the k-means clustering algorithm, but also (not shown in this chapter) in model-based clustering and for mixture models in general. Missing values occur frequently in data sets in practice. Data scientists probably spend a large part of their time on data pre-processing, and thus they also have to impute missing values. We saw that the EM algorithm is a central tool for this task.