Probably the most famous algorithm for clustering observations into groups is the k-means algorithm. We will see that this algorithm is just a variant of the EM algorithm. Given \(n\) objects, characterized by \(p\) variables, we would like to partition them into \(n_c\) clusters \(\{C_1, C_2, \ldots, C_{n_c}\}\) such that cluster \(C_k\) has \(n_{(k)}\) members and each observation belongs to exactly one cluster. The mean vector (center, prototype) \(\mathbf{v}_k\) of a cluster \(C_k\) is defined as the centroid of the cluster, and its components can be calculated by:
\[ \mathbf{v}_k = \left( \frac{1}{n_{(k)}} \sum_{i=1}^{n_{(k)}} x_{i1}^{(k)}, \ldots, \frac{1}{n_{(k)}} \sum_{i=1}^{n_{(k)}} x_{ip}^{(k)} \right) \in \mathbb{R}^p \]
where \(n_{(k)}\) is the number of observations in cluster \(C_k\) and \(\mathbf{x}_i^{(k)}\) is the \(i\)-th observation belonging to cluster \(C_k\). For each cluster \(C_1, \ldots, C_{n_c}\) the corresponding cluster means \(\mathbf{V} = \{\mathbf{v}_1, \ldots, \mathbf{v}_{n_c}\}\) are calculated.
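For illustration (a minimal sketch, not part of the original example), the centroid of a single cluster stored as a numeric matrix is simply the vector of columnwise means:
## hypothetical cluster C_k with n_(k) = 4 observations and p = 3 variables
set.seed(1)
Ck <- matrix(rnorm(12), ncol = 3)
## the centroid v_k: componentwise arithmetic means
v_k <- colMeans(Ck)
v_k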
We also need to determine the number of clusters for the output partition. Starting from given initial locations of the \(n_c\) cluster centroids, the algorithm uses the data points to iteratively relocate the centroids and reallocate points to the closest centroid. In doing so, it minimizes the objective function
\[ J(\mathbf{X}, \mathbf{V}, \mathbf{U}) = \sum_{k=1}^{n_c} \sum_{i=1}^{n} u_{ik} \, d^2(\mathbf{x}_i, \mathbf{v}_k) \]
where \(\mathbf{X} = \{\mathbf{x}_1, \ldots, \mathbf{x}_n\}\) is the data set with \(n\) observations and \(p\) variables, \(\mathbf{V} = \{\mathbf{v}_1, \ldots, \mathbf{v}_{n_c}\}\) is the matrix of cluster centers (prototypes) of dimension \(n_c \times p\), and \(\mathbf{U} = [u_{ik}]\) is the matrix of membership coefficients, \(u_{ik}\) being the membership of observation \(\mathbf{x}_i\) to cluster \(C_k\); \(\mathbf{U}\) is therefore of dimension \(n \times n_c\). \(d\) is the Euclidean distance between an observation and a cluster center, and \(n_c\) denotes the number of clusters.
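To make the notation concrete, the following small helper (a sketch, not from the original text; the name objective_J is ours) evaluates \(J(\mathbf{X}, \mathbf{V}, \mathbf{U})\) for a hard partition, where the membership matrix \(\mathbf{U}\) is given implicitly by a vector of cluster labels:
objective_J <- function(x, centers, cluster){
  x <- as.matrix(x); centers <- as.matrix(centers)
  ## each observation contributes the squared Euclidean distance
  ## to the center of the cluster it is assigned to (u_ik = 1)
  sum((x - centers[cluster, , drop = FALSE])^2)
}
For a result cl of R's kmeans(), objective_J(x, cl$centers, cl$cluster) should match cl$tot.withinss, since the total within-cluster sum of squares is exactly this objective function.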
The k-means algorithm can be implemented as follows.
Fix \(n_c\), \(2 \leq n_c < n\), and choose the
termination tolerance \(\delta >
0\), for example, \(0.001\).
Initialize \(\mathbf{U}^{(0)}\) (for
example, randomly).
REPEAT for \(r = 1, 2, \ldots\)
1. E-step: Calculate the cluster centers of the current partition:
\[\mathbf{v}_k = \frac{1}{\sum_{i=1}^{n} u_{ik}} \left(\sum_{i=1}^{n} u_{ik} x_{i1}, \ldots, \sum_{i=1}^{n} u_{ik} x_{ip} \right), \quad \text{for } k = 1, \ldots, n_c\]
2. M-step: Update \(\mathbf{U}^{(r)}\): Reallocate the cluster memberships.
\[ u_{ij}^{(r)} = \begin{cases} 1 & \text{if } d(\mathbf{x}_i, \mathbf{v}_j^{(r)}) = \min\limits_{1 \leq l \leq n_c} d(\mathbf{x}_i, \mathbf{v}_l^{(r)}) \\ 0 & \text{otherwise} \end{cases} \]
UNTIL \(\|\mathbf{U}^{(r)} - \mathbf{U}^{(r-1)}\| < \delta\)
We now implement the whole k-means algorithm ourselves, just to get in touch with the code and the algorithm.
For professional implementations of k-means see, for example, (Leisch 2006).
For the cluster algorithm we need a distance function.
We use the Manhattan distance:
distMan <- function(x, centers){
  if(is.data.frame(x)) x <- as.matrix(x)
  if(is.data.frame(centers)) centers <- as.matrix(centers)
  d <- matrix(0, nrow=nrow(x), ncol=nrow(centers))
  ## Manhattan distance of all observations to each cluster center
  for(k in 1:nrow(centers)){
    d[,k] <- colSums(abs(t(x) - centers[k,]))
  }
  return(d)
}
We also need a function that calculates the cluster centers. We may use the arithmetic mean, but we may also use the median, as in the following:
means <- function(x, cluster){
  cen <- matrix(NA, nrow=max(cluster), ncol=ncol(x))
  ## cluster centers (here: columnwise medians) for each cluster
  for(n in 1:max(cluster)){
    cen[n,] <- apply(x[cluster==n, , drop=FALSE], 2, median)
  }
  return(cen)
}
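A quick sanity check of the two helper functions on a tiny artificial data set (a sketch, not part of the original text):
## two well-separated groups of five observations each
set.seed(1)
toy <- rbind(matrix(rnorm(10, mean = 0), ncol = 2),
             matrix(rnorm(10, mean = 5), ncol = 2))
cl <- rep(1:2, each = 5)
## columnwise medians per cluster and Manhattan distances to both centers
cen <- means(toy, cl)
cen
distMan(toy, cen)[1:3, ]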
We now write a function for the k-means algorithm that implements the formulas above. In order to later produce some plots showing the EM approach in detail, we use a for loop:
my_kmeans <- function(x, k, clmeans = means, distance = distMan, iter =
99999, seed = NULL){
if(!is.null(seed)) set.seed(seed)
cent <- newcent <- x[sample(1:nrow(x), size=k), ]
oldclust <- 0
j <- 0
for(i in 1:iter){ # better: while()
j <- j + 1
cent <- newcent
## M-step
dist <- distance(x, cent)
clust <- max.col(-dist)
## E-step
newcent <- clmeans(x, clust)
if(all(clust == oldclust)) break()
oldclust <- clust
}
res <- list(centers = cent,
cluster = factor(clust),
iterations = j)
return(res)
}
As we can see, in k-means clustering the E-step is the fitting step and the M-step is the assignment step. Iterating the E- and M-steps improves the solution; this means that \(J(\mathbf{X}, \mathbf{V}, \mathbf{U})\) gets smaller in each iteration. We stop the algorithm when the cluster assignment no longer changes. Let's grab some data. We want to keep it simple and to show the clusters visually, so we take a two-dimensional data set, shown in Figure 9.1.
data(Nclus, package = "flexclust")
x <- data.frame(Nclus)
library("ggplot2")
ggplot(x, aes(X1, X2)) + geom_point()
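Before switching to R's built-in kmeans(), we can briefly try our toy implementation on the same data (a sketch; since it uses the median and the Manhattan distance from above, it behaves like a k-medians variant rather than classical k-means, and the cluster numbering depends on the random start):
res <- my_kmeans(x, k = 4, seed = 123456)
res$iterations     # number of iterations until the assignment stabilizes
table(res$cluster) # cluster sizes of the final partition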
In the following we plot the results after iterations 1 and 2, and after convergence. Instead of our simplified implementation of the k-means algorithm, we use the default k-means implementation of R. Several variants of k-means exist; we chose the algorithm of "MacQueen", but only in order to explore the algorithm (the default method, "Hartigan-Wong", converges too quickly to show the steps of the algorithm). Note that the k-means algorithm starts with randomly chosen cluster centers; thus we have to set the seed to ensure the same start in each call of k-means:
set.seed(123456)
cl1 <- kmeans(x, centers = 4, iter.max = 1, algorithm = "MacQueen")
## Warning: did not converge in 1 iteration
set.seed(123456)
cl2 <- kmeans(x, centers = 4, iter.max = 2, algorithm = "MacQueen")
## Warning: did not converge in 2 iterations
set.seed(123456)
cl3 <- kmeans(x, centers = 4, iter.max = 3, algorithm = "MacQueen")
## Warning: did not converge in 3 iterations
set.seed(123456)
cl4 <- kmeans(x, centers = 4, algorithm = "MacQueen")
cl1$centers
## X1 X2
## 1 1.4080304 1.9778530
## 2 -1.9756156 6.0385730
## 3 0.1149815 -0.7912543
## 4 5.7943888 2.6065412
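The text above refers to plots of the partition after iterations 1 and 2 and after convergence; a minimal sketch of how such plots can be produced with ggplot2 (the helper plotCluster is our own, not from the text) is:
plotCluster <- function(data, km, title = ""){
  ggplot(data, aes(X1, X2, colour = factor(km$cluster))) +
    geom_point() +
    ## add the current cluster centers as black crosses
    geom_point(data = data.frame(km$centers), aes(X1, X2),
               inherit.aes = FALSE, colour = "black", shape = 3, size = 4) +
    ggtitle(title) +
    theme(legend.position = "none")
}
plotCluster(x, cl1, "after iteration 1")
plotCluster(x, cl2, "after iteration 2")
plotCluster(x, cl4, "after convergence")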
Note that k-means only takes the cluster centers into account and works with a distance function to calculate the distances from the observations to the cluster centers. Another (often more favorable) approach is to also incorporate the shape of the clusters, that is, their covariance structure. This is implemented in the model-based clustering framework (Fraley and Raftery 2002). Model-based procedures mostly give better clustering results (Templ, Filzmoser, and Reimann 2008), but they are computationally more complex, since in each E-step the covariance of each cluster must additionally be calculated.
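As a brief illustration of this alternative (a sketch, not from the original text; the mclust package implements the Fraley and Raftery framework, and the cross-tabulation only serves as a rough comparison since cluster labels are arbitrary):
library("mclust")
## Gaussian mixture model with 4 components; the covariance
## structure (cluster shapes) is selected via BIC
mb <- Mclust(x, G = 4)
table(mb$classification, cl4$cluster)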
The EM algorithm is extensively used for the imputation of missing values. Implementations include (van Buuren and Groothuis-Oudshoorn 2011), (Schafer 1997), (Templ, Alfons, and Filzmoser 2011), (Raghunathan et al. 2001), and (Gelman and Hill 2011). In the following we want to show how an EM algorithm generally works for this kind of problem. First we take a data set to impute. We select again the sleep data:
library("MASS")
library("robustbase")
library("VIM")
## Loading required package: colorspace
## Loading required package: grid
## VIM is ready to use.
## Suggestions and bug-reports can be submitted at: https://github.com/statistikat/VIM/issues
##
## Attaching package: 'VIM'
## The following object is masked from 'package:datasets':
##
## sleep
data("sleep")
str(sleep)
## 'data.frame': 62 obs. of 10 variables:
## $ BodyWgt : num 6654 1 3.38 0.92 2547 ...
## $ BrainWgt: num 5712 6.6 44.5 5.7 4603 ...
## $ NonD : num NA 6.3 NA NA 2.1 9.1 15.8 5.2 10.9 8.3 ...
## $ Dream : num NA 2 NA NA 1.8 0.7 3.9 1 3.6 1.4 ...
## $ Sleep : num 3.3 8.3 12.5 16.5 3.9 9.8 19.7 6.2 14.5 9.7 ...
## $ Span : num 38.6 4.5 14 NA 69 27 19 30.4 28 50 ...
## $ Gest : num 645 42 60 25 624 180 35 392 63 230 ...
## $ Pred : int 3 3 1 5 3 4 1 4 1 1 ...
## $ Exp : int 5 1 1 2 5 4 1 5 2 1 ...
## $ Danger : int 3 3 1 3 4 4 1 4 1 1 ...
Missing values are present in some of the variables, as in the variable Sleep:
apply(sleep, 2, function(x) any(is.na(x)))
## BodyWgt BrainWgt NonD Dream Sleep Span Gest Pred
## FALSE FALSE TRUE TRUE TRUE TRUE TRUE FALSE
## Exp Danger
## FALSE FALSE
How can we impute missing values in the variable Sleep? This might be done by performing regression fits. Ideally, we would initialize the missing values with a well-performing imputation algorithm, for example, a k-nearest neighbor imputation approach (Templ, Alfons, and Filzmoser 2011). However, to see the progress of the EM, we start with a very bad initialization of the missing values in the variable Sleep, simply using a large number. Note that we also need the index of the missing values:
## index of missing values
ind <- data.frame(is.na(sleep))
## initialization
sleep <- kNN(sleep)
## Time difference of 0.04911399 secs
## overwrite missing initialization with bad choice
sleep$Sleep[ind$Sleep] <- 2240 # bad initialization
## initialized missing values in variable sleep
sleep$Sleep[ind$Sleep]
## [1] 2240 2240 2240 2240
We then fit a regression model for the variable Sleep. The model results (regression coefficients) are then used to predict the missing values:
## E-step (1)
lm1 <- lm(Sleep ~ log(BodyWgt) + log(BrainWgt) + NonD + Danger, data =
sleep)
## M-step (1)
sleep$Sleep[ind$Sleep] <- predict(lm1)[ind$Sleep]
## print of updated missing values
sleep$Sleep[ind$Sleep]
## [1] 469.5127 559.9771 408.6845 229.0985
We may continue with the second iteration.
## E-step (2)
lm1 <- lm(Sleep ~ log(BodyWgt) + log(BrainWgt) + NonD + Danger, data =
sleep)
## M-step (2)
sleep$Sleep[ind$Sleep] <- predict(lm1)[ind$Sleep]
## print of updated missing values
sleep$Sleep[ind$Sleep]
## [1] 101.9265 121.6146 90.1618 48.7181
We see that the values still change a lot. Let's do this iteratively until the values change by less than a very small threshold. We can write a small function for the imputation of the variable Sleep:
EMregSleep <- function(method = lm, eps = 0.001, init = "bad"){
## index of missing values
ind <- is.na(sleep)
colnames(ind) <- colnames(sleep)
indsleep <- ind[, "Sleep"]
## initialization
if(init == "bad"){
sleep <- kNN(sleep, imp_var = FALSE)
sleep$Sleep[indsleep] <- 2240 # bad initialization
}
if(init == "worst"){
sleep[ind] <- 2240 # worst initialization
}
iteration <- 0
criteria <- 99999
while(criteria > eps){
iteration <- iteration + 1
prev_sol <- sleep$Sleep[indsleep]
## E-step
lm1 <- method(Sleep ~ log(BodyWgt) + log(BrainWgt) + NonD + Danger,
data = sleep)
## M-step
sleep$Sleep[indsleep] <- predict(lm1)[indsleep]
criteria <- sqrt(sum((prev_sol - sleep$Sleep[indsleep])^2))
}
res <- list("imputed" = sleep,
"iteration" = iteration,
lastmodel = lm1)
return(res)
}
Again we load the data set sleep, impute it and look at the imputed values in the variable Sleep:
data("sleep")
sleepImp <- EMregSleep()
## Time difference of 0.179677 secs
missVals <- sleepImp$imputed$Sleep[ind$Sleep]
missVals
## [1] 3.845778 13.122764 3.658173 16.975766
sleepImp$iteration
## [1] 11
However, we imputed with expected values, which leads to a decrease in variance, since we did not account for the uncertainty and distribution of the missing values (compare the variance estimation with missing values in Chapter 8, Applications of Resampling Methods and Monte Carlo Tests). To account for the variance, we sample residuals (compare the bootstrapping residuals regression approach in Chapter 8, Applications of Resampling Methods and Monte Carlo Tests):
missVals + sample(residuals(sleepImp$lastmodel), length(missVals))
## 7 45 54 36
## 4.122294 13.199528 3.844590 17.827567
Previously we saw that we needed 11 iterations. In addition, the OLS regression model might be influenced by outliers, thus it is better to replace it with a robust method. With the robust method we already see good results after the first iteration (not shown here). We get slightly different results that are usually more trustworthy than those from non-robust methods (Templ, Kowarik, and Filzmoser 2011). The OLS results may become corrupted, especially when outliers are also present in the predictors. However, we see that even with the worst initialization (also producing huge outliers in the predictors) the results look fine (although we prefer the robust method anyhow):
data("sleep")
## OLS regression
lm_ols <- EMregSleep(method = lm, init = "worst")
## M-estimation
lm_rlm <- EMregSleep(method = rlm, init = "worst", eps= 0.01)
lm_ols$imputed$Sleep[ind[, "Sleep"]]
## [1] 4.239191 8.169014 4.368256 13.775087
lm_rlm$imputed$Sleep[ind[, "Sleep"]]
## [1] 3.766792 7.788943 3.925772 13.700029
From these figures we can see that the OLS results are highly influenced by outliers. Compared to the previous estimates (using the bad initialization, not the worst one), the imputed values are too high. This is not as extreme when using an M-estimator, but compared to the implementation in the function irmi (see the following example) we underestimate the second and fourth value.
We have discussed how to impute one variable, but in general we want to impute all variables in a data set. A data set may consist not only of continuous variables but of a mix of continuous, semi-continuous, categorical, binary, and/or count variables. The robust EM-based imputation accounts for this (and more, such as specifying a model for each variable) and is implemented in the function irmi (Templ, Kowarik, and Filzmoser 2011) in the R package VIM (Templ, Alfons, and Filzmoser 2011):
data("sleep")
sleepImp <- irmi(sleep)
## Time difference of 0.03798294 secs
sleepImp[ind[, "Sleep"], "Sleep"]
## [1] 3.748899 10.089591 3.156300 17.085060
We see that this is very close to the solution we got earlier with the better initialization of the missing values. This indicates that irmi works well. We may use another method, such as mice (irmi, too, is usually used for multiple imputation):
library("mice")
##
## Attaching package: 'mice'
## The following object is masked from 'package:stats':
##
## filter
## The following objects are masked from 'package:base':
##
## cbind, rbind
## Loading required package: Rcpp
## mice 2.25 2015-11-09
data("sleep")
em_mice <- mice(sleep, m = 1)
##
## iter imp variable
## 1 1 NonD Dream Sleep Span Gest
## 2 1 NonD Dream Sleep Span Gest
## 3 1 NonD Dream Sleep Span Gest
## 4 1 NonD Dream Sleep Span Gest
## 5 1 NonD Dream Sleep Span Gest
## Warning: Number of logged events: 2
em_mice$imp$Sleep
## 1
## 21 12.8
## 31 9.8
## 41 2.6
## 62 11.2
## now with bad initialization in the predictors
sleep[is.na(sleep)] <- 2240
sleep$Sleep[ind[, "Sleep"]] <- NA
em_mice <- mice(sleep, m = 1)
##
## iter imp variable
## 1 1 Sleep
## 2 1 Sleep
## 3 1 Sleep
## 4 1 Sleep
## 5 1 Sleep
## Warning: Number of logged events: 5
em_mice$imp$Sleep
## 1
## 21 3.9
## 31 2.6
## 41 3.1
## 62 13.8
We see that we get a completely different result as soon as outliers are present in the data set. This is also true for other implementations of the EM algorithm that are only suitable when the data set is approximately multivariate normal. As soon as this assumption is violated (as is typically the case in practice), irmi might be a good choice. Note that in irmi many other parameters can be specified (not discussed here):
args(irmi)
## function (x, eps = 5, maxit = 100, mixed = NULL, mixed.constant = NULL,
## count = NULL, step = FALSE, robust = FALSE, takeAll = TRUE,
## noise = TRUE, noise.factor = 1, force = FALSE, robMethod = "MM",
## force.mixed = TRUE, mi = 1, addMixedFactors = FALSE, trace = FALSE,
## init.method = "kNN", modelFormulas = NULL, multinom.method = "multinom",
## imp_var = TRUE, imp_suffix = "imp")
## NULL
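For example (a sketch, using only arguments visible in the args() output above), robust models within the EM and several multiply imputed data sets could be requested as follows:
data("sleep")
## robust (MM-estimation based) regression models and 5 imputations
sleepImpRob <- irmi(sleep, robust = TRUE, mi = 5)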
The EM algorithm is a computational approach to finding a solution for maximum likelihood estimators. Basically, the EM algorithm consists of two steps: the E-step, where expected values are estimated given the current parameters, and the M-step, where the parameters are re-estimated by maximization given those expected values. The algorithm usually converges quickly and is applied in many areas.
In this chapter we saw the application in two areas: clustering and the imputation of missing values. Clustering is an NP-hard problem; loosely speaking, we cannot find an exact closed-form solution in a reasonable time. The EM algorithm is therefore necessary to iteratively find a good solution. In clustering, the EM algorithm is implemented in the k-means clustering algorithm, but also (not shown in this chapter) in model-based clustering and for mixture models in general.
Missing values occur frequently in data sets in practice. Data scientists probably spend a large share of their time on data pre-processing, which includes imputing missing values. We saw that the EM algorithm is a central tool for this task.