Intro

This work implements simple k-means clustering step by step, with very cute 3D visualizations that actually took most of the time.

Load data

iris_y = iris$Species
iris_X = as.matrix(iris[,1:4])
cnames = colnames(iris_X)
colnames(iris_X) = NULL
n_obs = nrow(iris_X)
n_features = ncol(iris_X)


wine <- read.csv("~/Studies/Year3/DataMining/wine.data", header=FALSE)
wine_y = wine[,1]
wine_X = wine[,2:ncol(wine)]

Normalize

Preprocessing is a standard step in any data analysis pipeline. We’ll center the data and then scale each feature onto the [-1, 1] hypercube.

normalize <- function(M) {
        #center data
        means = apply(M,2,mean)
        Xnorm = t(apply(M,1,function(x) {x-means}))
        Xnorm
}

encode <- function(M) {
        # scale each feature to the [-1, 1] range
        mins = apply(M,2,min)
        maxs = apply(M,2,max)
        ranges = maxs-mins
        Xnorm = t(apply(M,1,function(x) { 2*(x-mins)/ranges-1}))
        Xnorm
}

Now we can apply preprocessing procedures:

iris_norm = normalize(iris_X)
iris_norm = encode(iris_norm)
wine_norm = normalize(wine_X)
wine_norm = encode(wine_norm)
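
As a quick sanity check (just a sketch, not part of the original pipeline), every feature should now span exactly [-1, 1]:

apply(iris_norm, 2, range)   # each column: min -1, max 1
apply(wine_norm, 2, range)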

Initialize centroids

This procedure creates k centroids with as many dimensions as there are features. The centroids are initialized to random vectors with each component drawn uniformly from [-1, 1].

initCentroids <- function(k,n_features) {
        matrix(runif(k*n_features,-1,1), ncol = n_features)
}
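
For example (an illustrative call only, not part of the pipeline), three centroids for the four iris features form a 3 by 4 matrix:

set.seed(123)                       # only so this example is reproducible
initCentroids(3, ncol(iris_norm))   # 3 x 4 matrix with entries in [-1, 1]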

Find distances

To measure proximity between two data points we use the simple Euclidean distance: \[d(x_i,x_j) = \sqrt{\sum_{l=1}^d{(x_{il}-x_{jl})^2}}\] The following procedure calculates the distance between each observation and each centroid. For n observations and k centroids we get an n by k matrix.

calculate_distances <- function(X, centroids) {
        distances = c()
        for (i in seq(1,nrow(centroids))){
                distance = as.matrix(apply(X,1, function(x) norm(x-centroids[i,],type="2")))
                colnames(distance) = paste("centroid",i)
                distances = cbind(distances,distance)
        }
        distances
}
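
A quick shape check (illustration only): with the 150 iris observations and three random centroids we should indeed get a 150 by 3 matrix:

dists = calculate_distances(iris_norm, initCentroids(3, ncol(iris_norm)))
dim(dists)   # 150 observations x 3 centroids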

Label observations

Now we need to apply the results and assign labels based on the distances we’ve just calculated: each observation simply gets the number of the centroid it is closest to, i.e. the column index of the minimal distance.

# assigns each observation a label based on the minimum distance
# note that a label column is appended in this step, so the iris data ends up with 5 columns
assign_labels <- function(X, distances) {
        labels = apply(distances, 1, which.min)
        result = cbind(X, labels)
        result
}
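
To see what this produces (again with throwaway random centroids, so the split itself is meaningless), we can count how many observations land in each cluster:

set.seed(123)
cents = initCentroids(3, ncol(iris_norm))
labeled = assign_labels(iris_norm, calculate_distances(iris_norm, cents))
table(labeled[, ncol(labeled)])   # cluster sizes for this random initialization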

Update centroids

After all the observation labels have been updated, we update our centroids by calculating the mean inside each cluster and setting it as the new value of the centroid.

update_centroids <- function(X,centroids) {
        k = nrow(centroids)
        cl = ncol(X)    # the last column of X holds the cluster labels
        for (i in 1:k) {
                # mean of the observations assigned to cluster i, label column excluded
                centroids[i,] = colMeans(subset(X, X[,cl]==i, select = 1:(cl-1)))
        }
        centroids
}
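
Putting the three helpers together for a single hand-rolled iteration (a sketch only; the full loop comes in the next section) shows each centroid being pulled toward the mean of its assigned points:

set.seed(123)
cents = initCentroids(3, ncol(iris_norm))
labeled1 = assign_labels(iris_norm, calculate_distances(iris_norm, cents))
update_centroids(labeled1, cents)   # rows move toward their cluster means; an empty cluster would give NaN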

Put everything together

Now that all the procedures are implemented, we can put everything together and run the analysis. The following procedure lets us control both k and epsilon, where k is the number of clusters and epsilon is the convergence threshold: we keep iterating until the squared norm of the change in the centroid matrix drops below epsilon, i.e. until the centroids have settled to stable values.

runKmeans <- function(X, k, epsilon) {
        centroids = initCentroids(k, ncol(X))
        tolerance = 2   # anything larger than epsilon, so the loop runs at least once
        while (tolerance > epsilon) {
                distances = calculate_distances(X, centroids)
                result = assign_labels(X, distances)
                print(centroids)   # show the centroids at every iteration
                new_centroids = update_centroids(result, centroids)
                tolerance = (norm((new_centroids - centroids), type="2"))^2
                centroids = new_centroids
        }
        list(result=result, centroids=centroids)
}

Data Visualization

The maximum number of dimensions we can visualise is 3, but even iris has 4 features, so we need to reduce the dimensionality somehow. There are a number of ways to do that; we will apply two different but related methods. The first is to keep the 3 dimensions with the highest variance, the second is to run PCA and keep 3 components. We will use the first method for iris and the second for wine.

First we’ll run the analysis for the iris data set.

res = runKmeans(iris_norm,3,0.05)
##           [,1]       [,2]      [,3]      [,4]
## [1,] 0.1985557  0.8888422 0.1307061 -0.735360
## [2,] 0.2741954  0.9601564 0.2303602 -0.412124
## [3,] 0.7434151 -0.1479175 0.6275669 -0.535386
##            [,1]       [,2]       [,3]       [,4]
## [1,] -0.6077778  0.1900000 -0.8433898 -0.8783333
## [2,]  0.0000000  0.1250000  0.3813559  0.6041667
## [3,]  0.0937500 -0.2899306  0.3216808  0.3012153
##             [,1]        [,2]       [,3]       [,4]
## [1,] -0.60777778  0.19000000 -0.8433898 -0.8783333
## [2,]  0.34541063  0.02898551  0.5563744  0.7644928
## [3,]  0.01370851 -0.36363636  0.2546775  0.1785714

Now choose dimensions with the highest variance:

ind = sort(apply(iris_norm,2,var),index.return=TRUE, decreasing = TRUE)$ix[1:3]
sset = iris_norm[,ind]

The following piece of code is really hairy and it was not easy to make it work, but it produces awesome interactive 3D plots. Note that the ellipsoids shown are NOT the clusters themselves but concentration ellipsoids marking where each group’s density is highest.

library(rgl)
library(car)
x = sset[,1]
y = sset[,2]
z = sset[,3]
labs = cnames[ind]
labels = res$result[,5]
labels_original = iris$Species
par3d("windowRect"= c(0,0,400,400))
scatter3d(x = x, y = y, z = z, 
          xlab=labs[1], ylab=labs[2], zlab = labs[3],
          labels = NULL,
          groups = as.factor(labels), 
          surface = FALSE, 
          grid = FALSE,
          ellipsoid=TRUE)
rgl.viewpoint(-10, 25)
rgl.snapshot(filename = "plot.png")
(Figure: 3D scatter of the k-means clusters for iris, saved as plot.png)

Let’s now compare the results of our k-means clustering to the ‘original’ clusters present in the dataset. The easiest way to do that is by visualization, and we can see that the results are rather similar.

scatter3d(x = x, y = y, z = z, 
          xlab=labs[1], ylab=labs[2], zlab = labs[3],
          labels = NULL,
          groups = as.factor(labels_original), 
          surface = FALSE, 
          grid = FALSE,
          ellipsoid=TRUE)
rgl.viewpoint(-10, 25)
rgl.snapshot(filename = "plot_io.png")
(Figure: 3D scatter of the original iris species, saved as plot_io.png)
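
As a rough numeric complement to the plots (just a sketch; the k-means cluster numbers are arbitrary, so only the grouping pattern matters, not which number maps to which species), we can cross-tabulate our labels against the species:

table(cluster = labels, species = labels_original)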

res$centroids
##            [,1]        [,2]       [,3]       [,4]
## [1,] -0.6077778  0.19000000 -0.8433898 -0.8783333
## [2,]  0.4326599 -0.07070707  0.6240370  0.7095960
## [3,] -0.0787728 -0.37313433  0.1763218  0.1181592
colSums(calculate_distances(res$result[,1:4],res$centroids))
## centroid 1 centroid 2 centroid 3 
##   208.7383   202.7415   145.8912

Now let’s do the same for the wine dataset:

res = runKmeans(wine_norm,3,0.005)
##           [,1]        [,2]      [,3]        [,4]       [,5]       [,6]
## [1,] 0.3087360  0.02581823 0.1734075 -0.68846616 -0.2376465 -0.6069108
## [2,] 0.3948500 -0.44426018 0.4679337 -0.09596506  0.4346993 -0.4173071
## [3,] 0.9424983  0.23237106 0.3718964 -0.55777813 -0.8182166 -0.4504194
##            [,7]       [,8]       [,9]     [,10]      [,11]      [,12]
## [1,]  0.7626552 -0.1278269 -0.2457509 0.2370720 0.05361631 -0.7421733
## [2,] -0.1124396 -0.7754719 -0.8376925 0.3274480 0.36040705  0.8283871
## [3,]  0.7804057 -0.4539667  0.7412110 0.1159321 0.51978140  0.4348966
##           [,13]
## [1,]  0.6071558
## [2,] -0.3289016
## [3,]  0.3525480
##             [,1]       [,2]       [,3]        [,4]       [,5]       [,6]
## [1,]  0.12017544 -0.1876273 0.06530546 -0.07310216 -0.3754941 -0.3258098
## [2,] -0.07501385 -0.5049719 0.09665072 -0.04199674 -0.3249428 -0.0107441
## [3,]  0.34179567 -0.3136480 0.00723498 -0.35051546 -0.4271100  0.3509128
##            [,7]        [,8]       [,9]      [,10]       [,11]      [,12]
## [1,] -0.5400844  0.09891366 -0.4085651 -0.1088272 -0.44010840 -0.4766345
## [2,] -0.1933378 -0.22701092 -0.2531297 -0.5302497 -0.10483526  0.2382495
## [3,]  0.1697692 -0.42508324  0.3312303 -0.3342702 -0.04734577  0.3402284
##           [,13]
## [1,] -0.2825833
## [2,] -0.4333659
## [3,]  0.0519426
##             [,1]        [,2]       [,3]         [,4]       [,5]
## [1,]  0.06969147 -0.06010631 0.09810068  0.062388909 -0.3935532
## [2,] -0.28353383 -0.52315076 0.04278075  0.004123711 -0.4071429
## [3,]  0.44842105 -0.51162055 0.09860963 -0.373608247 -0.2317391
##             [,6]       [,7]       [,8]        [,9]       [,10]       [,11]
## [1,] -0.50582640 -0.7745526  0.2244632 -0.54345698 -0.07284924 -0.59893468
## [2,] -0.06482759 -0.2053647 -0.1892183 -0.20892294 -0.67620673 -0.02206736
## [3,]  0.34634483  0.1634599 -0.4407547  0.01526814 -0.23358362 -0.07056911
##           [,12]      [,13]
## [1,] -0.6569408 -0.5122485
## [2,]  0.2225013 -0.5623395
## [3,]  0.3897436  0.2027389
##             [,1]        [,2]        [,3]          [,4]       [,5]
## [1,]  0.08937799 -0.04311894  0.12027224  0.0766635426 -0.3770751
## [2,] -0.34689709 -0.50115037 -0.02210871 -0.0007693491 -0.4574951
## [3,]  0.44539474 -0.53105590  0.15145149 -0.3381811487 -0.2057453
##             [,6]       [,7]       [,8]        [,9]       [,10]       [,11]
## [1,] -0.51047022 -0.7857307  0.2370497 -0.54344709 -0.03471921 -0.61490022
## [2,] -0.07833248 -0.2231249 -0.1760068 -0.18338905 -0.69357139 -0.05618250
## [3,]  0.29938424  0.1256781 -0.4198113 -0.05723299 -0.26535836 -0.03977933
##           [,12]      [,13]
## [1,] -0.6781885 -0.5052004
## [2,]  0.1876879 -0.6559925
## [3,]  0.3872318  0.2285765
# compress wine with pca
components = prcomp(wine_norm)$rotation[,1:3]
pcawine = wine_norm%*%(components)
x = pcawine[,1]
y = pcawine[,2]
z = pcawine[,3]
labels = res$result[,ncol(res$result)]
labels_original = wine[,1]
par3d("windowRect"= c(0,0,400,400))
scatter3d(x = x, y = y, z = z, 
          groups = as.factor(labels), 
          surface = FALSE, 
          grid = FALSE,
          ellipsoid = TRUE)
rgl.viewpoint(-10, 25)
rgl.snapshot(filename = "plot2.png")
(Figure: 3D scatter of the k-means clusters for wine in the first three principal components, saved as plot2.png)

And let’s look at the original classes:

par3d("windowRect"= c(0,0,400,400))
scatter3d(x = x, y = y, z = z, 
          groups = as.factor(labels_original), 
          surface = FALSE, 
          grid = FALSE,
          ellipsoid = TRUE)
rgl.viewpoint(-10, 25)
rgl.snapshot(filename = "plot2_wine.png")
(Figure: 3D scatter of the original wine classes, saved as plot2_wine.png)

res$centroids
##             [,1]        [,2]        [,3]         [,4]       [,5]
## [1,]  0.08937799 -0.04311894  0.12027224  0.076663543 -0.3770751
## [2,] -0.36192434 -0.51290761 -0.03709893  0.002899485 -0.4904891
## [3,]  0.42140946 -0.51678167  0.15888698 -0.325004368 -0.1827561
##             [,6]       [,7]       [,8]        [,9]       [,10]       [,11]
## [1,] -0.51047022 -0.7857307  0.2370497 -0.54344709 -0.03471921 -0.61490022
## [2,] -0.08653017 -0.2285733 -0.1709906 -0.20277997 -0.69941873 -0.05741870
## [3,]  0.28907072  0.1138525 -0.4128558 -0.04261348 -0.28078903 -0.03927243
##           [,12]      [,13]
## [1,] -0.6781885 -0.5052004
## [2,]  0.1830357 -0.6757088
## [3,]  0.3821320  0.2049856
colSums(calculate_distances(res$result[,1:13],res$centroids))
## centroid 1 centroid 2 centroid 3 
##   320.8720   277.9593   301.6900

Summary

The k-means clustering works well provided we know the number of clusters we want, but that is not always the case, so it is not always that easy. The key problem with k-means is that it recomputes the distance between every observation and every centroid at each step, which can become expensive for large data sets. The other disadvantage is that we get different clusters if we start from different initial centroids.
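
One common way around the last issue (shown here with base R’s kmeans() rather than our own implementation, purely as a sketch) is to try several random starts and keep the run with the lowest total within-cluster sum of squares:

set.seed(42)
kmeans(iris_norm, centers = 3, nstart = 1)$tot.withinss    # a single random start
kmeans(iris_norm, centers = 3, nstart = 20)$tot.withinss   # the best of 20 starts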