This post implements simple k-means clustering step by step, together with some rather cute interactive 3D visualizations that actually took most of the time.
# iris ships with base R: keep the species labels and a numeric feature matrix
iris_y = iris$Species
iris_X = as.matrix(iris[,1:4])
cnames = colnames(iris_X)
colnames(iris_X) = NULL
n_obs = nrow(iris_X)
n_features = ncol(iris_X)
# the wine data is read from a local CSV; its first column holds the class labels
wine <- read.csv("~/Studies/Year3/DataMining/wine.data", header=FALSE)
wine_y = wine[,1]
wine_X = wine[,2:ncol(wine)]
A standard step in any data analysis pipeline is preprocessing. We'll center the data and then rescale every feature to the [-1, 1] range.
normalize <- function(M) {
  # center the data
  means = apply(M,2,mean)
  Xnorm = t(apply(M,1,function(x) {x-means}))
  Xnorm
}
encode <- function(M) {
  # rescale each feature to the [-1, 1] range
  mins = apply(M,2,min)
  maxs = apply(M,2,max)
  ranges = maxs-mins
  Xnorm = t(apply(M,1,function(x) { 2*(x-mins)/ranges-1 }))
  Xnorm
}
Now we can apply preprocessing procedures:
iris_norm = normalize(iris_X)
iris_norm = encode(iris_norm)
wine_norm = normalize(wine_X)
wine_norm = encode(wine_norm)
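A quick sanity check (an aside, not part of the original write-up): after encode() every feature should lie within [-1, 1].
range(iris_norm)            # expected to be exactly -1 and 1
apply(wine_norm, 2, range)  # per-feature minima and maxima, all within [-1, 1]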
This procedure creates k centroids whose dimensionality equals the number of features. The centroids are initialized to random vectors with each coordinate drawn uniformly from [-1, 1].
initCentroids <- function(k,n_features) {
  matrix(runif(k*n_features,-1,1), ncol = n_features)
}
To measure proximity between two data points we use the simple Euclidean distance: \[d(x_i,x_j) = \sqrt{\sum_{k=1}^{d}{(x_{ik}-x_{jk})^2}}\] The following procedure calculates the distance between each observation and each centroid. For n observations and k centroids we get an n by k matrix.
calculate_distances <- function(X, centroids) {
  distances = c()
  for (i in seq(1,nrow(centroids))){
    # Euclidean distance from every observation to centroid i
    distance = as.matrix(apply(X,1, function(x) norm(x-centroids[i,],type="2")))
    colnames(distance) = paste("centroid",i)
    distances = cbind(distances,distance)
  }
  distances
}
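As a quick check (again an aside), for the 150 iris observations and three random centroids the result should be a 150 by 3 matrix:
dim(calculate_distances(iris_norm, initCentroids(3, ncol(iris_norm))))  # 150 3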
Now we need to assign labels based on the distances we've calculated. To do that we simply give each observation the number of its closest centroid.
# assigns each observation a label based on the minimum distance
# note that a label column is appended after this step, so there are 5 columns for the iris data set
assign_labels <- function(X, distances) {
  labels = t(apply(distances, 1, which.min))
  result = cbind(X,t(labels))
  result
}
After all the observation labels have been updated, we update the centroids by computing the mean of each cluster and setting it as the centroid's new value.
update_centroids <- function(X,centroids) {
  k = nrow(centroids)
  cl = ncol(X)  # the label column is the last one
  for (i in 1:k) {
    # new centroid = mean of the observations currently assigned to cluster i
    centroids[i,] = colMeans(subset(X, X[,cl]==i, select = 1:(cl-1)))
  }
  centroids
}
Now that all the procedures are implemented we can put everything together and run the analysis. The following procedure lets us control both k and epsilon, where k is the number of clusters and epsilon is the threshold that tells us when to stop updating the centroids (that is, when they have converged to stable values).
runKmeans <- function(X, k, epsilon) {
  centroids = initCentroids(k,ncol(X))
  tolerance = 2
  while (tolerance > epsilon) {
    distances = calculate_distances(X, centroids)
    result = assign_labels(X,distances)
    print(centroids)
    new_centroids = update_centroids(result,centroids)
    # stop once the centroids barely move between iterations
    tolerance = (norm((new_centroids-centroids),type="2"))^2
    centroids = new_centroids
  }
  list(result=result, centroids=centroids)
}
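One small aside (not part of the original write-up): because initCentroids() draws random starting points, repeated runs can give different results; fixing the RNG seed before a call makes a run reproducible.
set.seed(42)  # hypothetical seed value; fix it before runKmeans() for repeatable results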
The maximum number of dimensions we can visualise is 3, but even iris has 4 features, so we need to reduce the dimensionality somehow. There are several ways to do that; we will apply two different but related methods. The first is to keep the 3 dimensions with the highest variance; the second is to run PCA and take the first 3 components. We will use the first method for iris and the second for wine.
First we’ll run the analysis for the iris data set.
res = runKmeans(iris_norm,3,0.05)
## [,1] [,2] [,3] [,4]
## [1,] 0.1985557 0.8888422 0.1307061 -0.735360
## [2,] 0.2741954 0.9601564 0.2303602 -0.412124
## [3,] 0.7434151 -0.1479175 0.6275669 -0.535386
## [,1] [,2] [,3] [,4]
## [1,] -0.6077778 0.1900000 -0.8433898 -0.8783333
## [2,] 0.0000000 0.1250000 0.3813559 0.6041667
## [3,] 0.0937500 -0.2899306 0.3216808 0.3012153
## [,1] [,2] [,3] [,4]
## [1,] -0.60777778 0.19000000 -0.8433898 -0.8783333
## [2,] 0.34541063 0.02898551 0.5563744 0.7644928
## [3,] 0.01370851 -0.36363636 0.2546775 0.1785714
Now choose dimensions with the highest variance:
ind = sort(apply(iris_norm,2,var),index.return=TRUE, decreasing = TRUE)$ix[1:3]
sset = iris_norm[,ind]
The following piece of code is really hairy and it was not easy to make it work, but it produces awesome interactive 3D plots. Note that the ellipsoids shown are NOT the clusters themselves but regions of highest point density.
library(rgl)
library(car)
x = sset[,1]
y = sset[,2]
z = sset[,3]
labs = cnames[ind]
labels = res$result[,5]
labels_original = iris$Species
par3d("windowRect"= c(0,0,400,400))
scatter3d(x = x, y = y, z = z,
          xlab = labs[1], ylab = labs[2], zlab = labs[3],
          labels = NULL,
          groups = as.factor(labels),
          surface = FALSE,
          grid = FALSE,
          ellipsoid = TRUE)
rgl.viewpoint(-10, 25)
rgl.snapshot(filename = "plot.png")
![Results](plot.png)
Let's now compare the results of our k-means clustering to the 'original' classes present in the data set. The easiest way to do that is by visualization, and we can see that the groupings are rather similar.
scatter3d(x = x, y = y, z = z,
          xlab = labs[1], ylab = labs[2], zlab = labs[3],
          labels = NULL,
          groups = as.factor(labels_original),
          surface = FALSE,
          grid = FALSE,
          ellipsoid = TRUE)
rgl.viewpoint(-10, 25)
rgl.snapshot(filename = "plot_io.png")
![Results](plot_io.png)
res$centroids
## [,1] [,2] [,3] [,4]
## [1,] -0.6077778 0.19000000 -0.8433898 -0.8783333
## [2,] 0.4326599 -0.07070707 0.6240370 0.7095960
## [3,] -0.0787728 -0.37313433 0.1763218 0.1181592
colSums(calculate_distances(res$result[,1:4],res$centroids))
## centroid 1 centroid 2 centroid 3
## 208.7383 202.7415 145.8912
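Beyond the plots, a quick cross-tabulation of the k-means labels against the true species (an aside, not in the original write-up) gives a numeric view of the agreement; the cluster numbering is arbitrary and the exact counts depend on the random initialization.
# rows: k-means cluster, columns: true species
table(cluster = res$result[,5], species = iris$Species)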
Now let's do the same for the wine data set:
res = runKmeans(wine_norm,3,0.005)
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 0.3087360 0.02581823 0.1734075 -0.68846616 -0.2376465 -0.6069108
## [2,] 0.3948500 -0.44426018 0.4679337 -0.09596506 0.4346993 -0.4173071
## [3,] 0.9424983 0.23237106 0.3718964 -0.55777813 -0.8182166 -0.4504194
## [,7] [,8] [,9] [,10] [,11] [,12]
## [1,] 0.7626552 -0.1278269 -0.2457509 0.2370720 0.05361631 -0.7421733
## [2,] -0.1124396 -0.7754719 -0.8376925 0.3274480 0.36040705 0.8283871
## [3,] 0.7804057 -0.4539667 0.7412110 0.1159321 0.51978140 0.4348966
## [,13]
## [1,] 0.6071558
## [2,] -0.3289016
## [3,] 0.3525480
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 0.12017544 -0.1876273 0.06530546 -0.07310216 -0.3754941 -0.3258098
## [2,] -0.07501385 -0.5049719 0.09665072 -0.04199674 -0.3249428 -0.0107441
## [3,] 0.34179567 -0.3136480 0.00723498 -0.35051546 -0.4271100 0.3509128
## [,7] [,8] [,9] [,10] [,11] [,12]
## [1,] -0.5400844 0.09891366 -0.4085651 -0.1088272 -0.44010840 -0.4766345
## [2,] -0.1933378 -0.22701092 -0.2531297 -0.5302497 -0.10483526 0.2382495
## [3,] 0.1697692 -0.42508324 0.3312303 -0.3342702 -0.04734577 0.3402284
## [,13]
## [1,] -0.2825833
## [2,] -0.4333659
## [3,] 0.0519426
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0.06969147 -0.06010631 0.09810068 0.062388909 -0.3935532
## [2,] -0.28353383 -0.52315076 0.04278075 0.004123711 -0.4071429
## [3,] 0.44842105 -0.51162055 0.09860963 -0.373608247 -0.2317391
## [,6] [,7] [,8] [,9] [,10] [,11]
## [1,] -0.50582640 -0.7745526 0.2244632 -0.54345698 -0.07284924 -0.59893468
## [2,] -0.06482759 -0.2053647 -0.1892183 -0.20892294 -0.67620673 -0.02206736
## [3,] 0.34634483 0.1634599 -0.4407547 0.01526814 -0.23358362 -0.07056911
## [,12] [,13]
## [1,] -0.6569408 -0.5122485
## [2,] 0.2225013 -0.5623395
## [3,] 0.3897436 0.2027389
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0.08937799 -0.04311894 0.12027224 0.0766635426 -0.3770751
## [2,] -0.34689709 -0.50115037 -0.02210871 -0.0007693491 -0.4574951
## [3,] 0.44539474 -0.53105590 0.15145149 -0.3381811487 -0.2057453
## [,6] [,7] [,8] [,9] [,10] [,11]
## [1,] -0.51047022 -0.7857307 0.2370497 -0.54344709 -0.03471921 -0.61490022
## [2,] -0.07833248 -0.2231249 -0.1760068 -0.18338905 -0.69357139 -0.05618250
## [3,] 0.29938424 0.1256781 -0.4198113 -0.05723299 -0.26535836 -0.03977933
## [,12] [,13]
## [1,] -0.6781885 -0.5052004
## [2,] 0.1876879 -0.6559925
## [3,] 0.3872318 0.2285765
# compress wine with pca
components = prcomp(wine_norm)$rotation[,1:3]
pcawine = wine_norm%*%(components)
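# Side note (not part of the original analysis): check how much of the total
# variance the first three principal components actually capture.
summary(prcomp(wine_norm))$importance[, 1:3]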
x = pcawine[,1]
y = pcawine[,2]
z = pcawine[,3]
labels = res$result[,ncol(res$result)]
labels_original = wine[,1]
par3d("windowRect"= c(0,0,400,400))
scatter3d(x = x, y = y, z = z,
          groups = as.factor(labels),
          surface = FALSE,
          grid = FALSE,
          ellipsoid = TRUE)
rgl.viewpoint(-10, 25)
rgl.snapshot(filename = "plot2.png")
![Results](plot2.png)
And let's look at the original classes:
par3d("windowRect"= c(0,0,400,400))
scatter3d(x = x, y = y, z = z,
          groups = as.factor(labels_original),
          surface = FALSE,
          grid = FALSE,
          ellipsoid = TRUE)
rgl.viewpoint(-10, 25)
rgl.snapshot(filename = "plot2_wine.png")
![Results](plot2_wine.png)
res$centroids
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0.08937799 -0.04311894 0.12027224 0.076663543 -0.3770751
## [2,] -0.36192434 -0.51290761 -0.03709893 0.002899485 -0.4904891
## [3,] 0.42140946 -0.51678167 0.15888698 -0.325004368 -0.1827561
## [,6] [,7] [,8] [,9] [,10] [,11]
## [1,] -0.51047022 -0.7857307 0.2370497 -0.54344709 -0.03471921 -0.61490022
## [2,] -0.08653017 -0.2285733 -0.1709906 -0.20277997 -0.69941873 -0.05741870
## [3,] 0.28907072 0.1138525 -0.4128558 -0.04261348 -0.28078903 -0.03927243
## [,12] [,13]
## [1,] -0.6781885 -0.5052004
## [2,] 0.1830357 -0.6757088
## [3,] 0.3821320 0.2049856
colSums(calculate_distances(res$result[,1:13],res$centroids))
## centroid 1 centroid 2 centroid 3
## 320.8720 277.9593 301.6900
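As with iris, a cross-tabulation against the original wine classes (again an aside; counts will vary with the random initialization) summarizes the agreement:
# rows: k-means cluster, columns: original wine class
table(cluster = res$result[, ncol(res$result)], class = wine_y)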
K-means clustering works well provided we know the number of clusters we want, but that is not always the case, and choosing k is not always easy. A key problem with k-means is that it has to recompute all the distances at every step, which can be expensive on large data sets. The other disadvantage is that we may end up with different clusters if we start from different initial centroids.
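A minimal sketch of that last point, assuming the functions defined above are still loaded: two runs started from different random centroids can converge to different partitions. Since the cluster numbers themselves are arbitrary, the cleanest comparison is a contingency table of the two label vectors (an unlucky initialization can leave a cluster empty, in which case a run simply needs to be repeated).
set.seed(1); res_a = runKmeans(iris_norm, 3, 0.05)  # arbitrary seeds, for illustration only
set.seed(2); res_b = runKmeans(iris_norm, 3, 0.05)
# if the two partitions agreed perfectly, every row of this table would have a single non-zero entry
table(run_a = res_a$result[,5], run_b = res_b$result[,5])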