library(swirl)
swirl()

data clustering steps:

  1. guess how many clusters you have/want 2.randomly create a “centroid” (a phantom point) for each cluster and assign each point or observation in your dataset to the centroid to which it is closest. 3.readjust the centroid’s position by making it the average of the points assigned to it.
  2. recalculate the distance of the observations to the centroids and reassign any to the centroid closest to them.
  3. readjust the positions of the centroids based on the new cluster membership. The process stops once you reach an iteration in which no adjustments are made or when you’ve reached some predetermined maximum number of iterations.

Matrix of coordinates

cx<-c(1, 1.8,  2.5)
cy<-c(2,  1.0,  1.5)
cmat<- matrix(cx, cy)
cmat
     [,1] [,2] [,3]
[1,]    1  1.8  2.5
[2,]    2  1.0  1.5

Putting the points on the graph

points(cx, cy, col = c("red", "orange", "purple"), pch = 3, cex = 2, lwd = 2)

3 centroind and 12 datapoints -> 36 distances

mdist function

mdist<-function(x,y,cx,cy){
  distTmp <- matrix(NA,nrow=3,ncol=12)
  distTmp[1,] <- (x-cx[1])^2 + (y-cy[1])^2
  distTmp[2,] <- (x-cx[2])^2 + (y-cy[2])^2
  distTmp[3,] <- (x-cx[3])^2 + (y-cy[3])^2  
  return(distTmp)
}

to assign a cluster to each point, we’ll look at each column and pick the minimum entry

apply(distTmp,2,which.min)

To colour the 12 data points according to their assignments

points(x,y, pch=19, cex = 2, col = cols1[newClust])

Now we have to recalculate our centroids so they are the average | (center of gravity) of the cluster of points assigned to them. We | have to do the x and y coordinates separately. To vectors x,y

tapply(x, newClust, mean)
tapply(y, newClust, mean)
points(newCx, newCy, col = cols1, pch = 8, cex = 2, lwd = 2)
mdist(x,y, newCx, newCy)
distTmp2

Now call apply with 3 arguments, distTmp2, 2, and which.min to find | the new cluster assignments for the points.

apply(distTmp2, 2, which.min)
newClust2

Now use tapply to find the x and y coordinate of the new centroid

tapply(x, newClust2, mean)
tapply(y, newClust2, mean)
finalCx
finalCy

new centroids

 points(finalCx, finalCy, col = cols1, pch = 9, cex = 2, lwd = 2)

Now that you’ve gone through an example step by step, you’ll be
relieved to hear that R provides a command to do all this work for
you. Unsurprisingly it’s called kmeans and, although it has several
parameters, we’ll just mention four. These are x, (the numeric matrix
of data), centers, iter.max, and nstart. The second of these
(centers) can be either a number of clusters or a set of initial
centroids. The third, iter.max, specifies the maximum number of
iterations to go through, and nstart is the number of random starts
you want to try if you specify centers as a number.

kmeans(dataFrame, center = 3)
kmObj$iter
 plot(x,y, col =kmObj$cluster, pch= 19, cex = 2 )
points(kmObj$centers, col=c("black","red","green"), pch = 3, cex = 3, lwd = 3)
plot(x,y, col =  kmeans(dataFrame,6)$cluster, pch = 19, cex = 2 )
LS0tCnRpdGxlOiAiSy1tZWFucyIKb3V0cHV0OiBodG1sX25vdGVib29rCi0tLQoKYGBge3J9CmxpYnJhcnkoc3dpcmwpCnN3aXJsKCkKYGBgCgogZGF0YSBjbHVzdGVyaW5nIHN0ZXBzOgoKMS4gZ3Vlc3MgaG93IG1hbnkgY2x1c3RlcnMgeW91IGhhdmUvd2FudAoyLnJhbmRvbWx5IGNyZWF0ZSBhICJjZW50cm9pZCIgKGEgcGhhbnRvbSBwb2ludCkgZm9yIGVhY2ggY2x1c3RlciBhbmQgYXNzaWduIGVhY2ggcG9pbnQgb3Igb2JzZXJ2YXRpb24gaW4geW91ciBkYXRhc2V0IHRvIHRoZSBjZW50cm9pZCB0byB3aGljaCBpdCBpcyBjbG9zZXN0LgozLnJlYWRqdXN0IHRoZSBjZW50cm9pZCdzIHBvc2l0aW9uIGJ5IG1ha2luZyBpdCB0aGUgYXZlcmFnZSBvZiB0aGUgcG9pbnRzIGFzc2lnbmVkIHRvIGl0Lgo0LiByZWNhbGN1bGF0ZSB0aGUgZGlzdGFuY2Ugb2YgdGhlIG9ic2VydmF0aW9ucyB0byB0aGUgY2VudHJvaWRzIGFuZCByZWFzc2lnbiBhbnkgIHRvIHRoZSBjZW50cm9pZCBjbG9zZXN0IHRvIHRoZW0uCjUuIHJlYWRqdXN0IHRoZSBwb3NpdGlvbnMgb2YgdGhlIGNlbnRyb2lkcyBiYXNlZCBvbiB0aGUgbmV3IGNsdXN0ZXIgbWVtYmVyc2hpcC4KVGhlIHByb2Nlc3Mgc3RvcHMgb25jZSB5b3UgcmVhY2ggYW4gaXRlcmF0aW9uIGluIHdoaWNoIG5vIGFkanVzdG1lbnRzIGFyZSBtYWRlIG9yIHdoZW4geW91J3ZlIHJlYWNoZWQgc29tZSBwcmVkZXRlcm1pbmVkIG1heGltdW0gbnVtYmVyIG9mIGl0ZXJhdGlvbnMuCgoKTWF0cml4IG9mIGNvb3JkaW5hdGVzCmBgYHtyfQpjeDwtYygxLCAxLjgsICAyLjUpCmN5PC1jKDIsICAxLjAsICAxLjUpCmNtYXQ8LSBtYXRyaXgoY3gsIGN5KQpjbWF0CiAgICAgWywxXSBbLDJdIFssM10KWzEsXSAgICAxICAxLjggIDIuNQpbMixdICAgIDIgIDEuMCAgMS41CmBgYAoKClB1dHRpbmcgdGhlIHBvaW50cyBvbiB0aGUgZ3JhcGgKYGBge3J9CnBvaW50cyhjeCwgY3ksIGNvbCA9IGMoInJlZCIsICJvcmFuZ2UiLCAicHVycGxlIiksIHBjaCA9IDMsIGNleCA9IDIsIGx3ZCA9IDIpCmBgYAoKMyBjZW50cm9pbmQgYW5kIDEyIGRhdGFwb2ludHMgLT4gMzYgZGlzdGFuY2VzCgptZGlzdCBmdW5jdGlvbgpgYGB7cn0KbWRpc3Q8LWZ1bmN0aW9uKHgseSxjeCxjeSl7CiAgZGlzdFRtcCA8LSBtYXRyaXgoTkEsbnJvdz0zLG5jb2w9MTIpCiAgZGlzdFRtcFsxLF0gPC0gKHgtY3hbMV0pXjIgKyAoeS1jeVsxXSleMgogIGRpc3RUbXBbMixdIDwtICh4LWN4WzJdKV4yICsgKHktY3lbMl0pXjIKICBkaXN0VG1wWzMsXSA8LSAoeC1jeFszXSleMiArICh5LWN5WzNdKV4yICAKICByZXR1cm4oZGlzdFRtcCkKfQpgYGAKCnRvIGFzc2lnbiBhIGNsdXN0ZXIgdG8gZWFjaCBwb2ludCwgd2UnbGwgbG9vayBhdCBlYWNoIGNvbHVtbiBhbmQgcGljayB0aGUgbWluaW11bSBlbnRyeQoKCmBgYHtyfQphcHBseShkaXN0VG1wLDIsd2hpY2gubWluKQpgYGAKClRvIGNvbG91ciB0aGUgMTIgZGF0YSBwb2ludHMgYWNjb3JkaW5nIHRvIHRoZWlyIGFzc2lnbm1lbnRzCmBgYHtyfQpwb2ludHMoeCx5LCBwY2g9MTksIGNleCA9IDIsIGNvbCA9IGNvbHMxW25ld0NsdXN0XSkKYGBgCgogTm93IHdlIGhhdmUgdG8gcmVjYWxjdWxhdGUgb3VyIGNlbnRyb2lkcyBzbyB0aGV5IGFyZSB0aGUgYXZlcmFnZQp8IChjZW50ZXIgb2YgZ3Jhdml0eSkgb2YgdGhlIGNsdXN0ZXIgb2YgcG9pbnRzIGFzc2lnbmVkIHRvIHRoZW0uIFdlCnwgaGF2ZSB0byBkbyB0aGUgeCBhbmQgeSBjb29yZGluYXRlcyBzZXBhcmF0ZWx5LiBUbyB2ZWN0b3JzIHgseQoKYGBge3J9CnRhcHBseSh4LCBuZXdDbHVzdCwgbWVhbikKdGFwcGx5KHksIG5ld0NsdXN0LCBtZWFuKQpwb2ludHMobmV3Q3gsIG5ld0N5LCBjb2wgPSBjb2xzMSwgcGNoID0gOCwgY2V4ID0gMiwgbHdkID0gMikKYGBgCgoKYGBge3J9Cm1kaXN0KHgseSwgbmV3Q3gsIG5ld0N5KQpkaXN0VG1wMgpgYGAKCk5vdyBjYWxsIGFwcGx5IHdpdGggMyBhcmd1bWVudHMsIGRpc3RUbXAyLCAyLCBhbmQgd2hpY2gubWluIHRvIGZpbmQKfCB0aGUgbmV3IGNsdXN0ZXIgYXNzaWdubWVudHMgZm9yIHRoZSBwb2ludHMuCgpgYGB7cn0KYXBwbHkoZGlzdFRtcDIsIDIsIHdoaWNoLm1pbikKbmV3Q2x1c3QyCmBgYAoKTm93IHVzZSB0YXBwbHkgdG8gZmluZCB0aGUgeCBhbmQgeSBjb29yZGluYXRlIG9mIHRoZSBuZXcgY2VudHJvaWQKYGBge3J9CnRhcHBseSh4LCBuZXdDbHVzdDIsIG1lYW4pCnRhcHBseSh5LCBuZXdDbHVzdDIsIG1lYW4pCmZpbmFsQ3gKZmluYWxDeQpgYGAKCm5ldyBjZW50cm9pZHMKYGBge3J9CiBwb2ludHMoZmluYWxDeCwgZmluYWxDeSwgY29sID0gY29sczEsIHBjaCA9IDksIGNleCA9IDIsIGx3ZCA9IDIpCgpgYGAKCgp8IE5vdyB0aGF0IHlvdSd2ZSBnb25lIHRocm91Z2ggYW4gZXhhbXBsZSBzdGVwIGJ5IHN0ZXAsIHlvdSdsbCBiZQp8IHJlbGlldmVkIHRvIGhlYXIgdGhhdCBSIHByb3ZpZGVzIGEgY29tbWFuZCB0byBkbyBhbGwgdGhpcyB3b3JrIGZvcgp8IHlvdS4gVW5zdXJwcmlzaW5nbHkgaXQncyBjYWxsZWQga21lYW5zIGFuZCwgYWx0aG91Z2ggaXQgaGFzIHNldmVyYWwKfCBwYXJhbWV0ZXJzLCB3ZSdsbCBqdXN0IG1lbnRpb24gZm91ci4gVGhlc2UgYXJlIHgsICh0aGUgbnVtZXJpYyBtYXRyaXgKfCBvZiBkYXRhKSwgY2VudGVycywgaXRlci5tYXgsIGFuZCBuc3RhcnQuIFRoZSBzZWNvbmQgb2YgdGhlc2UKfCAoY2VudGVycykgY2FuIGJlIGVpdGhlciBhIG51bWJlciBvZiBjbHVzdGVycyBvciBhIHNldCBvZiBpbml0aWFsCnwgY2VudHJvaWRzLiBUaGUgdGhpcmQsIGl0ZXIubWF4LCBzcGVjaWZpZXMgdGhlIG1heGltdW0gbnVtYmVyIG9mCnwgaXRlcmF0aW9ucyB0byBnbyB0aHJvdWdoLCBhbmQgbnN0YXJ0IGlzIHRoZSBudW1iZXIgb2YgcmFuZG9tIHN0YXJ0cwp8IHlvdSB3YW50IHRvIHRyeSBpZiB5b3Ugc3BlY2lmeSBjZW50ZXJzIGFzIGEgbnVtYmVyLgoKCmBgYHtyfQprbWVhbnMoZGF0YUZyYW1lLCBjZW50ZXIgPSAzKQprbU9iaiRpdGVyCmBgYAoKYGBge3J9CiBwbG90KHgseSwgY29sID1rbU9iaiRjbHVzdGVyLCBwY2g9IDE5LCBjZXggPSAyICkKcG9pbnRzKGttT2JqJGNlbnRlcnMsIGNvbD1jKCJibGFjayIsInJlZCIsImdyZWVuIiksIHBjaCA9IDMsIGNleCA9IDMsIGx3ZCA9IDMpCmBgYAoKCgpgYGB7cn0KcGxvdCh4LHksIGNvbCA9ICBrbWVhbnMoZGF0YUZyYW1lLDYpJGNsdXN0ZXIsIHBjaCA9IDE5LCBjZXggPSAyICkKCmBgYAoKCgoKCgoKCgoKCgoKCgoKCg==