library(swirl)
swirl()
data clustering steps:
- guess how many clusters you have/want 2.randomly create a “centroid” (a phantom point) for each cluster and assign each point or observation in your dataset to the centroid to which it is closest. 3.readjust the centroid’s position by making it the average of the points assigned to it.
- recalculate the distance of the observations to the centroids and reassign any to the centroid closest to them.
- readjust the positions of the centroids based on the new cluster membership. The process stops once you reach an iteration in which no adjustments are made or when you’ve reached some predetermined maximum number of iterations.
Matrix of coordinates
cx<-c(1, 1.8, 2.5)
cy<-c(2, 1.0, 1.5)
cmat<- matrix(cx, cy)
cmat
[,1] [,2] [,3]
[1,] 1 1.8 2.5
[2,] 2 1.0 1.5
Putting the points on the graph
points(cx, cy, col = c("red", "orange", "purple"), pch = 3, cex = 2, lwd = 2)
3 centroind and 12 datapoints -> 36 distances
mdist function
mdist<-function(x,y,cx,cy){
distTmp <- matrix(NA,nrow=3,ncol=12)
distTmp[1,] <- (x-cx[1])^2 + (y-cy[1])^2
distTmp[2,] <- (x-cx[2])^2 + (y-cy[2])^2
distTmp[3,] <- (x-cx[3])^2 + (y-cy[3])^2
return(distTmp)
}
to assign a cluster to each point, we’ll look at each column and pick the minimum entry
apply(distTmp,2,which.min)
To colour the 12 data points according to their assignments
points(x,y, pch=19, cex = 2, col = cols1[newClust])
Now we have to recalculate our centroids so they are the average | (center of gravity) of the cluster of points assigned to them. We | have to do the x and y coordinates separately. To vectors x,y
tapply(x, newClust, mean)
tapply(y, newClust, mean)
points(newCx, newCy, col = cols1, pch = 8, cex = 2, lwd = 2)
mdist(x,y, newCx, newCy)
distTmp2
Now call apply with 3 arguments, distTmp2, 2, and which.min to find | the new cluster assignments for the points.
apply(distTmp2, 2, which.min)
newClust2
Now use tapply to find the x and y coordinate of the new centroid
tapply(x, newClust2, mean)
tapply(y, newClust2, mean)
finalCx
finalCy
new centroids
points(finalCx, finalCy, col = cols1, pch = 9, cex = 2, lwd = 2)
Now that you’ve gone through an example step by step, you’ll be
relieved to hear that R provides a command to do all this work for
you. Unsurprisingly it’s called kmeans and, although it has several
parameters, we’ll just mention four. These are x, (the numeric matrix
of data), centers, iter.max, and nstart. The second of these
(centers) can be either a number of clusters or a set of initial
centroids. The third, iter.max, specifies the maximum number of
iterations to go through, and nstart is the number of random starts
you want to try if you specify centers as a number.
kmeans(dataFrame, center = 3)
kmObj$iter
plot(x,y, col =kmObj$cluster, pch= 19, cex = 2 )
points(kmObj$centers, col=c("black","red","green"), pch = 3, cex = 3, lwd = 3)
plot(x,y, col = kmeans(dataFrame,6)$cluster, pch = 19, cex = 2 )
LS0tCnRpdGxlOiAiSy1tZWFucyIKb3V0cHV0OiBodG1sX25vdGVib29rCi0tLQoKYGBge3J9CmxpYnJhcnkoc3dpcmwpCnN3aXJsKCkKYGBgCgogZGF0YSBjbHVzdGVyaW5nIHN0ZXBzOgoKMS4gZ3Vlc3MgaG93IG1hbnkgY2x1c3RlcnMgeW91IGhhdmUvd2FudAoyLnJhbmRvbWx5IGNyZWF0ZSBhICJjZW50cm9pZCIgKGEgcGhhbnRvbSBwb2ludCkgZm9yIGVhY2ggY2x1c3RlciBhbmQgYXNzaWduIGVhY2ggcG9pbnQgb3Igb2JzZXJ2YXRpb24gaW4geW91ciBkYXRhc2V0IHRvIHRoZSBjZW50cm9pZCB0byB3aGljaCBpdCBpcyBjbG9zZXN0LgozLnJlYWRqdXN0IHRoZSBjZW50cm9pZCdzIHBvc2l0aW9uIGJ5IG1ha2luZyBpdCB0aGUgYXZlcmFnZSBvZiB0aGUgcG9pbnRzIGFzc2lnbmVkIHRvIGl0Lgo0LiByZWNhbGN1bGF0ZSB0aGUgZGlzdGFuY2Ugb2YgdGhlIG9ic2VydmF0aW9ucyB0byB0aGUgY2VudHJvaWRzIGFuZCByZWFzc2lnbiBhbnkgIHRvIHRoZSBjZW50cm9pZCBjbG9zZXN0IHRvIHRoZW0uCjUuIHJlYWRqdXN0IHRoZSBwb3NpdGlvbnMgb2YgdGhlIGNlbnRyb2lkcyBiYXNlZCBvbiB0aGUgbmV3IGNsdXN0ZXIgbWVtYmVyc2hpcC4KVGhlIHByb2Nlc3Mgc3RvcHMgb25jZSB5b3UgcmVhY2ggYW4gaXRlcmF0aW9uIGluIHdoaWNoIG5vIGFkanVzdG1lbnRzIGFyZSBtYWRlIG9yIHdoZW4geW91J3ZlIHJlYWNoZWQgc29tZSBwcmVkZXRlcm1pbmVkIG1heGltdW0gbnVtYmVyIG9mIGl0ZXJhdGlvbnMuCgoKTWF0cml4IG9mIGNvb3JkaW5hdGVzCmBgYHtyfQpjeDwtYygxLCAxLjgsICAyLjUpCmN5PC1jKDIsICAxLjAsICAxLjUpCmNtYXQ8LSBtYXRyaXgoY3gsIGN5KQpjbWF0CiAgICAgWywxXSBbLDJdIFssM10KWzEsXSAgICAxICAxLjggIDIuNQpbMixdICAgIDIgIDEuMCAgMS41CmBgYAoKClB1dHRpbmcgdGhlIHBvaW50cyBvbiB0aGUgZ3JhcGgKYGBge3J9CnBvaW50cyhjeCwgY3ksIGNvbCA9IGMoInJlZCIsICJvcmFuZ2UiLCAicHVycGxlIiksIHBjaCA9IDMsIGNleCA9IDIsIGx3ZCA9IDIpCmBgYAoKMyBjZW50cm9pbmQgYW5kIDEyIGRhdGFwb2ludHMgLT4gMzYgZGlzdGFuY2VzCgptZGlzdCBmdW5jdGlvbgpgYGB7cn0KbWRpc3Q8LWZ1bmN0aW9uKHgseSxjeCxjeSl7CiAgZGlzdFRtcCA8LSBtYXRyaXgoTkEsbnJvdz0zLG5jb2w9MTIpCiAgZGlzdFRtcFsxLF0gPC0gKHgtY3hbMV0pXjIgKyAoeS1jeVsxXSleMgogIGRpc3RUbXBbMixdIDwtICh4LWN4WzJdKV4yICsgKHktY3lbMl0pXjIKICBkaXN0VG1wWzMsXSA8LSAoeC1jeFszXSleMiArICh5LWN5WzNdKV4yICAKICByZXR1cm4oZGlzdFRtcCkKfQpgYGAKCnRvIGFzc2lnbiBhIGNsdXN0ZXIgdG8gZWFjaCBwb2ludCwgd2UnbGwgbG9vayBhdCBlYWNoIGNvbHVtbiBhbmQgcGljayB0aGUgbWluaW11bSBlbnRyeQoKCmBgYHtyfQphcHBseShkaXN0VG1wLDIsd2hpY2gubWluKQpgYGAKClRvIGNvbG91ciB0aGUgMTIgZGF0YSBwb2ludHMgYWNjb3JkaW5nIHRvIHRoZWlyIGFzc2lnbm1lbnRzCmBgYHtyfQpwb2ludHMoeCx5LCBwY2g9MTksIGNleCA9IDIsIGNvbCA9IGNvbHMxW25ld0NsdXN0XSkKYGBgCgogTm93IHdlIGhhdmUgdG8gcmVjYWxjdWxhdGUgb3VyIGNlbnRyb2lkcyBzbyB0aGV5IGFyZSB0aGUgYXZlcmFnZQp8IChjZW50ZXIgb2YgZ3Jhdml0eSkgb2YgdGhlIGNsdXN0ZXIgb2YgcG9pbnRzIGFzc2lnbmVkIHRvIHRoZW0uIFdlCnwgaGF2ZSB0byBkbyB0aGUgeCBhbmQgeSBjb29yZGluYXRlcyBzZXBhcmF0ZWx5LiBUbyB2ZWN0b3JzIHgseQoKYGBge3J9CnRhcHBseSh4LCBuZXdDbHVzdCwgbWVhbikKdGFwcGx5KHksIG5ld0NsdXN0LCBtZWFuKQpwb2ludHMobmV3Q3gsIG5ld0N5LCBjb2wgPSBjb2xzMSwgcGNoID0gOCwgY2V4ID0gMiwgbHdkID0gMikKYGBgCgoKYGBge3J9Cm1kaXN0KHgseSwgbmV3Q3gsIG5ld0N5KQpkaXN0VG1wMgpgYGAKCk5vdyBjYWxsIGFwcGx5IHdpdGggMyBhcmd1bWVudHMsIGRpc3RUbXAyLCAyLCBhbmQgd2hpY2gubWluIHRvIGZpbmQKfCB0aGUgbmV3IGNsdXN0ZXIgYXNzaWdubWVudHMgZm9yIHRoZSBwb2ludHMuCgpgYGB7cn0KYXBwbHkoZGlzdFRtcDIsIDIsIHdoaWNoLm1pbikKbmV3Q2x1c3QyCmBgYAoKTm93IHVzZSB0YXBwbHkgdG8gZmluZCB0aGUgeCBhbmQgeSBjb29yZGluYXRlIG9mIHRoZSBuZXcgY2VudHJvaWQKYGBge3J9CnRhcHBseSh4LCBuZXdDbHVzdDIsIG1lYW4pCnRhcHBseSh5LCBuZXdDbHVzdDIsIG1lYW4pCmZpbmFsQ3gKZmluYWxDeQpgYGAKCm5ldyBjZW50cm9pZHMKYGBge3J9CiBwb2ludHMoZmluYWxDeCwgZmluYWxDeSwgY29sID0gY29sczEsIHBjaCA9IDksIGNleCA9IDIsIGx3ZCA9IDIpCgpgYGAKCgp8IE5vdyB0aGF0IHlvdSd2ZSBnb25lIHRocm91Z2ggYW4gZXhhbXBsZSBzdGVwIGJ5IHN0ZXAsIHlvdSdsbCBiZQp8IHJlbGlldmVkIHRvIGhlYXIgdGhhdCBSIHByb3ZpZGVzIGEgY29tbWFuZCB0byBkbyBhbGwgdGhpcyB3b3JrIGZvcgp8IHlvdS4gVW5zdXJwcmlzaW5nbHkgaXQncyBjYWxsZWQga21lYW5zIGFuZCwgYWx0aG91Z2ggaXQgaGFzIHNldmVyYWwKfCBwYXJhbWV0ZXJzLCB3ZSdsbCBqdXN0IG1lbnRpb24gZm91ci4gVGhlc2UgYXJlIHgsICh0aGUgbnVtZXJpYyBtYXRyaXgKfCBvZiBkYXRhKSwgY2VudGVycywgaXRlci5tYXgsIGFuZCBuc3RhcnQuIFRoZSBzZWNvbmQgb2YgdGhlc2UKfCAoY2VudGVycykgY2FuIGJlIGVpdGhlciBhIG51bWJlciBvZiBjbHVzdGVycyBvciBhIHNldCBvZiBpbml0aWFsCnwgY2VudHJvaWRzLiBUaGUgdGhpcmQsIGl0ZXIubWF4LCBzcGVjaWZpZXMgdGhlIG1heGltdW0gbnVtYmVyIG9mCnwgaXRlcmF0aW9ucyB0byBnbyB0aHJvdWdoLCBhbmQgbnN0YXJ0IGlzIHRoZSBudW1iZXIgb2YgcmFuZG9tIHN0YXJ0cwp8IHlvdSB3YW50IHRvIHRyeSBpZiB5b3Ugc3BlY2lmeSBjZW50ZXJzIGFzIGEgbnVtYmVyLgoKCmBgYHtyfQprbWVhbnMoZGF0YUZyYW1lLCBjZW50ZXIgPSAzKQprbU9iaiRpdGVyCmBgYAoKYGBge3J9CiBwbG90KHgseSwgY29sID1rbU9iaiRjbHVzdGVyLCBwY2g9IDE5LCBjZXggPSAyICkKcG9pbnRzKGttT2JqJGNlbnRlcnMsIGNvbD1jKCJibGFjayIsInJlZCIsImdyZWVuIiksIHBjaCA9IDMsIGNleCA9IDMsIGx3ZCA9IDMpCmBgYAoKCgpgYGB7cn0KcGxvdCh4LHksIGNvbCA9ICBrbWVhbnMoZGF0YUZyYW1lLDYpJGNsdXN0ZXIsIHBjaCA9IDE5LCBjZXggPSAyICkKCmBgYAoKCgoKCgoKCgoKCgoKCgoKCg==