What is Cluster Analysis?
Cluster analysis is the process of grouping observations of similar kinds into smaller groups within a larger population.
Let's see an example. Suppose you own 8 ice cream shops and you sell two flavors, chocolate and vanilla.
The table below shows the sales of both chocolate and vanilla across the 8 stores.
Cluster analysis is one of many ways to make sense of this data, alongside summary statistics such as the mean and the spread.
To do so, you can plot the data on a chart.
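The original table of figures isn't reproduced here, so here is a minimal sketch in the same spirit (the numbers and the `sales` data frame are made up, purely for illustration):

# Hypothetical sales figures for the 8 stores (made-up numbers, for illustration only)
sales = data.frame(
  store     = paste0("Store ", 1:8),
  vanilla   = c(100, 120, 110, 105, 290, 310, 305, 300),
  chocolate = c(90, 110, 95, 100, 310, 330, 320, 315)
)
sales   # The "table" of sales per store
plot(sales$vanilla, sales$chocolate,
     xlab = "Vanilla sales", ylab = "Chocolate sales",
     pch = 19, col = "blue")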
Each dot in the graph represents a store: on the Y axis we have the number of chocolate sales for each store, and on the X axis the number of vanilla sales for that same store.
Looking at the graph, you can see that the 8 stores can be divided into two groups that behave slightly differently in terms of the magnitude of their sales: in Group 1 the sales are lower than in Group 2.
What we've done is look at the sales of 8 stores for these two flavors of ice cream, plot them on a graph, and divide the stores into two groups based on their proximity to each other.
This is essentially how clustering works: a very simple two-dimensional example, but one that captures the idea accurately.
Now, to understand it better, let's take a closer, quite intuitive look at what we did.
It is as if we had taken an imaginary point in Group 1, central to the other plotted points, and drawn a circle around it; we did something similar for Group 2. All observations that fall inside a circle are grouped together into one cluster.
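That imaginary central point is what clustering algorithms call a centroid. As a sketch of the same idea (reusing the hypothetical `sales` data frame from above), k-means can find those central points for us:

# Let k-means find the two "imaginary central points" (centroids)
km = kmeans(sales[, c("vanilla", "chocolate")], centers = 2)
km$centers                                      # One centroid per group
plot(sales$vanilla, sales$chocolate, col = km$cluster, pch = 19)
points(km$centers, pch = 3, cex = 3, lwd = 2)   # Mark the centroids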
Now imagine that you have expanded to, say, 30 flavors. How would you plot that information? You cannot draw a 30-dimensional graph. Even worse, imagine that you have grown into a chain with 500 stores: we might be talking about thousands of points.
However, there is a mathematical way of dealing with such complexity, and that is exactly what cluster analysis does: it works with distances between observations, no matter how many dimensions those observations have.
To be continued…
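What follows are my notes from the Coursera Exploratory Data Analysis course on how this is done in R. First, hierarchical clustering: the snippet below simulates 12 points in three loose groups and computes the matrix of pairwise distances, which is the raw material hierarchical clustering works from.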
set.seed(1234)
par(mar = c(0, 0, 0, 0))
# Simulate 12 points in three loose groups around (1,1), (2,2) and (3,1)
x = rnorm(12, mean = rep(1:3, each = 4), sd = 0.2)
y = rnorm(12, mean = rep(c(1, 2, 1), each = 4), sd = 0.2)
plot(x, y, col = "blue", pch = 19, cex = 2)
text(x + 0.05, y + 0.05, labels = as.character(1:12))   # Label each point 1..12
dataFrame = data.frame(x = x, y = y)
dist(dataFrame)   # Distance matrix (defaults to Euclidean distance)
## 1 2 3 4 5 6
## 2 0.34120511
## 3 0.57493739 0.24102750
## 4 0.26381786 0.52578819 0.71861759
## 5 1.32829783 1.03674741 0.91735828 1.55702849
## 6 1.34289555 1.06377512 0.96021156 1.57849918 0.08150268
## 7 1.12653104 0.84893905 0.75865878 1.36197257 0.21110433 0.21666557
## 8 1.29969142 0.95848707 0.73404628 1.45063548 0.61704200 0.69791931
## 9 2.13629539 1.83167669 1.67835968 2.35675598 0.81160529 0.81322878
## 10 2.06419586 1.76999236 1.63109790 2.29239480 0.73617872 0.72567124
## 11 2.14702468 1.85183204 1.71074417 2.37461984 0.81885779 0.80884612
## 12 2.05664233 1.74662555 1.58658782 2.27232243 0.74039824 0.75094539
## 7 8 9 10 11
## 2
## 3
## 4
## 5
## 6
## 7
## 8 0.65062566
## 9 1.02071213 1.09596506
## 10 0.93949958 1.09784758 0.14090406
## 11 1.02259080 1.16375491 0.11624471 0.08317570
## 12 0.95130649 0.99022086 0.10848966 0.19128645 0.20802789
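Each entry is the Euclidean distance between a pair of points. The smallest entry, roughly 0.0815 between points 5 and 6, tells us those two points will be the first pair merged by the hierarchical algorithm.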
distxy = dist(dataFrame)
hClustering = hclust(distxy)   # Agglomerative hierarchical clustering
plot(hClustering)              # Dendrogram of the 12 simulated points
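The dendrogram itself doesn't decide how many clusters there are; to get hard cluster assignments you cut the tree at a chosen number of groups with cutree:

cutree(hClustering, k = 3)   # One cluster label (1, 2 or 3) per point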
# Hierarchical clustering on a real dataset: the built-in mtcars data
d <- dist(as.matrix(mtcars))
hc <- hclust(d)
plot(hc)
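Here each leaf is a car model, and cars with similar specifications merge early. To highlight a candidate grouping directly on the dendrogram you just drew, rect.hclust draws boxes around a chosen number of clusters (4 here is an arbitrary choice):

rect.hclust(hc, k = 4, border = "red")   # Outline 4 clusters on the open dendrogram plot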
load("samsungData.rda")
samsungData <- transform(samsungData, activity = factor(activity))
sub1 <- subset(samsungData, subject == 1)
source("myplclust.R")
distanceMatrix <- dist(sub1[, 1:3])
hclustering <- hclust(distanceMatrix)
myplclust(hclustering, lab.col = unclass(sub1$activity))
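myplclust is a helper function distributed with the course materials; it draws the same dendrogram as plot(hclustering) would, but colours each leaf label by the activity the subject was performing, making it easy to spot which activities separate cleanly and which overlap.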
# Dendrogram and heatmap example: a small table of test timings read from a CSV file
dendoheat = read.table("dendogramandheatmap.csv", sep = ";", header = TRUE)
head(dendoheat)
## Test Time.1 Time.2 Time.3
## 1 Test A 2.02 3.21 5.57
## 2 Test B 2.92 4.37 6.02
## 3 Test C 2.64 5.02 7.19
## 4 Test D 2.37 3.48 8.21
## 5 Test E 2.21 3.12 5.38
## 6 Test F 2.43 3.84 6.47
row.names(dendoheat) = dendoheat$Test           # Use the test names as row labels
dendoheat = subset(dendoheat, select = -Test)   # Drop the now-redundant Test column
dendoheatmatrix = as.matrix(dendoheat)          # heatmap() requires a numeric matrix
heatmap(dendoheatmatrix)
heatmap(dendoheatmatrix, col = rev(heat.colors(256)))   # Reversed palette: larger values shown redder
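One detail worth knowing: heatmap() rescales each row before colouring (its default is scale = "row"), which can hide or exaggerate differences when that isn't what you want; passing scale = "none" colours the raw values instead:

heatmap(dendoheatmatrix, scale = "none")   # Colour raw values rather than row-scaled ones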
# K-means clustering
set.seed(1234)
par(mar=c(0,0,0,0))
# Same simulated data as before: 12 points in three loose groups
x = rnorm(12, mean = rep(1:3, each = 4), sd = 0.2)
y = rnorm(12, mean = rep(c(1, 2, 1), each = 4), sd = 0.2)
### Implementing kmeans
dataframe = data.frame(x, y)
kmeansobj = kmeans(dataframe, centers = 3)   # Ask k-means for 3 clusters
names(kmeansobj)
## [1] "cluster" "centers" "totss" "withinss"
## [5] "tot.withinss" "betweenss" "size" "iter"
## [9] "ifault"
kmeansobj$cluster
## [1] 3 3 3 3 1 1 1 1 2 2 2 2
kmeansobj$centers
## x y
## 1 1.9906904 1.0078229
## 2 2.8534966 0.9831222
## 3 0.8904553 1.0068707
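Each row of $centers is the (x, y) coordinate of one centroid. Notice how the x coordinates (about 0.89, 1.99 and 2.85) line up with the means 1, 2 and 3 used to simulate the data; the cluster numbering itself is arbitrary.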
### Plotting the values in a kmeans
par(mar = rep(0.2, 4))
plot(x, y, col = kmeansobj$cluster, pch = 19, cex = 2)            # Colour each point by its cluster
points(kmeansobj$centers, col = 1:3, pch = 3, cex = 3, lwd = 2)   # Mark the 3 centroids
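Because k-means starts from random centroids, a single run can settle into a poor local optimum. A common safeguard (not part of the original snippet) is to restart it several times with nstart and keep the best result:

kmeansobj = kmeans(dataframe, centers = 3, nstart = 20)   # 20 random starts, best one kept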
### Heatmaps
set.seed(1234)
datamatrix = as.matrix(dataframe)[sample(1:12), ]   # Shuffle the rows
kmeansobj2 = kmeans(datamatrix, centers = 3)
par(mfrow = c(1, 2), mar = c(2, 4, 0.1, 0.1))
image(t(datamatrix)[, nrow(datamatrix):1], yaxt = "n")          # Left: rows in (shuffled) original order
image(t(datamatrix)[, order(kmeansobj2$cluster)], yaxt = "n")   # Right: rows reordered by cluster
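In the left panel the rows appear in their shuffled order; in the right panel they are reordered by cluster assignment, so the three groups found by k-means show up as contiguous bands of similar colour. This reordering of rows is essentially what heatmap() automates with its dendrograms.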