Since dataset only contains a matrix of intensity values, we set header=FALSE.
The default in R assumes that the first row in the dataset, is the header. So, if we didn’t specify that we have no headers in this case, we’d have lost the information from the first row of the pixel intensity matrix.
Actually, R treats the rows as observations and the columns as variables.
flower = read.csv("flower.csv", header=FALSE)
str(flower)
## 'data.frame': 50 obs. of 50 variables:
## $ V1 : num 0.0991 0.0991 0.1034 0.1034 0.1034 ...
## $ V2 : num 0.112 0.108 0.112 0.116 0.108 ...
## $ V3 : num 0.134 0.116 0.121 0.116 0.112 ...
## $ V4 : num 0.138 0.138 0.121 0.121 0.112 ...
## $ V5 : num 0.138 0.134 0.125 0.116 0.112 ...
## $ V6 : num 0.138 0.129 0.121 0.108 0.112 ...
## $ V7 : num 0.129 0.116 0.103 0.108 0.112 ...
## $ V8 : num 0.116 0.103 0.103 0.103 0.116 ...
## $ V9 : num 0.1121 0.0991 0.1078 0.1121 0.1164 ...
## $ V10: num 0.121 0.108 0.112 0.116 0.125 ...
## $ V11: num 0.134 0.125 0.129 0.134 0.129 ...
## $ V12: num 0.147 0.134 0.138 0.129 0.138 ...
## $ V13: num 0.000862 0.146552 0.142241 0.142241 0.133621 ...
## $ V14: num 0.000862 0.000862 0.142241 0.133621 0.12931 ...
## $ V15: num 0.142 0.142 0.134 0.121 0.116 ...
## $ V16: num 0.125 0.125 0.116 0.108 0.108 ...
## $ V17: num 0.1121 0.1164 0.1078 0.0991 0.0991 ...
## $ V18: num 0.108 0.112 0.108 0.108 0.108 ...
## $ V19: num 0.121 0.129 0.125 0.116 0.116 ...
## $ V20: num 0.138 0.129 0.125 0.116 0.116 ...
## $ V21: num 0.138 0.134 0.121 0.125 0.125 ...
## $ V22: num 0.134 0.129 0.125 0.121 0.103 ...
## $ V23: num 0.125 0.1207 0.1164 0.1164 0.0819 ...
## $ V24: num 0.1034 0.1034 0.0991 0.0991 0.1034 ...
## $ V25: num 0.0948 0.0905 0.0905 0.1034 0.125 ...
## $ V26: num 0.0862 0.0862 0.0991 0.125 0.1422 ...
## $ V27: num 0.086207 0.086207 0.103448 0.12931 0.000862 ...
## $ V28: num 0.0991 0.1078 0.1164 0.1293 0.1466 ...
## $ V29: num 0.116 0.134 0.134 0.121 0.142 ...
## $ V30: num 0.121 0.138 0.142 0.129 0.138 ...
## $ V31: num 0.121 0.134 0.142 0.134 0.129 ...
## $ V32: num 0.116 0.134 0.129 0.116 0.112 ...
## $ V33: num 0.108 0.112 0.116 0.108 0.108 ...
## $ V34: num 0.1078 0.1078 0.1034 0.0991 0.1034 ...
## $ V35: num 0.1078 0.1034 0.0991 0.0991 0.0991 ...
## $ V36: num 0.1078 0.1034 0.1034 0.0905 0.0862 ...
## $ V37: num 0.1078 0.1078 0.1034 0.0819 0.0733 ...
## $ V38: num 0.0948 0.0991 0.0776 0.069 0.0733 ...
## $ V39: num 0.0733 0.056 0.0474 0.0474 0.056 ...
## $ V40: num 0.0474 0.0388 0.0431 0.0474 0.0603 ...
## $ V41: num 0.0345 0.0345 0.0388 0.0474 0.0647 ...
## $ V42: num 0.0259 0.0259 0.0345 0.0431 0.056 ...
## $ V43: num 0.0259 0.0259 0.0388 0.0517 0.0603 ...
## $ V44: num 0.0302 0.0302 0.0345 0.0517 0.0603 ...
## $ V45: num 0.0259 0.0259 0.0259 0.0388 0.0474 ...
## $ V46: num 0.0259 0.0172 0.0172 0.0259 0.0345 ...
## $ V47: num 0.01724 0.01724 0.00862 0.02155 0.02586 ...
## $ V48: num 0.0216 0.0129 0.0129 0.0172 0.0302 ...
## $ V49: num 0.0216 0.0216 0.0216 0.0345 0.0603 ...
## $ V50: num 0.0302 0.0345 0.0388 0.0603 0.0776 ...
There are V1,V2,…,V50. So, we need to convert the data type.
flowerMatrix = as.matrix(flower)
str(flowerMatrix)
## num [1:50, 1:50] 0.0991 0.0991 0.1034 0.1034 0.1034 ...
## - attr(*, "dimnames")=List of 2
## ..$ : NULL
## ..$ : chr [1:50] "V1" "V2" "V3" "V4" ...
We’d need to convert the matrix of pixel intensities to a vector that contains all the intensity values ranging from zero to one.
flowerVector = as.vector(flowerMatrix)
str(flowerVector)
## num [1:2500] 0.0991 0.0991 0.1034 0.1034 0.1034 ...
Below, it’s a wrong situation.
Therefore, converting the data to a matrix and then to the vector, is a crucual step.
flowerVector2 = as.vector(flower) # Wrong!
str(flowerVector2) # Wrong
## 'data.frame': 50 obs. of 50 variables:
## $ V1 : num 0.0991 0.0991 0.1034 0.1034 0.1034 ...
## $ V2 : num 0.112 0.108 0.112 0.116 0.108 ...
## $ V3 : num 0.134 0.116 0.121 0.116 0.112 ...
## $ V4 : num 0.138 0.138 0.121 0.121 0.112 ...
## $ V5 : num 0.138 0.134 0.125 0.116 0.112 ...
## $ V6 : num 0.138 0.129 0.121 0.108 0.112 ...
## $ V7 : num 0.129 0.116 0.103 0.108 0.112 ...
## $ V8 : num 0.116 0.103 0.103 0.103 0.116 ...
## $ V9 : num 0.1121 0.0991 0.1078 0.1121 0.1164 ...
## $ V10: num 0.121 0.108 0.112 0.116 0.125 ...
## $ V11: num 0.134 0.125 0.129 0.134 0.129 ...
## $ V12: num 0.147 0.134 0.138 0.129 0.138 ...
## $ V13: num 0.000862 0.146552 0.142241 0.142241 0.133621 ...
## $ V14: num 0.000862 0.000862 0.142241 0.133621 0.12931 ...
## $ V15: num 0.142 0.142 0.134 0.121 0.116 ...
## $ V16: num 0.125 0.125 0.116 0.108 0.108 ...
## $ V17: num 0.1121 0.1164 0.1078 0.0991 0.0991 ...
## $ V18: num 0.108 0.112 0.108 0.108 0.108 ...
## $ V19: num 0.121 0.129 0.125 0.116 0.116 ...
## $ V20: num 0.138 0.129 0.125 0.116 0.116 ...
## $ V21: num 0.138 0.134 0.121 0.125 0.125 ...
## $ V22: num 0.134 0.129 0.125 0.121 0.103 ...
## $ V23: num 0.125 0.1207 0.1164 0.1164 0.0819 ...
## $ V24: num 0.1034 0.1034 0.0991 0.0991 0.1034 ...
## $ V25: num 0.0948 0.0905 0.0905 0.1034 0.125 ...
## $ V26: num 0.0862 0.0862 0.0991 0.125 0.1422 ...
## $ V27: num 0.086207 0.086207 0.103448 0.12931 0.000862 ...
## $ V28: num 0.0991 0.1078 0.1164 0.1293 0.1466 ...
## $ V29: num 0.116 0.134 0.134 0.121 0.142 ...
## $ V30: num 0.121 0.138 0.142 0.129 0.138 ...
## $ V31: num 0.121 0.134 0.142 0.134 0.129 ...
## $ V32: num 0.116 0.134 0.129 0.116 0.112 ...
## $ V33: num 0.108 0.112 0.116 0.108 0.108 ...
## $ V34: num 0.1078 0.1078 0.1034 0.0991 0.1034 ...
## $ V35: num 0.1078 0.1034 0.0991 0.0991 0.0991 ...
## $ V36: num 0.1078 0.1034 0.1034 0.0905 0.0862 ...
## $ V37: num 0.1078 0.1078 0.1034 0.0819 0.0733 ...
## $ V38: num 0.0948 0.0991 0.0776 0.069 0.0733 ...
## $ V39: num 0.0733 0.056 0.0474 0.0474 0.056 ...
## $ V40: num 0.0474 0.0388 0.0431 0.0474 0.0603 ...
## $ V41: num 0.0345 0.0345 0.0388 0.0474 0.0647 ...
## $ V42: num 0.0259 0.0259 0.0345 0.0431 0.056 ...
## $ V43: num 0.0259 0.0259 0.0388 0.0517 0.0603 ...
## $ V44: num 0.0302 0.0302 0.0345 0.0517 0.0603 ...
## $ V45: num 0.0259 0.0259 0.0259 0.0388 0.0474 ...
## $ V46: num 0.0259 0.0172 0.0172 0.0259 0.0345 ...
## $ V47: num 0.01724 0.01724 0.00862 0.02155 0.02586 ...
## $ V48: num 0.0216 0.0129 0.0129 0.0172 0.0302 ...
## $ V49: num 0.0216 0.0216 0.0216 0.0345 0.0603 ...
## $ V50: num 0.0302 0.0345 0.0388 0.0603 0.0776 ...
distance = dist(flowerVector, method = "euclidean")
clusterIntensity = hclust(distance, method="ward.D2")
the ward method is a minimum variace method, which tries to find compact and spherical clusters. We can think about it as trying to minimize the variance within each cluster and the distance among clusters.
plot(clusterIntensity)
rect.hclust(clusterIntensity, k = 3, border = "red")
It seems like 2 Cluster or 3 Cluster is reasonable. So, after Hierarchical clustering, Select 3 clusters.
flowerClusters = cutree(clusterIntensity, k = 3)
flowerClusters
## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [35] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [69] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [103] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [137] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [171] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [205] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [239] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [273] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [307] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [341] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 3
## [375] 2 2 3 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [409] 1 1 1 1 1 1 1 1 1 2 2 1 1 1 3 3 3 3 3 2 1 2 3 2 1 1 1 1 1 1 1 1 1 1
## [443] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 3 1 1 1 3 3 3 3
## [477] 3 2 2 3 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [511] 1 1 1 1 1 1 2 3 3 2 1 1 3 3 3 3 3 2 3 3 3 1 2 3 3 1 1 1 1 1 1 1 1 1
## [545] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 3 2 1 2 3 3 3 1 1 3 3 3 3 3 2
## [579] 3 3 2 2 3 3 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [613] 2 3 3 2 1 3 3 3 2 1 2 3 3 3 3 3 3 3 2 3 3 3 3 3 3 3 1 1 1 1 1 1 1 1
## [647] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 3 3 2 2 3 3 3 1 2 3 3 3 3 3 3 3
## [681] 3 3 3 3 3 3 3 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2
## [715] 3 3 3 3 3 3 3 2 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 1 1 1 1 1 1 1 1 1 1
## [749] 1 1 1 1 1 1 1 1 1 1 1 2 3 2 1 1 2 3 3 3 3 3 3 3 2 3 3 3 3 3 3 3 3 3
## [783] 3 3 3 3 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 3 3 3 2 1 1
## [817] 3 3 3 3 3 3 2 2 3 3 3 3 3 3 3 3 3 3 3 2 1 1 2 2 3 3 1 1 1 1 1 1 1 1
## [851] 1 1 1 1 1 1 1 1 1 1 2 3 3 3 2 1 2 3 3 3 3 3 3 2 3 3 3 3 3 3 3 3 3 3
## [885] 2 1 2 3 3 3 3 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 3 3 3 3 2 2
## [919] 3 3 3 3 3 2 2 3 3 3 3 3 3 3 3 2 1 3 3 3 3 3 3 2 1 1 1 1 1 1 1 1 1 1
## [953] 1 1 1 1 1 1 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 2 2 2 2 2 3 3 3 3 2 2 3 3
## [987] 3 3 3 3 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 3 3 3 3 3 3 3 3 3
## [1021] 3 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [1055] 1 1 2 2 2 3 3 3 3 3 3 3 3 3 3 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 2 2
## [1089] 2 2 3 3 3 2 1 1 1 1 1 1 1 1 1 1 1 2 3 3 3 3 3 3 3 3 3 3 3 3 3 2 2 2
## [1123] 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 2 1 1 1 1 1 1 1 1 1 1 1 1
## [1157] 2 3 3 3 3 3 3 3 3 3 3 3 3 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 2
## [1191] 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 3 3 3 3 3 2 2 2 2 2 2
## [1225] 2 2 2 2 2 2 2 3 3 3 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [1259] 3 3 3 3 3 3 3 3 3 3 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1
## [1293] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 3 3 3 3 3 3 3 3 3 3 3 2 2 2 2 2 2 2
## [1327] 2 2 2 2 3 3 3 3 3 3 3 3 3 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 3 3
## [1361] 3 3 3 3 3 3 3 3 3 3 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 1
## [1395] 1 1 1 1 1 1 1 1 1 1 1 1 1 2 3 3 3 3 3 3 3 3 3 3 3 3 2 2 2 2 2 2 2 2
## [1429] 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 3 3 3
## [1463] 3 3 3 3 3 3 3 3 3 3 3 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3 2 1 1 1 1 1
## [1497] 1 1 1 1 1 1 1 1 1 1 1 1 3 3 3 3 3 3 3 2 3 3 3 3 3 3 3 2 2 3 3 3 2 3
## [1531] 3 3 3 3 3 2 3 3 3 3 3 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 3 3 2 2 1
## [1565] 2 3 3 3 3 3 3 3 3 2 2 3 3 3 3 2 3 3 3 3 3 2 1 2 2 2 2 2 1 1 1 1 1 1
## [1599] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 3 3 3 3 3 3 3 3 3 2 3 3 3 3 3 2 3 3
## [1633] 3 3 3 3 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 3 3 3
## [1667] 2 3 3 3 3 3 2 2 3 3 3 3 3 2 1 3 3 3 3 3 3 3 2 1 1 1 1 1 1 1 1 1 1 1
## [1701] 1 1 1 1 1 1 1 1 1 1 1 2 3 3 2 2 3 3 3 3 3 3 1 3 3 3 3 3 3 2 1 2 3 3
## [1735] 3 3 3 3 3 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 3 2 1 3 3 3
## [1769] 3 3 3 1 1 3 3 3 3 3 3 2 1 1 2 3 3 3 3 2 3 2 1 1 1 1 1 1 1 1 1 1 1 1
## [1803] 1 1 1 1 1 1 1 1 1 2 2 1 2 3 3 3 3 3 2 1 2 3 3 2 3 3 3 2 1 1 1 1 3 3
## [1837] 3 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 3 3 3 3 3
## [1871] 1 1 2 3 3 2 3 3 3 3 1 1 1 1 1 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [1905] 1 1 1 1 1 1 1 1 1 1 2 3 3 3 3 3 1 1 2 3 3 1 3 3 3 3 1 1 1 1 1 1 1 1
## [1939] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 3 3 2 1 1
## [1973] 2 3 3 1 2 3 3 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [2007] 1 1 1 1 1 1 1 1 1 1 1 2 3 1 1 1 1 3 3 1 1 3 3 3 1 1 1 1 1 1 1 1 1 1
## [2041] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2
## [2075] 3 1 1 1 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [2109] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [2143] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [2177] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [2211] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [2245] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [2279] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [2313] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [2347] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [2381] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [2415] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [2449] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [2483] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
flowerClusters vector is a vector that assigns each intensity value in the flower vector to a cluster.
tapply(flowerVector, flowerClusters, mean)
## 1 2 3
## 0.08574315 0.50826255 0.93147713
dim(flowerClusters) = c(50,50)
image(flowerClusters, axes = FALSE)
# Original image
image(flowerMatrix,axes=FALSE,col=grey(seq(0,1,length=256)))
healthy = read.csv("healthy.csv", header=FALSE)
healthyMatrix = as.matrix(healthy)
str(healthyMatrix)
## num [1:566, 1:646] 0.00427 0.00855 0.01282 0.01282 0.01282 ...
## - attr(*, "dimnames")=List of 2
## ..$ : NULL
## ..$ : chr [1:646] "V1" "V2" "V3" "V4" ...
image(healthyMatrix,axes=FALSE,col=grey(seq(0,1,length=256)))
healthVector is huge (n=566*646=365636) and R can’t allocate enough memory.
healthyVector = as.vector(healthyMatrix)
#distance = dist(healthyVector, method = "euclidean") # ERROR !!!
str(healthyVector)
## num [1:365636] 0.00427 0.00855 0.01282 0.01282 0.01282 ...
Stetting the number of clusters depends on exactly what you’re trying to extract from the image.
k = 5
set.seed(1) # To obtain the same clusters, set the seed
KMC = kmeans(healthyVector, centers = k, iter.max = 1000)
str(KMC)
## List of 9
## $ cluster : int [1:365636] 3 3 3 3 3 3 3 3 3 3 ...
## $ centers : num [1:5, 1] 0.4818 0.1062 0.0196 0.3094 0.1842
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:5] "1" "2" "3" "4" ...
## .. ..$ : NULL
## $ totss : num 5775
## $ withinss : num [1:5] 96.6 47.2 39.2 57.5 62.3
## $ tot.withinss: num 303
## $ betweenss : num 5472
## $ size : int [1:5] 20556 101085 133162 31555 79278
## $ iter : int 2
## $ ifault : int 0
## - attr(*, "class")= chr "kmeans"
The K-Means algorithm is actually quite fast even though we have a high resolution image.
healthyClusters = KMC$cluster
KMC$centers[2]
## [1] 0.1061945
Via K-Means’ property, there exist the mean of every clusters (i.e.,centroids) in the KMC$centers variables.
On the other hand, the hierarchical clustering should use tapply function to get the centroids.
dim(healthyClusters) = c(nrow(healthyMatrix), ncol(healthyMatrix))
image(healthyClusters, axes = FALSE, col=rainbow(k))
tumor = read.csv("tumor.csv", header=FALSE)
tumorMatrix = as.matrix(tumor)
tumorVector = as.vector(tumorMatrix)
#install.packages("flexclust")
library(flexclust)
## Loading required package: grid
## Loading required package: lattice
## Loading required package: modeltools
## Loading required package: stats4
KMC.kcca = as.kcca(KMC, healthyVector) # training set = healthyVector
tumorClusters = predict(KMC.kcca, newdata = tumorVector) # test set = tumorVector
The flexclust package contains the object class KCCA,which stands for K-Centroids Cluster Analysis. We need to convert the information from the clustering algorithm to an object of teh class KCCA.
In the tumor data, we’ll not run the K-Means algorithm. Instead, we’ll apply the K-Means Clustering results that we found using the healthy brain image on the tumor vector. In other words, we treat the healthy vector as training set and the tumor vector as a testing set.
dim(tumorClusters) = c(nrow(tumorMatrix), ncol(tumorMatrix))
image(tumorClusters, axes = FALSE, col=rainbow(k))
While dendrograms can be used to select the final number of clusters for Hierarchical Clustering, we can’t use dendrograms for K-Means Clustering. However, there are several other ways that the number of clusters can be selected. One common way to select the number of clusters is by using a scree plot, which works for any clustering algorithm.
A standard scree plot has the number of clusters on the x-axis, and the sum of the within-cluster sum of squares on the y-axis. We ideally want very small within-cluster sum of squares because this means that the points are all very close to their centroid.
To create the scree plot, the clustering algorithm is run with a range of values for the number of clusters. For each number of clusters, the within-cluster sum of squares can easily be extracted when using K-Means clustering.
KMC2 = kmeans(healthyVector, centers=2, iter.max=1000)
KMC3 = kmeans(healthyVector, centers=3, iter.max=1000)
KMC4 = kmeans(healthyVector, centers=4, iter.max=1000)
KMC5 = kmeans(healthyVector, centers=5, iter.max=1000)
KMC6 = kmeans(healthyVector, centers=6, iter.max=1000)
KMC7 = kmeans(healthyVector, centers=7, iter.max=1000)
KMC8 = kmeans(healthyVector, centers=8, iter.max=1000)
KMC9 = kmeans(healthyVector, centers=9, iter.max=1000)
KMC10 = kmeans(healthyVector, centers=10, iter.max=1000)
NumClusters = seq(2,10,1)
SumWithinss = c(sum(KMC2$withinss), sum(KMC3$withinss), sum(KMC4$withinss), sum(KMC5$withinss), sum(KMC6$withinss), sum(KMC7$withinss), sum(KMC8$withinss), sum(KMC9$withinss), sum(KMC10$withinss))
plot(NumClusters, SumWithinss, type="b")
To determine the best number of clusters using this plot, we want to look for a bend, or elbow, in the plot. This means that we want to find the number of clusters for which increasing the number of clusters further does not significantly help to reduce the within-cluster sum of squares. For this particular dataset, it looks like 4 or 5 clusters is a good choice. Beyond 5, increasing the number of clusters does not really reduce the within-cluster sum of squares too much.
You may have noticed it took a lot of typing to generate SumWithinss. In fact, R has powerful functions for repeating tasks with a different input (in this case running kmeans with different cluster sizes). For instance, we could generate SumWithinss with:
SumWithinss = sapply(2:10, function(x) sum(kmeans(healthyVector, centers = x, iter.max=1000)$withinss))