Unit 6 - Recitation

6.1 Flower Image Segmentation

Since dataset only contains a matrix of intensity values, we set header=FALSE.
The default in R assumes that the first row in the dataset, is the header. So, if we didn’t specify that we have no headers in this case, we’d have lost the information from the first row of the pixel intensity matrix.
Actually, R treats the rows as observations and the columns as variables.

flower = read.csv("flower.csv", header=FALSE)
str(flower)
## 'data.frame':    50 obs. of  50 variables:
##  $ V1 : num  0.0991 0.0991 0.1034 0.1034 0.1034 ...
##  $ V2 : num  0.112 0.108 0.112 0.116 0.108 ...
##  $ V3 : num  0.134 0.116 0.121 0.116 0.112 ...
##  $ V4 : num  0.138 0.138 0.121 0.121 0.112 ...
##  $ V5 : num  0.138 0.134 0.125 0.116 0.112 ...
##  $ V6 : num  0.138 0.129 0.121 0.108 0.112 ...
##  $ V7 : num  0.129 0.116 0.103 0.108 0.112 ...
##  $ V8 : num  0.116 0.103 0.103 0.103 0.116 ...
##  $ V9 : num  0.1121 0.0991 0.1078 0.1121 0.1164 ...
##  $ V10: num  0.121 0.108 0.112 0.116 0.125 ...
##  $ V11: num  0.134 0.125 0.129 0.134 0.129 ...
##  $ V12: num  0.147 0.134 0.138 0.129 0.138 ...
##  $ V13: num  0.000862 0.146552 0.142241 0.142241 0.133621 ...
##  $ V14: num  0.000862 0.000862 0.142241 0.133621 0.12931 ...
##  $ V15: num  0.142 0.142 0.134 0.121 0.116 ...
##  $ V16: num  0.125 0.125 0.116 0.108 0.108 ...
##  $ V17: num  0.1121 0.1164 0.1078 0.0991 0.0991 ...
##  $ V18: num  0.108 0.112 0.108 0.108 0.108 ...
##  $ V19: num  0.121 0.129 0.125 0.116 0.116 ...
##  $ V20: num  0.138 0.129 0.125 0.116 0.116 ...
##  $ V21: num  0.138 0.134 0.121 0.125 0.125 ...
##  $ V22: num  0.134 0.129 0.125 0.121 0.103 ...
##  $ V23: num  0.125 0.1207 0.1164 0.1164 0.0819 ...
##  $ V24: num  0.1034 0.1034 0.0991 0.0991 0.1034 ...
##  $ V25: num  0.0948 0.0905 0.0905 0.1034 0.125 ...
##  $ V26: num  0.0862 0.0862 0.0991 0.125 0.1422 ...
##  $ V27: num  0.086207 0.086207 0.103448 0.12931 0.000862 ...
##  $ V28: num  0.0991 0.1078 0.1164 0.1293 0.1466 ...
##  $ V29: num  0.116 0.134 0.134 0.121 0.142 ...
##  $ V30: num  0.121 0.138 0.142 0.129 0.138 ...
##  $ V31: num  0.121 0.134 0.142 0.134 0.129 ...
##  $ V32: num  0.116 0.134 0.129 0.116 0.112 ...
##  $ V33: num  0.108 0.112 0.116 0.108 0.108 ...
##  $ V34: num  0.1078 0.1078 0.1034 0.0991 0.1034 ...
##  $ V35: num  0.1078 0.1034 0.0991 0.0991 0.0991 ...
##  $ V36: num  0.1078 0.1034 0.1034 0.0905 0.0862 ...
##  $ V37: num  0.1078 0.1078 0.1034 0.0819 0.0733 ...
##  $ V38: num  0.0948 0.0991 0.0776 0.069 0.0733 ...
##  $ V39: num  0.0733 0.056 0.0474 0.0474 0.056 ...
##  $ V40: num  0.0474 0.0388 0.0431 0.0474 0.0603 ...
##  $ V41: num  0.0345 0.0345 0.0388 0.0474 0.0647 ...
##  $ V42: num  0.0259 0.0259 0.0345 0.0431 0.056 ...
##  $ V43: num  0.0259 0.0259 0.0388 0.0517 0.0603 ...
##  $ V44: num  0.0302 0.0302 0.0345 0.0517 0.0603 ...
##  $ V45: num  0.0259 0.0259 0.0259 0.0388 0.0474 ...
##  $ V46: num  0.0259 0.0172 0.0172 0.0259 0.0345 ...
##  $ V47: num  0.01724 0.01724 0.00862 0.02155 0.02586 ...
##  $ V48: num  0.0216 0.0129 0.0129 0.0172 0.0302 ...
##  $ V49: num  0.0216 0.0216 0.0216 0.0345 0.0603 ...
##  $ V50: num  0.0302 0.0345 0.0388 0.0603 0.0776 ...

There are V1,V2,…,V50. So, we need to convert the data type.


Change the data type to matrix

flowerMatrix = as.matrix(flower)
str(flowerMatrix)
##  num [1:50, 1:50] 0.0991 0.0991 0.1034 0.1034 0.1034 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : NULL
##   ..$ : chr [1:50] "V1" "V2" "V3" "V4" ...

Turn matrix into a vector

We’d need to convert the matrix of pixel intensities to a vector that contains all the intensity values ranging from zero to one.

flowerVector = as.vector(flowerMatrix)
str(flowerVector)
##  num [1:2500] 0.0991 0.0991 0.1034 0.1034 0.1034 ...

Below, it’s a wrong situation.
Therefore, converting the data to a matrix and then to the vector, is a crucual step.

flowerVector2 = as.vector(flower) # Wrong!
str(flowerVector2) # Wrong
## 'data.frame':    50 obs. of  50 variables:
##  $ V1 : num  0.0991 0.0991 0.1034 0.1034 0.1034 ...
##  $ V2 : num  0.112 0.108 0.112 0.116 0.108 ...
##  $ V3 : num  0.134 0.116 0.121 0.116 0.112 ...
##  $ V4 : num  0.138 0.138 0.121 0.121 0.112 ...
##  $ V5 : num  0.138 0.134 0.125 0.116 0.112 ...
##  $ V6 : num  0.138 0.129 0.121 0.108 0.112 ...
##  $ V7 : num  0.129 0.116 0.103 0.108 0.112 ...
##  $ V8 : num  0.116 0.103 0.103 0.103 0.116 ...
##  $ V9 : num  0.1121 0.0991 0.1078 0.1121 0.1164 ...
##  $ V10: num  0.121 0.108 0.112 0.116 0.125 ...
##  $ V11: num  0.134 0.125 0.129 0.134 0.129 ...
##  $ V12: num  0.147 0.134 0.138 0.129 0.138 ...
##  $ V13: num  0.000862 0.146552 0.142241 0.142241 0.133621 ...
##  $ V14: num  0.000862 0.000862 0.142241 0.133621 0.12931 ...
##  $ V15: num  0.142 0.142 0.134 0.121 0.116 ...
##  $ V16: num  0.125 0.125 0.116 0.108 0.108 ...
##  $ V17: num  0.1121 0.1164 0.1078 0.0991 0.0991 ...
##  $ V18: num  0.108 0.112 0.108 0.108 0.108 ...
##  $ V19: num  0.121 0.129 0.125 0.116 0.116 ...
##  $ V20: num  0.138 0.129 0.125 0.116 0.116 ...
##  $ V21: num  0.138 0.134 0.121 0.125 0.125 ...
##  $ V22: num  0.134 0.129 0.125 0.121 0.103 ...
##  $ V23: num  0.125 0.1207 0.1164 0.1164 0.0819 ...
##  $ V24: num  0.1034 0.1034 0.0991 0.0991 0.1034 ...
##  $ V25: num  0.0948 0.0905 0.0905 0.1034 0.125 ...
##  $ V26: num  0.0862 0.0862 0.0991 0.125 0.1422 ...
##  $ V27: num  0.086207 0.086207 0.103448 0.12931 0.000862 ...
##  $ V28: num  0.0991 0.1078 0.1164 0.1293 0.1466 ...
##  $ V29: num  0.116 0.134 0.134 0.121 0.142 ...
##  $ V30: num  0.121 0.138 0.142 0.129 0.138 ...
##  $ V31: num  0.121 0.134 0.142 0.134 0.129 ...
##  $ V32: num  0.116 0.134 0.129 0.116 0.112 ...
##  $ V33: num  0.108 0.112 0.116 0.108 0.108 ...
##  $ V34: num  0.1078 0.1078 0.1034 0.0991 0.1034 ...
##  $ V35: num  0.1078 0.1034 0.0991 0.0991 0.0991 ...
##  $ V36: num  0.1078 0.1034 0.1034 0.0905 0.0862 ...
##  $ V37: num  0.1078 0.1078 0.1034 0.0819 0.0733 ...
##  $ V38: num  0.0948 0.0991 0.0776 0.069 0.0733 ...
##  $ V39: num  0.0733 0.056 0.0474 0.0474 0.056 ...
##  $ V40: num  0.0474 0.0388 0.0431 0.0474 0.0603 ...
##  $ V41: num  0.0345 0.0345 0.0388 0.0474 0.0647 ...
##  $ V42: num  0.0259 0.0259 0.0345 0.0431 0.056 ...
##  $ V43: num  0.0259 0.0259 0.0388 0.0517 0.0603 ...
##  $ V44: num  0.0302 0.0302 0.0345 0.0517 0.0603 ...
##  $ V45: num  0.0259 0.0259 0.0259 0.0388 0.0474 ...
##  $ V46: num  0.0259 0.0172 0.0172 0.0259 0.0345 ...
##  $ V47: num  0.01724 0.01724 0.00862 0.02155 0.02586 ...
##  $ V48: num  0.0216 0.0129 0.0129 0.0172 0.0302 ...
##  $ V49: num  0.0216 0.0216 0.0216 0.0345 0.0603 ...
##  $ V50: num  0.0302 0.0345 0.0388 0.0603 0.0776 ...

Compute distances

distance = dist(flowerVector, method = "euclidean")

Hierarchical clustering

clusterIntensity = hclust(distance, method="ward.D2")

the ward method is a minimum variace method, which tries to find compact and spherical clusters. We can think about it as trying to minimize the variance within each cluster and the distance among clusters.


Plot the dendrogram

plot(clusterIntensity)
rect.hclust(clusterIntensity, k = 3, border = "red")

It seems like 2 Cluster or 3 Cluster is reasonable. So, after Hierarchical clustering, Select 3 clusters.


Assign intesity value to a cluster

flowerClusters = cutree(clusterIntensity, k = 3)
flowerClusters
##    [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##   [35] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##   [69] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [103] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [137] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [171] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [205] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [239] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [273] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [307] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [341] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 3
##  [375] 2 2 3 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [409] 1 1 1 1 1 1 1 1 1 2 2 1 1 1 3 3 3 3 3 2 1 2 3 2 1 1 1 1 1 1 1 1 1 1
##  [443] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 3 1 1 1 3 3 3 3
##  [477] 3 2 2 3 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [511] 1 1 1 1 1 1 2 3 3 2 1 1 3 3 3 3 3 2 3 3 3 1 2 3 3 1 1 1 1 1 1 1 1 1
##  [545] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 3 2 1 2 3 3 3 1 1 3 3 3 3 3 2
##  [579] 3 3 2 2 3 3 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [613] 2 3 3 2 1 3 3 3 2 1 2 3 3 3 3 3 3 3 2 3 3 3 3 3 3 3 1 1 1 1 1 1 1 1
##  [647] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 3 3 2 2 3 3 3 1 2 3 3 3 3 3 3 3
##  [681] 3 3 3 3 3 3 3 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2
##  [715] 3 3 3 3 3 3 3 2 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 1 1 1 1 1 1 1 1 1 1
##  [749] 1 1 1 1 1 1 1 1 1 1 1 2 3 2 1 1 2 3 3 3 3 3 3 3 2 3 3 3 3 3 3 3 3 3
##  [783] 3 3 3 3 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 3 3 3 2 1 1
##  [817] 3 3 3 3 3 3 2 2 3 3 3 3 3 3 3 3 3 3 3 2 1 1 2 2 3 3 1 1 1 1 1 1 1 1
##  [851] 1 1 1 1 1 1 1 1 1 1 2 3 3 3 2 1 2 3 3 3 3 3 3 2 3 3 3 3 3 3 3 3 3 3
##  [885] 2 1 2 3 3 3 3 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 3 3 3 3 2 2
##  [919] 3 3 3 3 3 2 2 3 3 3 3 3 3 3 3 2 1 3 3 3 3 3 3 2 1 1 1 1 1 1 1 1 1 1
##  [953] 1 1 1 1 1 1 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 2 2 2 2 2 3 3 3 3 2 2 3 3
##  [987] 3 3 3 3 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 3 3 3 3 3 3 3 3 3
## [1021] 3 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [1055] 1 1 2 2 2 3 3 3 3 3 3 3 3 3 3 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 2 2
## [1089] 2 2 3 3 3 2 1 1 1 1 1 1 1 1 1 1 1 2 3 3 3 3 3 3 3 3 3 3 3 3 3 2 2 2
## [1123] 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 2 1 1 1 1 1 1 1 1 1 1 1 1
## [1157] 2 3 3 3 3 3 3 3 3 3 3 3 3 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 2
## [1191] 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 3 3 3 3 3 2 2 2 2 2 2
## [1225] 2 2 2 2 2 2 2 3 3 3 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [1259] 3 3 3 3 3 3 3 3 3 3 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1
## [1293] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 3 3 3 3 3 3 3 3 3 3 3 2 2 2 2 2 2 2
## [1327] 2 2 2 2 3 3 3 3 3 3 3 3 3 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 3 3
## [1361] 3 3 3 3 3 3 3 3 3 3 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 1
## [1395] 1 1 1 1 1 1 1 1 1 1 1 1 1 2 3 3 3 3 3 3 3 3 3 3 3 3 2 2 2 2 2 2 2 2
## [1429] 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 3 3 3
## [1463] 3 3 3 3 3 3 3 3 3 3 3 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3 2 1 1 1 1 1
## [1497] 1 1 1 1 1 1 1 1 1 1 1 1 3 3 3 3 3 3 3 2 3 3 3 3 3 3 3 2 2 3 3 3 2 3
## [1531] 3 3 3 3 3 2 3 3 3 3 3 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 3 3 2 2 1
## [1565] 2 3 3 3 3 3 3 3 3 2 2 3 3 3 3 2 3 3 3 3 3 2 1 2 2 2 2 2 1 1 1 1 1 1
## [1599] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 3 3 3 3 3 3 3 3 3 2 3 3 3 3 3 2 3 3
## [1633] 3 3 3 3 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 3 3 3
## [1667] 2 3 3 3 3 3 2 2 3 3 3 3 3 2 1 3 3 3 3 3 3 3 2 1 1 1 1 1 1 1 1 1 1 1
## [1701] 1 1 1 1 1 1 1 1 1 1 1 2 3 3 2 2 3 3 3 3 3 3 1 3 3 3 3 3 3 2 1 2 3 3
## [1735] 3 3 3 3 3 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 3 2 1 3 3 3
## [1769] 3 3 3 1 1 3 3 3 3 3 3 2 1 1 2 3 3 3 3 2 3 2 1 1 1 1 1 1 1 1 1 1 1 1
## [1803] 1 1 1 1 1 1 1 1 1 2 2 1 2 3 3 3 3 3 2 1 2 3 3 2 3 3 3 2 1 1 1 1 3 3
## [1837] 3 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 3 3 3 3 3
## [1871] 1 1 2 3 3 2 3 3 3 3 1 1 1 1 1 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [1905] 1 1 1 1 1 1 1 1 1 1 2 3 3 3 3 3 1 1 2 3 3 1 3 3 3 3 1 1 1 1 1 1 1 1
## [1939] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 3 3 2 1 1
## [1973] 2 3 3 1 2 3 3 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [2007] 1 1 1 1 1 1 1 1 1 1 1 2 3 1 1 1 1 3 3 1 1 3 3 3 1 1 1 1 1 1 1 1 1 1
## [2041] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2
## [2075] 3 1 1 1 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [2109] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [2143] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [2177] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [2211] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [2245] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [2279] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [2313] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [2347] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [2381] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [2415] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [2449] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [2483] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

flowerClusters vector is a vector that assigns each intensity value in the flower vector to a cluster.


Find mean intensity values

tapply(flowerVector, flowerClusters, mean)
##          1          2          3 
## 0.08574315 0.50826255 0.93147713

Plot the image and the clusters

dim(flowerClusters) = c(50,50)
image(flowerClusters, axes = FALSE)

# Original image
image(flowerMatrix,axes=FALSE,col=grey(seq(0,1,length=256)))


6.2 MRI Image Segmentation

Let’s try this with an MRI image of the brain

healthy = read.csv("healthy.csv", header=FALSE)
healthyMatrix = as.matrix(healthy)
str(healthyMatrix)
##  num [1:566, 1:646] 0.00427 0.00855 0.01282 0.01282 0.01282 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : NULL
##   ..$ : chr [1:646] "V1" "V2" "V3" "V4" ...

Plot image

image(healthyMatrix,axes=FALSE,col=grey(seq(0,1,length=256)))

Hierarchial Clustering

healthVector is huge (n=566*646=365636) and R can’t allocate enough memory.

healthyVector = as.vector(healthyMatrix)
#distance = dist(healthyVector, method = "euclidean") # ERROR !!!

We have an error - why?

str(healthyVector)
##  num [1:365636] 0.00427 0.00855 0.01282 0.01282 0.01282 ...
  • n = 365636.
  • Pairwise distance computation complexity: n(n-1)/2=66844659430.
  • We can’t use hierarchical clusterinf due to heavy computation.
  • So, in the high resolution of the image, we need to use K-Means clustering algorithm.

K-Means Clustering

Specify number of clusters

Stetting the number of clusters depends on exactly what you’re trying to extract from the image.

k = 5

Run k-means

set.seed(1) # To obtain the same clusters, set the seed
KMC = kmeans(healthyVector, centers = k, iter.max = 1000)
str(KMC)
## List of 9
##  $ cluster     : int [1:365636] 3 3 3 3 3 3 3 3 3 3 ...
##  $ centers     : num [1:5, 1] 0.4818 0.1062 0.0196 0.3094 0.1842
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:5] "1" "2" "3" "4" ...
##   .. ..$ : NULL
##  $ totss       : num 5775
##  $ withinss    : num [1:5] 96.6 47.2 39.2 57.5 62.3
##  $ tot.withinss: num 303
##  $ betweenss   : num 5472
##  $ size        : int [1:5] 20556 101085 133162 31555 79278
##  $ iter        : int 2
##  $ ifault      : int 0
##  - attr(*, "class")= chr "kmeans"

The K-Means algorithm is actually quite fast even though we have a high resolution image.


Extract clusters

healthyClusters = KMC$cluster
KMC$centers[2]
## [1] 0.1061945

Via K-Means’ property, there exist the mean of every clusters (i.e.,centroids) in the KMC$centers variables.
On the other hand, the hierarchical clustering should use tapply function to get the centroids.


Plot the image with the clusters

dim(healthyClusters) = c(nrow(healthyMatrix), ncol(healthyMatrix))
image(healthyClusters, axes = FALSE, col=rainbow(k))


Apply to a test image

tumor = read.csv("tumor.csv", header=FALSE)
tumorMatrix = as.matrix(tumor)
tumorVector = as.vector(tumorMatrix)

Apply clusters from before to new image, using the flexclust package

#install.packages("flexclust")
library(flexclust)
## Loading required package: grid
## Loading required package: lattice
## Loading required package: modeltools
## Loading required package: stats4
KMC.kcca = as.kcca(KMC, healthyVector) # training set = healthyVector
tumorClusters = predict(KMC.kcca, newdata = tumorVector) # test set = tumorVector

The flexclust package contains the object class KCCA,which stands for K-Centroids Cluster Analysis. We need to convert the information from the clustering algorithm to an object of teh class KCCA.

In the tumor data, we’ll not run the K-Means algorithm. Instead, we’ll apply the K-Means Clustering results that we found using the healthy brain image on the tumor vector. In other words, we treat the healthy vector as training set and the tumor vector as a testing set.


Visualize the clusters

dim(tumorClusters) = c(nrow(tumorMatrix), ncol(tumorMatrix))
image(tumorClusters, axes = FALSE, col=rainbow(k))


SCREE PLOTS

While dendrograms can be used to select the final number of clusters for Hierarchical Clustering, we can’t use dendrograms for K-Means Clustering. However, there are several other ways that the number of clusters can be selected. One common way to select the number of clusters is by using a scree plot, which works for any clustering algorithm.

A standard scree plot has the number of clusters on the x-axis, and the sum of the within-cluster sum of squares on the y-axis. We ideally want very small within-cluster sum of squares because this means that the points are all very close to their centroid.

To create the scree plot, the clustering algorithm is run with a range of values for the number of clusters. For each number of clusters, the within-cluster sum of squares can easily be extracted when using K-Means clustering.

KMC2 = kmeans(healthyVector, centers=2, iter.max=1000)
KMC3 = kmeans(healthyVector, centers=3, iter.max=1000)
KMC4 = kmeans(healthyVector, centers=4, iter.max=1000)
KMC5 = kmeans(healthyVector, centers=5, iter.max=1000)
KMC6 = kmeans(healthyVector, centers=6, iter.max=1000)
KMC7 = kmeans(healthyVector, centers=7, iter.max=1000)
KMC8 = kmeans(healthyVector, centers=8, iter.max=1000)
KMC9 = kmeans(healthyVector, centers=9, iter.max=1000)
KMC10 = kmeans(healthyVector, centers=10, iter.max=1000)

NumClusters = seq(2,10,1)
SumWithinss = c(sum(KMC2$withinss), sum(KMC3$withinss), sum(KMC4$withinss), sum(KMC5$withinss), sum(KMC6$withinss), sum(KMC7$withinss), sum(KMC8$withinss), sum(KMC9$withinss), sum(KMC10$withinss))

plot(NumClusters, SumWithinss, type="b")


To determine the best number of clusters using this plot, we want to look for a bend, or elbow, in the plot. This means that we want to find the number of clusters for which increasing the number of clusters further does not significantly help to reduce the within-cluster sum of squares. For this particular dataset, it looks like 4 or 5 clusters is a good choice. Beyond 5, increasing the number of clusters does not really reduce the within-cluster sum of squares too much.


Note

You may have noticed it took a lot of typing to generate SumWithinss. In fact, R has powerful functions for repeating tasks with a different input (in this case running kmeans with different cluster sizes). For instance, we could generate SumWithinss with:

SumWithinss = sapply(2:10, function(x) sum(kmeans(healthyVector, centers = x, iter.max=1000)$withinss))