Clustering Techniques

In this document, I will quickly demonstrate some clustering techniques. The data will be from the acidosis.patients and all.mammals.milk.1956 datasets from the cluster.datasets package.

We will begin by using Principal Components Analysis (PCA) on the acidosis.patients dataset. First, we’ll load the data into memory, then get some idea of how the dataset is constructed. We will also copy the data into another object for the sake of brevity in the example.

library(cluster.datasets)
data(acidosis.patients)
set.seed(10)
acid <- acidosis.patients #Because writing out acidosis.patients continually can be a chore.
head(acid) #A quick view of some of the data.

##   ph.cerebrospinal.fluid ph.blood hco3.cerebrospinal.fluid hco3.blood
## 1                   39.8     38.0                     22.2       23.2
## 2                   53.7     37.2                     18.7       18.5
## 3                   47.3     39.8                     23.3       22.1
## 4                   41.7     37.6                     22.8       22.3
## 5                   44.7     38.5                     24.8       24.4
## 6                   47.9     39.8                     22.0       23.3
##   co2.cerebrospinal.fluid co2.blood
## 1                    38.8      36.5
## 2                    45.1      28.3
## 3                    48.2      36.4
## 4                    41.6      34.6
## 5                    48.5      38.8
## 6                    46.2      38.5

apply(acid,2,mean) #Mean values of each of the columns.

##   ph.cerebrospinal.fluid                 ph.blood hco3.cerebrospinal.fluid 
##                  47.5250                  41.5550                  21.9150 
##               hco3.blood  co2.cerebrospinal.fluid                co2.blood 
##                  22.7175                  45.5700                  36.4400

apply(acid,2,var) #Variance measures of each of the columns.

##   ph.cerebrospinal.fluid                 ph.blood hco3.cerebrospinal.fluid 
##                 14.06808                 74.54305                 22.93259 
##               hco3.blood  co2.cerebrospinal.fluid                co2.blood 
##                 54.95481                132.01138                114.57579

We will now implement PCA, creating a prcomp object using the prcomp() function, taking care to scale the data. Using the names() function, we can find the names of the components of the prcomp object. Among them are center and scale, which refer to the mean and standard deviation of the variables prior to implementing PCA. We can also find the rotation component which contains the loading vectors for the object.

acid.out <- prcomp(acid,scale=T)
names(acid.out)

## [1] "sdev"     "rotation" "center"   "scale"    "x"

# Mean and standard deviation of the variables prior to implementing PCA.
acid.out$center

##   ph.cerebrospinal.fluid                 ph.blood hco3.cerebrospinal.fluid 
##                  47.5250                  41.5550                  21.9150 
##               hco3.blood  co2.cerebrospinal.fluid                co2.blood 
##                  22.7175                  45.5700                  36.4400

acid.out$scale

##   ph.cerebrospinal.fluid                 ph.blood hco3.cerebrospinal.fluid 
##                 3.750744                 8.633832                 4.788798 
##               hco3.blood  co2.cerebrospinal.fluid                co2.blood 
##                 7.413151                11.489621                10.704008

# Rotation provides the PCA loadings. Each column corresponds to a loading vector.
acid.out$rotation

##                                 PC1         PC2         PC3         PC4
## ph.cerebrospinal.fluid    0.3023928 -0.59919446 -0.71562237  0.01823797
## ph.blood                 -0.2708919 -0.76609506  0.52164132 -0.20533114
## hco3.cerebrospinal.fluid  0.4557668  0.12556445  0.27440844 -0.01248491
## hco3.blood                0.4604927  0.06813693  0.04119528 -0.83305558
## co2.cerebrospinal.fluid   0.4661825 -0.09928598  0.12246270  0.41257395
## co2.blood                 0.4450540 -0.15426466  0.35182497  0.30520774
##                                  PC5        PC6
## ph.cerebrospinal.fluid    0.05223108 -0.1853334
## ph.blood                  0.15342065  0.0436751
## hco3.cerebrospinal.fluid  0.63678342 -0.5436554
## hco3.blood               -0.17054689  0.2419482
## co2.cerebrospinal.fluid   0.25761344  0.7219675
## co2.blood                -0.68759725 -0.2973328

dim(acid.out$x) #Dimensions of the vector score matrix.

## [1] 40  6

Next, we can print a scree plot the variance in each of the components as well as a biplot illustrating the effect of the variables on the first two components.

plot(acid.out)

biplot(acid.out,scale=0)

From the scree plot, we can see that the first component contains the largest amount of variance. From the biplot, we can tell that ph.cerebospinal.fluid and ph.blood are primary contributors.

We will continue with two other plots, looking at the variance in each of the components. First, we will compute the variance for each of the components.

## acid.out$sdev Standard Deviaton of the individual components.
acid.out.var <- acid.out$sdev^2
acid.out.var #Variance of the individual components.

## [1] 4.43444240 0.89452699 0.53304151 0.07584956 0.05679048 0.00534907

Afterward, we can figure out the proportion of variance explained by each of the components.

acid.out.pve <- acid.out.var/sum(acid.out.var)
acid.out.pve

## [1] 0.7390737330 0.1490878314 0.0888402510 0.0126415929 0.0094650801
## [6] 0.0008915117

Plots can then be made of the variance by the components, both individually and cumulatively.

plot(acid.out.pve,xlab='Principal Components',ylab='Proportion of Variance Explained',ylim=c(0,1),type='b',main='Proportion of Variance Explained by each Component')

plot(cumsum(acid.out.pve),xlab='Principal Components',ylab='Cumulative Proportion of Explained Variance',ylim=c(0,1),type='b',main='Cumulative Proportion of Explained Variance')

Next is an example of using K-Means for clustering. Since we already have the data, we will use the vector scores from the first two principal components from the acidosis.patients data in a matrix.

x1 <- acid.out$x[,1]
x2 <- acid.out$x[,2]
x <- cbind(x1,x2)
head(x)

##              x1          x2
## [1,] -0.7263617  1.61908227
## [2,] -0.2910016 -0.60174048
## [3,]  0.2354286  0.20015812
## [4,] -0.5248227  1.36169006
## [5,]  0.4641920  0.75415847
## [6,]  0.2407842  0.06826688

We can also make a quick plot of the data here.

plot(x1,x2)

Now, we will perform K-Means on the data with the kmeans() function, and we can observe how the data are clustered by looking at the cluster variable from the resulting object.

set.seed(11)
acid.out.km <- kmeans(x,2,nstart=20)
acid.out.km$cluster

##  [1] 1 1 1 1 1 1 1 1 1 2 2 1 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [36] 1 1 1 1 1

We can also use a plot to illustrate the groups that the data have been arranged into.

plot(x,col=(acid.out.km$cluster+1),main='K-Means Clustering result with k = 2',xlab='',pch=20,cex=2)

In addition, we can look at the total within cluster sum of squares, with lower values indicating tighter, and hopefully more appropriate groups.

acid.out.km$tot.withinss #For total within cluster sum of squares.

## [1] 99.83094

To reduce this figure, we can try to cluster the data around three clusters and observe the results.

acid.out.km2 <- kmeans(x,3,nstart=20)
acid.out.km2$cluster

##  [1] 2 2 2 2 2 2 2 2 2 1 2 2 1 1 1 1 2 3 2 2 3 3 3 3 3 3 3 3 3 3 3 3 2 2 2
## [36] 2 3 2 2 2

acid.out.km2$tot.withinss

## [1] 49.38404

We were successful in reducing the total within cluster sum of squares. A plot can then be made illustrating the assignments.

plot(x,col=(acid.out.km2$cluster+1),main='K-Means Clustering result with k = 3',xlab='',pch=20,cex=2)

We will now demonstrate an example of Hierarchial Clustering using the all.mammals.milk.1956 data. However, we will have to prepare the data, scaling the data using the scale() function.

data(all.mammals.milk.1956)
leche <- all.mammals.milk.1956 #An additional object copy of the original data, for convenience.
leche.scaled <- scale(leche[,2:6]) #The numerical data is scaled..
leche.labels <- leche$name #and the labels for the data are put into its on vector.
crema <- as.data.frame(cbind(leche.labels,leche.scaled)) #Both objects are combined into a data.frame.

We’ll use the hclust() function to cluster the data, based upon the distances provided by the dist() function. For this example, we will use the ‘complete’ type of linkage, but others are available. The clustered data are then organized in a tree plot.

crema.complete <- hclust(dist(crema[,2:6]),method='complete')
plot(crema.complete,labels=leche.labels,main='Complete Linkage',hang=-1,xlab='',cex=.9)

We can also find the number of members assigned to each group depending on the number of groups in question using the cutree() function. For example, we will look at the number of members for each group for clusters of two to five groups with the following code.

counts <- sapply(2:5,function(xl)table(cutree(crema.complete,xl)))
counts

## [[1]]
## 
##  1  2 
## 14 11 
## 
## [[2]]
## 
##  1  2  3 
## 14  9  2 
## 
## [[3]]
## 
##  1  2  3  4 
## 14  8  1  2 
## 
## [[4]]
## 
##  1  2  3  4  5 
## 14  5  1  3  2

Using cutree() and table() functions, we can also discern which members are in which groups at a certain level.

groups4 <- cutree(crema.complete,4)
table(groups4)

## groups4
##  1  2  3  4 
## 14  8  1  2

##Now, who are the fourteen members of the first group? 
as.character(crema$leche.labels)[groups4 == 1] #as.character() used because the names are of 'factor' data type.

##  [1] "Horse"     "Orangutan" "Monkey"    "Donkey"    "Hippo"    
##  [6] "Camel"     "Bison"     "Buffalo"   "Fox"       "Llama"    
## [11] "Mule"      "Zebra"     "Sheep"     "Elephant"

In addition, we can use the color_labels() function from the dendextend library to visually seperate the groups at a particular level by color.

library(dendextend)

## 
## Welcome to dendextend version 1.1.2
## 
## Type ?dendextend to access the overall documentation and
## browseVignettes(package = 'dendextend') for the package vignette.
## You can execute a demo of the package via: demo(dendextend)
## 
## More information is available on the dendextend project web-site:
## https://github.com/talgalili/dendextend/
## 
## Contact: <tal.galili@gmail.com>
## Suggestions and bug-reports can be submitted at: https://github.com/talgalili/dendextend/issues
## 
##          To suppress the this message use:
##          suppressPackageStartupMessages(library(dendextend))
## 
## 
## Attaching package: 'dendextend'
## 
## The following object is masked from 'package:stats':
## 
##     cutree

p1 <- hclust(dist(crema[,2:6]),method='complete')
p1$labels <- as.character(crema$leche.labels) #Adding labels to the cluster, otherwise you'll get numbers back
p1 <- color_labels(p1,4,col=c('red','blue','green','black')) #Cut the tree a four groups with four colors.
plot(p1,main='Fancy Plot')

Clustering Techniques

Pierre Phanor

December 30, 2015