Unsupervised Learning - Clustering Methods using K-means

Introduction

Clustering refers to a very broad set of techniques for finding sub-groups, or clusters, in a data set. Humans like to define things and seperate them into sub-groups, so that observations in the same group are similar to each other and those in different groups are more different to each other. Clustering seeks to find homogenous sub-groups by looking for structure within a dataset using unsupervised methods.

We use a simple example from an experiment completed during my PhD. The experiment involved injecting female beetles with double-stranded RNA (dsRNA) of a gene involved with sex-determination (Tctra). The females were then mated with normal males, the offspring of the cross were counted and sexed. RNA interference occurs when this injected dsRNA can mess up normal gene expression. This may manifest itself as non-Mendelian sex ratios or death. We use the glimpse() function to have a look at the dataframe.

## [1] "C:/Users/mammykins/Google Drive/R/blog_phd"

## Observations: 40
## Variables: 5
## $ treatment        (fctr) A, A, A, A, A, A, A, A, B, B, B, B, B, B, B,...
## $ male             (int) 1, 1, 45, 2, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1,...
## $ female           (int) 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 2, 0, ...
## $ corrected_male   (dbl) 0, 0, 45, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ corrected_female (dbl) 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...

The data

An female was injected with one out of four potential dsRNA treatments, with an inbalanced design (due to variation in mortality). C and nodsrna are controls, A and B are expected to induce RNAi based on prior biological knowledge (not relevant here, but interesting!).

table(tidydata$treatment)

## 
##       A       B       C nodsrna 
##       8      13       8      11

Objective

Conduct K-means clustering on a simple dataset.
Plot the data and their respective groups.
Compare clusters with dsRNA injection treatment.

K-means clustering

We want to partition the observations into K clusters such that the within-cluster variation, summed over all K clusters, is as small as possible. We can use our biological background knowledge to specificy the expected number of clusters. Typically mated females produce both males and females in a 1:1 ratio. Sometimes no offspring are produced due to the physical damage assoicated with the injection. Thirdly we can test that the Tctra dsRNA causes RNA interference and prevents the female sex determination cascade thus producing only males. Thus we set the argument k = 3 in the kmeans() function.

#ANALYSIS
x <- cbind(tidydata$corrected_male, tidydata$corrected_female)
km.out <- kmeans(x = x, centers = 3, nstart = 50)
km.out

## K-means clustering with 3 clusters of sizes 29, 10, 1
## 
## Cluster means:
##         [,1]       [,2]
## 1  0.4137931  0.4137931
## 2 24.7000000 27.5000000
## 3 45.0000000  0.0000000
## 
## Clustering vector:
##  [1] 1 1 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 2 2 2 2 1 1
## [36] 2 2 2 2 2
## 
## Within cluster sum of squares by cluster:
## [1] 278.069 824.600   0.000
##  (between_SS / total_SS =  91.1 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"    
## [5] "tot.withinss" "betweenss"    "size"         "iter"        
## [9] "ifault"

tidydata$cluster <- factor(km.out$cluster)  # add the cluster variable to our dataframe
centers <- as.data.frame(km.out$centers)  # create centers dataframe for ggplot2, later

Interestingly only one observation was assigned to cluster 1, ten were assigned to cluster 2 and 29 to cluster 3. Most of the injected females failed to produce viable offspring.

Our data is suitable for this type of use as when the beetles breed they can give rise to both males and females. This can be represented on two-dimensions, or a typical scatter plot. Let’s plot our treatments by their first letter and distinguish which of the three clusters the observation has been assigned to using colour.

# Convert from letters to UTF-8 encoded "shapes", issue with legend
uniqueInitials <- c("A", "B", "C", "N")
initialShapes <- unlist(lapply(uniqueInitials, utf8ToInt))

Now let’s compare the observations treatment with the 3-means clustering and see how it grouped the observations. Can we detect any interesting patterns? One of the observations appears to be misclassified, compared to my gut instinct classifier.

#OUTPUT
p1 <- ggplot(tidydata, aes(x = corrected_male, y = corrected_female)) +
  geom_jitter(aes(shape = factor(tidydata$treatment),
                 colour = cluster),
              alpha = I(0.8), size = 5) +
  xlab("male") + ylab("female") +
  scale_shape_manual(values = initialShapes) +
  theme_bw() +
  theme(axis.title.x = element_text(face = "bold", colour = "black", size = 15),
        axis.title.y = element_text(face = "bold", colour = "black", size = 15),
        legend.position = "none"
  ) 
p1

Roughly it looks like we have lots of crosses that were sterile producing no offspring, a dozen that have approximately a 1:1 sex ratio and a single outlier of 44 sons and no daughters! For an explanation of the biology see the papers by Shukla & Palli, or my thesis.

We can add the group centres and a arbitary sized semi-transparent halo around the centres to emphasize the place of the centre and its role in classifying the observations into clusters.

#OUTPUT
p2 <- ggplot(tidydata, aes(x = corrected_male, y = corrected_female)) +
  geom_jitter(aes(shape = factor(tidydata$treatment),
                 colour = cluster),
              alpha = I(0.8), size = 5) +
  xlab("male") + ylab("female") +
  scale_shape_manual(values = initialShapes) +
  geom_point(data = centers, aes(x = V1, y = V2, color = "Center")) +  #Adds the centers
  geom_point(data = centers, aes(x = V1, y = V2,
                               color = "Center"), size = 12, alpha = .3) +  #Adds the halo
  theme_bw() +
  theme(axis.title.x = element_text(face = "bold", colour = "black", size = 15),
        axis.title.y = element_text(face = "bold", colour = "black", size = 15),
        legend.position = "none"
  ) 
p2

The default algorithm in kmeans() is that of Hartigan and Wong (1979), according to the help ?kmeans(). This method uses Euclidean distance, hence the observation at (12,12) being assigned to the red ~(0,0) center cluster, rather than the green.

Conclusion

We introduced the kmeans() function for grouping observations into clusters based on values in two dimensions. A simple biological example was used to elucidate the functions arguments and ggplot some of the output values.

References

James et al., (2014). An introduction to statistical learning with applications in R.
Shukla, J. N., & Palli, S. R. (2012). Sex determination in beetles: Production of all male progeny by Parental RNAi knockdown of transformer. Scientific Reports, 2, 1-9. http://doi.org/10.1038/srep00602
Hartigan, J. A. and Wong, M. A. (1979). A K-means clustering algorithm. Applied Statistics 28, 100-108.

sessionInfo()

## R version 3.2.3 (2015-12-10)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 10586)
## 
## locale:
## [1] LC_COLLATE=English_United Kingdom.1252 
## [2] LC_CTYPE=English_United Kingdom.1252   
## [3] LC_MONETARY=English_United Kingdom.1252
## [4] LC_NUMERIC=C                           
## [5] LC_TIME=English_United Kingdom.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] ggplot2_2.0.0 dplyr_0.4.3  
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.3      digest_0.6.9     assertthat_0.1   plyr_1.8.3      
##  [5] grid_3.2.3       R6_2.1.1         gtable_0.1.2     DBI_0.3.1       
##  [9] formatR_1.2.1    magrittr_1.5     scales_0.3.0     evaluate_0.8    
## [13] stringi_1.0-1    lazyeval_0.1.10  rmarkdown_0.9.2  labeling_0.3    
## [17] tools_3.2.3      stringr_1.0.0    munsell_0.4.2    yaml_2.1.13     
## [21] parallel_3.2.3   colorspace_1.2-6 htmltools_0.3    knitr_1.12