Clustering using k-means in R and plotting using Lattice

Data Analysis Topics Summary

Topics covered in this report:

How to cluster using kmeans
How to operate on only a subset of data from a data table using .SD and .SD cols
How to tidy data for plotting using rep and sapply
How to use the panel function with two different data sets
How to plot two different data sets in the same Lattice plot
Full raw data set available here
Github repository: https://github.com/mzhKU/Enzyme-Barrier-Clustering

Data Preparation

Clean and tidy data was required for the various analysis steps. Outliers are removed according to the rule that no data point > 50 kcal/mol is included. In a data.table object, this can be done on one line using:

detect_outliers <- function(d)
{
    which(rowSums(d[, .SD>max_interpolation_value, .SDcols=2:13])>0)
}

Using .SD, effectively every entry of the data set can be compared against the defined upper limit value. Futhermore, the function makes use of the fact that R (like e.g. Python) treats TRUE values as 1: In Python:

>>> (3 < 6) + 1
2

In R:

> (3 < 6) + 1
[1] 2

The raw data without outliers dout is in this format (data rounded for readability):

##    Mutant  F1  F2  F3  F4  F5   F6   F7   F8   F9 F10 F11 F12
## 1:  A115D 6.5 7.2 8.5 8.5 9.3 16.8 10.7 16.1 12.0 7.9 3.5   0
## 2:  A115F 0.9 4.3 5.7 6.6 5.9  9.4 14.6 17.4 14.2 8.9 2.0   0
## 3:  A115G 3.3 4.5 5.4 7.8 8.5 10.9 11.3 14.9 10.7 7.1 3.0   0

Each row is refered to as an individual entry and the values at interpolation frames F1, F2, … (i.e. the x-axis) are relative to F12, which is set as the zero point.

Total raw data looks like this (in total 303 individual entries):

[ADJUST: Increase label font size, “F1”, “F2”, … as x-axis labels, x-axis tics at each interpolation step]

Figure created using png device and the width and height options.

Each line (“profile”) is the calculated activation energy of an enzyme mutant (y-axis, in kcal/mol) along an interpolated reaction coordinate (x-axis, in interpolation steps). It appears as if overall, the profiles follow some sort of barrier shape when going from x=1 to x=12. What I wanted to see though was if there were regiments of profiles which resemble each other more than others.

In principle, such data could result also from sources like time series or industrial process data where certain groups of profiles would correspond to production within specification limits while others would be out of specification.

Clustering using `kmeans`

In this format, the (numerical) data can be clustered using the call

km <- kmeans(d[, .SD, .SDcols=2:13], centers=n_centers, nstart=25, iter=25)

Here, .SD indicates that one operates on a subset of data, consisting of columns 2:13. We generate n_centers clustering centers. The returned object offers, among more, the calculated cluster centers, assigned cluster per data set entry and the size of each cluster.

The clusters assigned by kmeans are appended to the data.table object as a new column:

dout[, cluster := km$cluster]

Preparing data frames for lattice plotting

Cluster centers:
For convenient manipulation, first the cluster centers are defined as a separate data.table object:

kmdt <- as.data.table(km$centers)

For plotting in Lattice, the data.table object is recast to a data frame and transposed using the t operator on columns 2:13. Additionally, the cluster center vector is expanded by a factor letter indicator for later grouping in the lattice plot. Finally, x-axis integers are added to the data frame:

kmdf <- data.frame(Cluster_fac = rep(LETTERS[1:n_centers], each=12),
                   Barrier_Cl  = c(round(kmdt[, t(.SD)], 5))
)
kmdf$x <- rep(1:12, times=length(LETTERS[1:n_centers]))

# > head(kmdf, 20)
#    Var2 Barrier_Cl  x
# 1     A    1.27873  1
# 2     A    1.83650  2
# [...]
# 12    A    0.00000 12
# 13    B   -2.55337  1
# 14    B   -1.78579  2
# [...]

The each option to the rep function is set to 12, because each cluster center has 12 components and then 12 times the number of cluster centers (n_centes) is added to the data frame. The kmdf object is the first data set used in the Lattice plot.

Individual data set entries:
The individual entries are prepared in a similar way for the overlay plot.

 iedf <- data.frame(x_axis    = rep(1:12, times=nrow(dout)),
                    id        = rep(1:nrow(dout), each=12, times=1),
                    Barrier_i = c(dout[, t(.SD), .SDcols=2:13]),
                    Cluster   = as.factor(c(sapply(km$cluster, rep, 12)))
 )
 
 # > head(iedf, 20)
 #    x_axis id Barrier_i Cluster
 # 1       1  1   6.53193       7
 # 2       2  1   7.19020       7
 # [...]
 # 13      1  2   0.90377      24
 # 14      2  2   4.33477      24
 # 15      3  2   5.71287      24
 # [...]

The id option to the second rep call is required such that in the lines plot of the xyplot, the line does not continue from the end of one entry to the start of the next entry. Every individual entry has an individual group id. The entries of the data.table can be combined to one single column vector using the c operator. Finally, a column vector of the assigned clusters is prepared (of the same length like the total data frame, 3636 entries). The iedf data frame is the second data set used in the Lattice plot.

Lattice plotting of two data sets

In the book on the lattice library, a chapter describes how to use the panel function to plot two data sets in the same lattice plot. The total function all looks like this:

xyplot(Barrier_Cl ~ x|Cluster_fac, data=kmdf, iedf_i=iedf,
       ylim=c(-20, 50), ylab="Reaction Energy [kcal/mol]",
       xlab="Reaction Coordinate", xlim=c(1, 12), strip=T,
       panel = function(x, y, iedf_i)
       {
            panel.xyplot(x=iedf_i[iedf_i$Cluster==panel.number(), ]$x_axis,
                         y=iedf_i[iedf_i$Cluster==panel.number(), ]$Barrier_i,
                         group=iedf_i$id, subscripts=TRUE, type="o", col="blue")
            panel.xyplot(x, y, type="l", col="red", lwd=2)
       }
)

First, the high level function call to xyplot gives sort of the definition of the plot. The data sets kmdf (cluster centers) and iedf (individual entries) are passed in seperately, iedf as an additional argument (with seperate identifier iedf_i).
Within the high-level function, the panel function is defined, with in turn calls the low level functions to construct the layered trellis plot. The panel function accepts x and y values and a data set, which is passed on to its inner panel.xyplot function.
The x and y in function(x, y, iedf_i) however are the top level x- and y-values from the kmdf data set. The x and y values from the iedf data set are defined only within the panel function and must be adjusted in length using the panel.number() inner function call (to match each selection to the currently plotted panel).

Background

The data is from my paper A computational method for the systematic screening of reaction barriers in enzymes: searching for Bacillus circulans xylanase mutants with greater activity towards a synthetic substrate (Link: 10.7717/peerj.111).