Installation

To download and install this package,

install.packages("devtools")
library(devtools)
install_github("kkdey/CountClust")

Load the package in R

library(CountClust)

Data Format

CountClust takes as input a counts table with samples along the rows (which we want to cluster) and features along the columns (which represents the dimensionality of the clusters). The package fits a topic mixed membersjip model on the samples based on these features and then visulaize and annotate these clusters to extract the top features driving each cluster.

Data Loading and Preprocessing

Load the counts data (samples along the rows, variables or features along the columns) in R. Here we use a test data comprising of the brain samples in GTEX V6 data

brain_data <- t(data.frame(data.table::fread("test/cis_gene_expression_brain.txt"))[,-(1:2)]);
dim(brain_data)

As data preprocessing step, once can delete or substitute the columns that contain NAs. If for a feature, the proportion of NAs is greater than threshold proportion (thresh_prop), then we remove the feature, otherwise we use MAR substitution scheme using the distribution of the non NA values for the feature.

handleNA(data, thresh_prop=0)
handleNA(data, thresh_prop=0.2)

Next we remove features which are too sparse (sparsity threshold provided by the user).This function deals with zero counts in the counts dataset. If for a feature, the proportion of zeros across the samples is greater than filter_prop, then we remove the feature.

RemoveSparseFeatures(data,filter_prop=0.5)

Model fitting

Once the data preprocessing step is completed, we perform the mixed membership clustering of the samples based on the counts expression values across the features. We use the R package maptpx package due to Matt Taddy. The StructureObj() function performs this step and saves the model output as rda file in the path provided by the user.

Assume that metadata is a n (samples) by t (metadata labels) matrix that contains values for each metadata recorded in each column of this matrix.

We load the brain metadata (here t=1) for our example.

metadata <- cbind.data.frame(read.table("test/metadata_brain.txt")[,3]);
colnames(metadata) <- "brain_labs"
dim(metadata)

Before applying the model, create the directories to save the rda file and the cluster plots.

if(!dir.exists("test/Structure")) dir.create("test/Structure")
if(!dir.exists("test/Structure")) dir.create("test/Structure")

If the user wants to fit clustering for 2,3,4

clus_vec <- 3;
StructureObj(brain_data, nclus_vec=clus_vec, samp_metadata = metadata, tol=0.1, batch_lab = NULL,
             plot=TRUE, path_rda="test/topics_data.rda", path_struct="test/Structure")
clus_vec <- c(2,3,4);
StructureObj(brain_data, nclus_vec=clus_vec, samp_metadata = metadata, tol=0.1, batch_lab = NULL,
             plot=FALSE, path_rda="test/topics_data.rda", path_struct="test/Structure")
                         

If plot=TRUE, the above function will create the Structure plot visualization of the clusters, one for each of the cluster values selected and arrangement of bars due to each of the metadata columns.

This function will output two things:

The rda file contains the W (the topic proportion matrix) and T (topic distribution) files, so if you need to change the Structure plot, the user need not rerun the StructureObj() fit again and you can just load the W matrix in rda file into out StrutureObj_omega() function.

brain_structure <- get(load("test/topics_data.rda"));
StructureObj_omega(brain_structure[[1]]$omega, samp_metadata = brain_metadata, 
                          batch_lab = NULL, path_struct="test/Structure",
                          control=list(cex.axis=1, struct.width=500, struct.height=500))

Then go to test/Structure and you will be able to see the Structure plot for K=3 with width and the height of the Structure plot modified.

The Structure plot generated looks like the following

Cluster annotations

To extract the variables that drive the clusters (cluster annotation)

theta <- brain_structure[[1]]$theta;
features <- ExtractTopFeatures(theta,top_features=50,method="poisson")

It will provide you with a list of top 50 variables/features per cluster that drive that cluster to be different from the rest.

Version

0.0.1

Licenses

The CountClust package is distributed under [GPL - General Public License (>= 2)]

Contact

For any queries, contact kkdey@uchicago.edu

Acknowledgements