To download and install this package,
install.packages("devtools")
library(devtools)
install_github("kkdey/CountClust")
Load the package in R
library(CountClust)
CountClust takes as input a counts table with samples along the rows (which we want to cluster) and features along the columns (which represents the dimensionality of the clusters). The package fits a topic mixed membersjip model on the samples based on these features and then visulaize and annotate these clusters to extract the top features driving each cluster.
Load the counts data (samples along the rows, variables or features along the columns) in R. Here we use a test data comprising of the brain samples in GTEX V6 data
brain_data <- t(data.frame(data.table::fread("test/cis_gene_expression_brain.txt"))[,-(1:2)]);
dim(brain_data)
As data preprocessing step, once can delete or substitute the columns that contain NAs. If for a feature, the proportion of NAs is greater than threshold proportion (thresh_prop), then we remove the feature, otherwise we use MAR substitution scheme using the distribution of the non NA values for the feature.
handleNA(data, thresh_prop=0)
handleNA(data, thresh_prop=0.2)
Next we remove features which are too sparse (sparsity threshold provided by the user).This function deals with zero counts in the counts dataset. If for a feature, the proportion of zeros across the samples is greater than filter_prop, then we remove the feature.
RemoveSparseFeatures(data,filter_prop=0.5)
Once the data preprocessing step is completed, we perform the mixed membership clustering of the samples based on the counts expression values across the features. We use the R package maptpx package due to Matt Taddy. The StructureObj() function performs this step and saves the model output as rda file in the path provided by the user.
Assume that metadata is a n (samples) by t (metadata labels) matrix that contains values for each metadata recorded in each column of this matrix.
We load the brain metadata (here t=1) for our example.
metadata <- cbind.data.frame(read.table("test/metadata_brain.txt")[,3]);
colnames(metadata) <- "brain_labs"
dim(metadata)
Before applying the model, create the directories to save the rda file and the cluster plots.
if(!dir.exists("test/Structure")) dir.create("test/Structure")
if(!dir.exists("test/Structure")) dir.create("test/Structure")
If the user wants to fit clustering for 2,3,4
clus_vec <- 3;
StructureObj(brain_data, nclus_vec=clus_vec, samp_metadata = metadata, tol=0.1, batch_lab = NULL,
plot=TRUE, path_rda="test/topics_data.rda", path_struct="test/Structure")
clus_vec <- c(2,3,4);
StructureObj(brain_data, nclus_vec=clus_vec, samp_metadata = metadata, tol=0.1, batch_lab = NULL,
plot=FALSE, path_rda="test/topics_data.rda", path_struct="test/Structure")
If plot=TRUE, the above function will create the Structure plot visualization of the clusters, one for each of the cluster values selected and arrangement of bars due to each of the metadata columns.
This function will output two things:
The rda file contains the W (the topic proportion matrix) and T (topic distribution) files, so if you need to change the Structure plot, the user need not rerun the StructureObj() fit again and you can just load the W matrix in rda file into out StrutureObj_omega() function.
brain_structure <- get(load("test/topics_data.rda"));
StructureObj_omega(brain_structure[[1]]$omega, samp_metadata = brain_metadata,
batch_lab = NULL, path_struct="test/Structure",
control=list(cex.axis=1, struct.width=500, struct.height=500))
Then go to test/Structure and you will be able to see the Structure plot for K=3 with width and the height of the Structure plot modified.
The Structure plot generated looks like the following
To extract the variables that drive the clusters (cluster annotation)
theta <- brain_structure[[1]]$theta;
features <- ExtractTopFeatures(theta,top_features=50,method="poisson")
It will provide you with a list of top 50 variables/features per cluster that drive that cluster to be different from the rest.
0.0.1
The CountClust package is distributed under [GPL - General Public License (>= 2)]
For any queries, contact kkdey@uchicago.edu