Let’s try doing a simple heatmap clustering and network analysis example with connectivity data from HCP, it’s super easy in R. You can read in data with read.table or read.csv, and be sure to set the delimiter. If you have string data, you will want to add stringsAsFactors = False, otherwise you have to deal with factors (which can introduce errors unknowingly). Only use factors if you know what you want to do with them!
data = read.table("parcel_matrix.csv",sep=",",head=TRUE)
# dim is the standard command to see size
dim(data)
## [1] 286 286
# head is the standard command to "peek" at the data, but I'll select a smaller subset to show
# head(data)
data[1:10,1:10]
## X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
## 1 NaN 0.361 0.282 0.478 0.289 0.675 0.203 0.257 0.332 0.315
## 2 0.361 NaN 0.444 0.294 0.491 0.343 0.230 0.386 0.222 0.393
## 3 0.282 0.444 NaN 0.313 0.400 0.377 0.199 0.283 0.213 0.356
## 4 0.478 0.294 0.313 NaN 0.247 0.581 0.233 0.182 0.263 0.262
## 5 0.289 0.491 0.400 0.247 NaN 0.322 0.249 0.563 0.248 0.369
## 6 0.675 0.343 0.377 0.581 0.322 NaN 0.255 0.279 0.391 0.332
## 7 0.203 0.230 0.199 0.233 0.249 0.255 NaN 0.160 0.225 0.225
## 8 0.257 0.386 0.283 0.182 0.563 0.279 0.160 NaN 0.157 0.323
## 9 0.332 0.222 0.213 0.263 0.248 0.391 0.225 0.157 NaN 0.186
## 10 0.315 0.393 0.356 0.262 0.369 0.332 0.225 0.323 0.186 NaN
Let’s now read in the parcel data:
parcels = read.csv("parcels.csv",sep=",",head=TRUE,stringsAsFactors=FALSE)
parcels[1:10,]
## ParcelID Hem Surface.area..mm2. Community ID
## 1 1 L 1411.6436 Default 4
## 2 2 L 377.2543 SMhand 10
## 3 3 L 437.1466 SMmouth 11
## 4 4 L 368.6362 Default 4
## 5 5 L 870.9255 Visual 13
## 6 6 L 1083.6667 Default 4
## 7 7 L 91.3224 FrontoParietal 6
## 8 8 L 486.7238 Visual 13
## 9 9 L 95.0010 FrontoParietal 6
## 10 10 L 179.8312 Auditory 1
Let’s use the IDs to generate a label that has Hemisphere.Community.ID
labels = paste(parcels$Hem,parcels$Community,parcels$ParcelID,sep=".")
labels[1:10]
## [1] "L.Default.1" "L.SMhand.2" "L.SMmouth.3"
## [4] "L.Default.4" "L.Visual.5" "L.Default.6"
## [7] "L.FrontoParietal.7" "L.Visual.8" "L.FrontoParietal.9"
## [10] "L.Auditory.10"
Set the rownames and column names to be the labels, based on the parcel id. It’s probably a 1..333 order, but it’s dangerous to assume it’s always like that. R also adds “X” to numerical column names, so we will get rid of that.
colnames(data) = gsub("X","",colnames(data))
rowidx = match(rownames(data),parcels$ParcelID)
colidx = match(colnames(data),parcels$ParcelID)
colnames(data) = labels[colidx]
rownames(data) = labels[rowidx]
data[1:10,1:10]
## L.Default.1 L.SMhand.2 L.SMmouth.3 L.Default.4
## L.Default.1 NaN 0.361 0.282 0.478
## L.SMhand.2 0.361 NaN 0.444 0.294
## L.SMmouth.3 0.282 0.444 NaN 0.313
## L.Default.4 0.478 0.294 0.313 NaN
## L.Visual.5 0.289 0.491 0.400 0.247
## L.Default.6 0.675 0.343 0.377 0.581
## L.FrontoParietal.7 0.203 0.230 0.199 0.233
## L.Visual.8 0.257 0.386 0.283 0.182
## L.FrontoParietal.9 0.332 0.222 0.213 0.263
## L.Auditory.10 0.315 0.393 0.356 0.262
## L.Visual.5 L.Default.6 L.FrontoParietal.7 L.Visual.8
## L.Default.1 0.289 0.675 0.203 0.257
## L.SMhand.2 0.491 0.343 0.230 0.386
## L.SMmouth.3 0.400 0.377 0.199 0.283
## L.Default.4 0.247 0.581 0.233 0.182
## L.Visual.5 NaN 0.322 0.249 0.563
## L.Default.6 0.322 NaN 0.255 0.279
## L.FrontoParietal.7 0.249 0.255 NaN 0.160
## L.Visual.8 0.563 0.279 0.160 NaN
## L.FrontoParietal.9 0.248 0.391 0.225 0.157
## L.Auditory.10 0.369 0.332 0.225 0.323
## L.FrontoParietal.9 L.Auditory.10
## L.Default.1 0.332 0.315
## L.SMhand.2 0.222 0.393
## L.SMmouth.3 0.213 0.356
## L.Default.4 0.263 0.262
## L.Visual.5 0.248 0.369
## L.Default.6 0.391 0.332
## L.FrontoParietal.7 0.225 0.225
## L.Visual.8 0.157 0.323
## L.FrontoParietal.9 NaN 0.186
## L.Auditory.10 0.186 NaN
Now let’s do a basic heatmap clustering, and we will cluster both rows and columns. I like to use pheatmap because it’s not ugly.
library(pheatmap) #install with install.packages("pheatmap")
pheatmap(data,main="Heatmap clustering of HCP rsfMRI Connectivity")
And of course we want to get the groups from that. How does pheatmap work again?
?pheatmap
Take a look at the different input parameters - we can specify how we want to do the clustering (hierarchical or kmeans and specify k) - it will accept the same arguments as the R standard function “hclust” so type ?hclust to see what those are), as well as the similarity metric (e.g., euclidean, correlation) and you can also choose to cluster the rows or columns (or not). For example, here is the matrix as is, without any clustering:
pheatmap(data,cluster_rows=FALSE,cluster_cols=FALSE,main="HCP Matrix, no clustering")
This would be worthwhile to also do with the raw data - clustering a matrix of correlations is (I think) finding groups that have similar correlation relationships, and that is (possibly?) different from a clustering starting with raw timeseries. This strategy may be totally inappropriate for that.
pheatmap is > heatmap or heatmap2 because it’s so darn pretty. If you use either of the latter you only get an ok looking plot if it’s relatively small. ggplot2 can also make nice heatmaps, but it’s SO much harder to use. Here are some params to be aware of:
mappy = pheatmap(data,main="Heatmap clustering of HCP rsfMRI Connectivity")
# Here is the row ordering after clustering
mappy$tree_row # same for mappy$tree_col
##
## Call:
## hclust(d = d, method = method)
##
## Cluster method : complete
## Distance : euclidean
## Number of objects: 286
And to look at what you can extract from that, do: mappy\(tree_row\)[PRESS TAB TO AUTOCOMPLETE]. Then to save stuffs, you can just do:
save(mappy,file="mydata.Rda")
# and then to load
load("mydata.Rda")
Boom! Super easy! You should also check out the package “sparcl” for sparse (hierarhical and kmeans) clustering (eg, elastic net in an unsupervised context). Now let’s do a quick network analysis. We COULD do this:
library(qgraph) # install.packages("qgraph")
# We COULD do this...
# qg = qgraph(data, layout="spring",
# label.cex=0.5, labels=colnames(data),
# label.scale=FALSE,
# title="HCP Network!",
# posCol="green",negCol="red",
# color = "purple")
but that might not be smart because we need to threshold our data first! It will probably take forever to run and produce a stinky green cloud.
# make thresholded matrix
cat("How many values are not zero before?\n")
## How many values are not zero before?
length(which(data!=0))
## [1] 81510
thresh = data
thresh[thresh<.5] = 0
cat("How many values are not zero after?\n")
## How many values are not zero after?
length(which(thresh!=0))
## [1] 1432
qg = qgraph(thresh, layout="spring",
label.cex=0.5, labels=colnames(thresh),
label.scale=FALSE,
title="HCP Network Thresholded at 0.5!",
posCol="green",negCol="red",
color = "purple")
And take a look at the output to get information about the clustering, specifically, the connections:
# qg$Edgelist
# qg$graphAttributes$[PRESS TAB]
Remember that any kind of data you produce in R that is static, we can easily export the data and make into something interactive. You can also try shiny, or figure out an analysis / algorithm you like in R, and implement it in python (because I sometimes just like python better.)