Reading in data

Let’s try doing a simple heatmap clustering and network analysis example with connectivity data from HCP, it’s super easy in R. You can read in data with read.table or read.csv, and be sure to set the delimiter. If you have string data, you will want to add stringsAsFactors = False, otherwise you have to deal with factors (which can introduce errors unknowingly). Only use factors if you know what you want to do with them!

data = read.table("parcel_matrix.csv",sep=",",head=TRUE)
# dim is the standard command to see size
dim(data)
## [1] 286 286
# head is the standard command to "peek" at the data, but I'll select a smaller subset to show
# head(data)
data[1:10,1:10]
##       X1    X2    X3    X4    X5    X6    X7    X8    X9   X10
## 1    NaN 0.361 0.282 0.478 0.289 0.675 0.203 0.257 0.332 0.315
## 2  0.361   NaN 0.444 0.294 0.491 0.343 0.230 0.386 0.222 0.393
## 3  0.282 0.444   NaN 0.313 0.400 0.377 0.199 0.283 0.213 0.356
## 4  0.478 0.294 0.313   NaN 0.247 0.581 0.233 0.182 0.263 0.262
## 5  0.289 0.491 0.400 0.247   NaN 0.322 0.249 0.563 0.248 0.369
## 6  0.675 0.343 0.377 0.581 0.322   NaN 0.255 0.279 0.391 0.332
## 7  0.203 0.230 0.199 0.233 0.249 0.255   NaN 0.160 0.225 0.225
## 8  0.257 0.386 0.283 0.182 0.563 0.279 0.160   NaN 0.157 0.323
## 9  0.332 0.222 0.213 0.263 0.248 0.391 0.225 0.157   NaN 0.186
## 10 0.315 0.393 0.356 0.262 0.369 0.332 0.225 0.323 0.186   NaN

Let’s now read in the parcel data:

parcels = read.csv("parcels.csv",sep=",",head=TRUE,stringsAsFactors=FALSE)
parcels[1:10,]
##    ParcelID Hem Surface.area..mm2.      Community ID
## 1         1   L          1411.6436        Default  4
## 2         2   L           377.2543         SMhand 10
## 3         3   L           437.1466        SMmouth 11
## 4         4   L           368.6362        Default  4
## 5         5   L           870.9255         Visual 13
## 6         6   L          1083.6667        Default  4
## 7         7   L            91.3224 FrontoParietal  6
## 8         8   L           486.7238         Visual 13
## 9         9   L            95.0010 FrontoParietal  6
## 10       10   L           179.8312       Auditory  1

Make some labels

Let’s use the IDs to generate a label that has Hemisphere.Community.ID

labels = paste(parcels$Hem,parcels$Community,parcels$ParcelID,sep=".")
labels[1:10]
##  [1] "L.Default.1"        "L.SMhand.2"         "L.SMmouth.3"       
##  [4] "L.Default.4"        "L.Visual.5"         "L.Default.6"       
##  [7] "L.FrontoParietal.7" "L.Visual.8"         "L.FrontoParietal.9"
## [10] "L.Auditory.10"

Set the rownames and column names to be the labels, based on the parcel id. It’s probably a 1..333 order, but it’s dangerous to assume it’s always like that. R also adds “X” to numerical column names, so we will get rid of that.

colnames(data) = gsub("X","",colnames(data))
rowidx = match(rownames(data),parcels$ParcelID) 
colidx = match(colnames(data),parcels$ParcelID)
colnames(data) = labels[colidx]
rownames(data) = labels[rowidx]
data[1:10,1:10]
##                    L.Default.1 L.SMhand.2 L.SMmouth.3 L.Default.4
## L.Default.1                NaN      0.361       0.282       0.478
## L.SMhand.2               0.361        NaN       0.444       0.294
## L.SMmouth.3              0.282      0.444         NaN       0.313
## L.Default.4              0.478      0.294       0.313         NaN
## L.Visual.5               0.289      0.491       0.400       0.247
## L.Default.6              0.675      0.343       0.377       0.581
## L.FrontoParietal.7       0.203      0.230       0.199       0.233
## L.Visual.8               0.257      0.386       0.283       0.182
## L.FrontoParietal.9       0.332      0.222       0.213       0.263
## L.Auditory.10            0.315      0.393       0.356       0.262
##                    L.Visual.5 L.Default.6 L.FrontoParietal.7 L.Visual.8
## L.Default.1             0.289       0.675              0.203      0.257
## L.SMhand.2              0.491       0.343              0.230      0.386
## L.SMmouth.3             0.400       0.377              0.199      0.283
## L.Default.4             0.247       0.581              0.233      0.182
## L.Visual.5                NaN       0.322              0.249      0.563
## L.Default.6             0.322         NaN              0.255      0.279
## L.FrontoParietal.7      0.249       0.255                NaN      0.160
## L.Visual.8              0.563       0.279              0.160        NaN
## L.FrontoParietal.9      0.248       0.391              0.225      0.157
## L.Auditory.10           0.369       0.332              0.225      0.323
##                    L.FrontoParietal.9 L.Auditory.10
## L.Default.1                     0.332         0.315
## L.SMhand.2                      0.222         0.393
## L.SMmouth.3                     0.213         0.356
## L.Default.4                     0.263         0.262
## L.Visual.5                      0.248         0.369
## L.Default.6                     0.391         0.332
## L.FrontoParietal.7              0.225         0.225
## L.Visual.8                      0.157         0.323
## L.FrontoParietal.9                NaN         0.186
## L.Auditory.10                   0.186           NaN

Do the heatmap!

Now let’s do a basic heatmap clustering, and we will cluster both rows and columns. I like to use pheatmap because it’s not ugly.

library(pheatmap) #install with install.packages("pheatmap")
pheatmap(data,main="Heatmap clustering of HCP rsfMRI Connectivity")

And of course we want to get the groups from that. How does pheatmap work again?

?pheatmap

Take a look at the different input parameters - we can specify how we want to do the clustering (hierarchical or kmeans and specify k) - it will accept the same arguments as the R standard function “hclust” so type ?hclust to see what those are), as well as the similarity metric (e.g., euclidean, correlation) and you can also choose to cluster the rows or columns (or not). For example, here is the matrix as is, without any clustering:

pheatmap(data,cluster_rows=FALSE,cluster_cols=FALSE,main="HCP Matrix, no clustering")

This would be worthwhile to also do with the raw data - clustering a matrix of correlations is (I think) finding groups that have similar correlation relationships, and that is (possibly?) different from a clustering starting with raw timeseries. This strategy may be totally inappropriate for that.

Make it pretty

pheatmap is > heatmap or heatmap2 because it’s so darn pretty. If you use either of the latter you only get an ok looking plot if it’s relatively small. ggplot2 can also make nice heatmaps, but it’s SO much harder to use. Here are some params to be aware of:

  • look at “annotation_colors” if you want custom coloring of the trees
  • look at show_row/colnames if you want to get rid of those
  • look at fontsize_* to customize fonts
  • filename,width,height to get it directly to a nice picture! :)

Getting the result

mappy = pheatmap(data,main="Heatmap clustering of HCP rsfMRI Connectivity")

# Here is the row ordering after clustering
mappy$tree_row # same for mappy$tree_col
## 
## Call:
## hclust(d = d, method = method)
## 
## Cluster method   : complete 
## Distance         : euclidean 
## Number of objects: 286

And to look at what you can extract from that, do: mappy\(tree_row\)[PRESS TAB TO AUTOCOMPLETE]. Then to save stuffs, you can just do:

save(mappy,file="mydata.Rda")
# and then to load
load("mydata.Rda")

Boom! Super easy! You should also check out the package “sparcl” for sparse (hierarhical and kmeans) clustering (eg, elastic net in an unsupervised context). Now let’s do a quick network analysis. We COULD do this:

library(qgraph) # install.packages("qgraph")
# We COULD do this...
# qg = qgraph(data, layout="spring",
#            label.cex=0.5, labels=colnames(data), 
#            label.scale=FALSE,
#            title="HCP Network!",
#            posCol="green",negCol="red",
#            color = "purple")

but that might not be smart because we need to threshold our data first! It will probably take forever to run and produce a stinky green cloud.

# make thresholded matrix
cat("How many values are not zero before?\n")
## How many values are not zero before?
length(which(data!=0))
## [1] 81510
thresh = data
thresh[thresh<.5] = 0

cat("How many values are not zero after?\n")
## How many values are not zero after?
length(which(thresh!=0))
## [1] 1432
qg = qgraph(thresh, layout="spring",
            label.cex=0.5, labels=colnames(thresh), 
            label.scale=FALSE,
            title="HCP Network Thresholded at 0.5!",
            posCol="green",negCol="red",
            color = "purple")

And take a look at the output to get information about the clustering, specifically, the connections:

# qg$Edgelist
# qg$graphAttributes$[PRESS TAB]

Remember that any kind of data you produce in R that is static, we can easily export the data and make into something interactive. You can also try shiny, or figure out an analysis / algorithm you like in R, and implement it in python (because I sometimes just like python better.)