Answers should be returned in PDF format by 9.00am of Wednesday 15th of February to alexandra.lahtinen@helsinki.fi
Download the table ALL_Expression_Data.csv from Moodle. The file containts Acute Lymphoblastic Leukemia (ALL) differentially expressed genes, obtained through microarray experiments. Load the matrix into R, report the number of rows and columns and transpose the matrix, so that the rows take the place of the columns and vice versa.
Perform hierarchical clustering of the columns to see if there are groups of similar columns. You can use the default euclidean distance when computing the distance metrics. Do two different clusterings using single and complete linkage. Visualize your clustering. You may use png() and dev.off() functions to save your figures automatically. How do clusterings obtained with different linkage differ from each other? (Hint: The package cluster contains several functions to do clustering, or you may use dist and hclust functions).
Make a heatmap plot of your data. Are the data clustered in the same way as in one of your previous clustering plots? What is visualized in addition? (Hint: function heatmap)
Download the table OV_TCGA_Mutation.csv from Moodle. This table contains binary values representing whether or not a certain gene is mutated in a certain patient. The table was obtained using the Ovarian Cancer TCGA data used in exercise set 1 and a subset of 30 patient was selected for this exercise. Perform hierarchical clustering of these binary data.
# Set working directory
setwd("~/Desktop/IntroBioinfo2017/Exercise3/")
# Load data into R
expr <- read.table("ALL_Expression_Data.csv", sep = "\t", header = T,
stringsAsFactors = F, row.names = 1)
# Check data frame dimension
dim(expr)
## [1] 165 33
# Transpose the data frame
expr.t <- t(expr)
# Perform complete linkage clustering and save your plot
hc.complete <- hclust(dist(expr.t, method = "euclidean"),
method="complete")
plot(hc.complete, xlab="ALL patients", main = "Complete linkage clustering of differentially expressed genes", sub="")
# Perform single linkage clustering and save your plot
hc.single <- hclust(dist(expr.t, method = "euclidean"),
method="single")
plot(hc.single, xlab="ALL patients", main = "Single linkage clustering of differentially expressed genes", sub="")
heatmap(as.matrix(expr), xlab = "ALL Patients", ylab="Genes", main = "Heatmap of differentially expressed genes")
# Load data
mutation.data <- read.table("OV_TCGA_Mutation.csv", sep="\t", row.names = 1,
header = T, stringsAsFactors = F)
# Plot binary clustering
plot(hclust(dist(t(mutation.data), method = "binary")),
main = "Binary clustering of mutation data",
sub = "", xlab = "Ovarian Cancer TCGA Patients")
Perform Principal Component Analysis (PCA) on the ALL expression dataset and plot the first two principal components. Do you see any similarity with the hierarchical clustering? (Hint: check the package pcaMethods in Bioconductor)
Now perform and visualize Multidimentional Scaling (MDS). Compare this plot with the one obtained through PCA. (Hint: Check the cmdscale function)
# Loading pcaMethods library
library(pcaMethods)
# Selecting colors based on the complete linkage clustering
groups <- data.frame(Pos = c(hc.complete$order[1:10], hc.complete$order[11:length(hc.complete$order)]),
Col = c(rep("darkred", 10), rep("darkgreen", length(hc.complete$order)-10)),
stringsAsFactors = F)
color <- groups$Col[order(groups$Pos)]
# PCA
par(oma=c(5,5,2,0), mar=c(1,1,1,1))
slplot(pca(expr), scoresLoadings=c(FALSE, TRUE), lcol = color, lcex=1.3, lann=F)
title("PCA", outer=T, xlab="PC1", ylab = "PC2")
# MDS
mds <- cmdscale(dist(expr.t, method = "euclidean"),eig=TRUE, k=2)
x <- mds$points[,1]
y <- mds$points[,2]
plot(x, y, xlab="Coordinate 1", ylab="Coordinate 2",
main="MDS", type="n")
text(x, y, labels = row.names(expr.t), cex=1.3, col=color)
To complete this task you will need to read the paper by Alexandrov et al., “Signatures of mutational processes in human cancer”, Nature, 2013. You can find the paper in Moodle. In this task we are also using the NMF and maftools packages and the Ovarian Cancer TCGA mutation data we analyzed in exercise set 1. Read the allSamplesMaf.maf file you created in exercise set 1 and create a maf object. If you are running on a windows machine, download the precomputed trinucleotide matrix from here to your data folder, then use it for the following task. You can read the file into R using this command:
dataPath <- "~/Desktop/IntroBioinfo2017/Exercise3"
tnm <- read.table(file.path(dataPath, "tnm.csv"),header = T, stringsAsFactors = F, row.names = 1, check.names = F, sep= "\t")
If you want to extract the trinucleotide matrix yourself and you are working on linux machine, you can download the needed hg19 reference fasta file from here and its index from here to your data folder and then extract the files.
You will also need to use the TCGA barcodes of patients having either BRCA1 or BRCA2 mutations, which you have identified in exercise set 1.
brcaMutated <- c("TCGA-04-1331-01", "TCGA-04-1357-01", "TCGA-09-2050-01", "TCGA-13-0730-01", "TCGA-13-0804-01", "TCGA-13-0885-01", "TCGA-13-0890-01", "TCGA-13-1481-01", "TCGA-13-1489-01", "TCGA-23-1026-01", "TCGA-23-1030-01", "TCGA-24-1103-01", "TCGA-24-1555-01", "TCGA-24-2035-01", "TCGA-25-1625-01", "TCGA-25-1630-01", "TCGA-25-1632-01", "TCGA-29-2427-01", "TCGA-13-0726-01", "TCGA-24-1470-01")
Extract mutational signatures from mutation data, how many signatures do you find based on rank estimation and to which validated signatures are they most similar? (Hint: use the prefix = ‘chr’ and add = TRUE when obtaining trinucleotideMatrix if you are extracting it yourself).
Alexandrov et al. found two mutational signatures to be active in ovarian cancer, what is the proposed aetiology for each of these two signatures? If you found a different number of signatures in task (a), repeat the extraction with manually specified rank =2.
One of these two signatures has been associated with BRCA1 and BRCA2 mutations. Compare the contribution of that signature between BRCA1/2 mutated cancers and the rest. A boxplot is a good visualization option here. Is the difference statistically significant? (Hint: The contribution matrix has sample barcodes with “-” changed to “.”, you need to change them back so they are comparable to the sample barcodes in brcaMutated. If you downloaded the precomputed matrix, you don’t have to worry about this).
library(NMF)
library(maftools)
# If you have a linux computer you can run the following lines
# dataPath <- "~/Desktop/IntroBioinfo2017/Exercise3"
# maf <- read.maf(maf = "~/Desktop/IntroBioinfo2017/Exercise1/allSamplesMaf.maf",
# removeSilent = T, useAll = T , verbose = F)
# Let's get the trinucleotide matrix using the downloaded FASTA file
# tnm <- trinucleotideMatrix(maf=maf, ref_genome = file.path(dataPath,"ucsc.hg19.fasta"),
# prefix = 'chr', add = TRUE)
set.seed(1234)
sign <- extractSignatures(mat = tnm, nTry = 6)
## Estimating best rank..
## method seed rng metric rank sparseness.basis sparseness.coef rss
## 1: brunet random 1 KL 2 0.3324409 0.3152792 13969.86
## 2: brunet random 1 KL 3 0.3720326 0.3364052 13684.64
## 3: brunet random 2 KL 4 0.4458480 0.3758906 13323.34
## 4: brunet random 3 KL 5 0.4571236 0.4168259 13081.66
## 5: brunet random 4 KL 6 0.5148980 0.4161911 12797.34
## evar silhouette.coef silhouette.basis residuals niter cpu
## 1: 0.4747579 1.0000000 1.0000000 12465.50 1030 8.576
## 2: 0.4854819 0.6882349 0.7247449 12089.89 2000 9.684
## 3: 0.4990661 0.5221031 0.6076350 11771.28 2000 11.324
## 4: 0.5081528 0.4433082 0.4436616 11463.45 1840 12.604
## 5: 0.5188427 0.3475752 0.4663420 11205.46 2000 8.160
## cpu.all nrun cophenetic dispersion silhouette.consensus
## 1: 68.896 10 0.9939517 0.9208308 0.9684131
## 2: 69.764 10 0.8822765 0.5315823 0.5727027
## 3: 82.660 10 0.8164386 0.4946355 0.3306603
## 4: 85.344 10 0.8069262 0.5570654 0.2931540
## 5: 57.840 10 0.7294776 0.5744752 0.1960407
## Using 3 as a best-fit rank based on decreasing cophenetic correlation coefficient.
## Comparing against experimentally validated 30 signatures.. (See http://cancer.sanger.ac.uk/cosmic/signatures for details.)
## Found Signature_1 most similar to validated Signature_3. CoSine-Similarity: 0.81484188719597
## Found Signature_2 most similar to validated Signature_3. CoSine-Similarity: 0.779060293164051
## Found Signature_3 most similar to validated Signature_1. CoSine-Similarity: 0.849041847373269
# As seen from above, the big drop of cophenetic correlation is when increasing rank from 2 to 3. The mafTools implementation is not capturing this well and hence the second task
Let’s run the extraction again with manually set rank as 2.
set.seed(1234)
sign2 <- extractSignatures(mat = tnm, n = 2)
## Comparing against experimentally validated 30 signatures.. (See http://cancer.sanger.ac.uk/cosmic/signatures for details.)
## Found Signature_1 most similar to validated Signature_3. CoSine-Similarity: 0.90626816098457
## Found Signature_2 most similar to validated Signature_1. CoSine-Similarity: 0.83836528166086
plotSignatures(sign2)
contribution <- as.data.frame(t(sign2$contributions))
rownames(contribution) <- gsub("\\.","-",rownames(contribution))
contribution$brcaMutated <- ifelse(rownames(contribution)%in% brcaMutated, "Mutated", "WildType")
boxplot(contribution$Signature_1 ~ contribution$brcaMutated, ylab = "Signature_1 contribution")
# t-test is not a good statistical test since the contributions are not normally distributed. Mann-Whitney test is more suitable since it's not parametric
wilcox.test(contribution$Signature_1 ~ contribution$brcaMutated)
##
## Wilcoxon rank sum test with continuity correction
##
## data: contribution$Signature_1 by contribution$brcaMutated
## W = 3827, p-value = 0.02844
## alternative hypothesis: true location shift is not equal to 0
Read the paper by Li et al. “Isolation and transcriptome analyses of human erythroid progenitors: BFU-E and CFU-E” 2014, Blood, and answer thefollowing questions.
What are authors’ main conclusions from PCA? Do you agree with them?
Figure 4 represents hierarchical clustering results. Does it agree with the PCA results? How many genes are there in the dendrogram based on the manuscript? How many genes are upregulated between CD34+ cell and BFU-E based on manuscript and supplementary table 1?
The authors performed SOM analysis. Is the analysis they did unbiased? How many clusters the SOM analysis resulted in? How did authors selected the ones shown in the manuscript? The authors say “Group 1 contained 430 genes and had dramatic downregulation at the CD34 to BFU-E transition.” Are there any other clusters that have similar “dramatic downregulation”? Check the GO results for cluster 13, cluster 9 and cluster 14. What are the differences and similarities?
“the samples were reproducible and exhibit similar transcription profiles”.
Yes, the three replicates are almost identical for CD34, BFU-E and CFU-E while there is some variation (though not big) in Pro. Note, that the genes were selected via statistical analysis so this is not unbiased. There are 1739 genes in the analysis. Manuscript says there are 399 genes upregulated, supplementary table 339.
No, the analysis is not unbiased because they used DEGs. So, this is more visualization or organizing the data rather than clustering. The authors chose cluster 13, but cluster 9 and 14 are quite close. They should have used U-matrix and not consider the nodes as “clusters”. Cluster 9 has basically the same GO terms as 13. It is not at all clear how the authors choose the clusters they show in the ms.