Introduction to Bioinformatics

Answers should be returned in PDF format by 9.00am of Wednesday 15th of February to alexandra.lahtinen@helsinki.fi

Task 1: Clustering analysis

Download the table ALL_Expression_Data.csv from Moodle. The file containts Acute Lymphoblastic Leukemia (ALL) differentially expressed genes, obtained through microarray experiments. Load the matrix into R, report the number of rows and columns and transpose the matrix, so that the rows take the place of the columns and vice versa.

Perform hierarchical clustering of the columns to see if there are groups of similar columns. You can use the default euclidean distance when computing the distance metrics. Do two different clusterings using single and complete linkage. Visualize your clustering. You may use png() and dev.off() functions to save your figures automatically. How do clusterings obtained with different linkage differ from each other? (Hint: The package cluster contains several functions to do clustering, or you may use dist and hclust functions).
Make a heatmap plot of your data. Are the data clustered in the same way as in one of your previous clustering plots? What is visualized in addition? (Hint: function heatmap)
Download the table OV_TCGA_Mutation.csv from Moodle. This table contains binary values representing whether or not a certain gene is mutated in a certain patient. The table was obtained using the Ovarian Cancer TCGA data used in exercise set 1 and a subset of 30 patient was selected for this exercise. Perform hierarchical clustering of these binary data.

Task 1: Answers

# Set working directory
setwd("~/Desktop/IntroBioinfo2017/Exercise3/")

# Load data into R
expr <- read.table("ALL_Expression_Data.csv", sep = "\t", header = T,
                   stringsAsFactors = F, row.names = 1)

# Check data frame dimension
dim(expr)

## [1] 165  33

# Transpose the data frame
expr.t <- t(expr)

In both complete and single linkage clustering the main patient division is the same, but patients belonging to the same clusters are grouped differently depending on the clustering method. This imply that depending the clustering method should be choosen carefully and according to the data that you want to explore.

# Perform complete linkage clustering and save your plot
hc.complete <- hclust(dist(expr.t, method = "euclidean"),
                      method="complete")

plot(hc.complete, xlab="ALL patients", main = "Complete linkage clustering of differentially expressed genes", sub="")

# Perform single linkage clustering and save your plot
hc.single <- hclust(dist(expr.t, method = "euclidean"),
                    method="single")
plot(hc.single, xlab="ALL patients", main = "Single linkage clustering of differentially expressed genes", sub="")

The default clustering performed by the heatmap function is identical to the complete linkage method. This is because the default clustering method in the heatmap function is “complete” linkage and the default distance metrics is “euclidian”. You can change those parameter and perform other types of clustering. The advantage of using heatmaps is that you can plot the clustered data and visually inspect the reliabily of the identified clusters.

heatmap(as.matrix(expr), xlab = "ALL Patients", ylab="Genes", main = "Heatmap of differentially expressed genes")

Clustering binary data can be achieved by selecting the appropriate distance metrics. Any hierarchical clustering method can then be applied.

# Load data
mutation.data <- read.table("OV_TCGA_Mutation.csv", sep="\t", row.names = 1,
                            header = T, stringsAsFactors = F)

# Plot binary clustering
plot(hclust(dist(t(mutation.data), method = "binary")),
     main = "Binary clustering of mutation data",
     sub = "", xlab = "Ovarian Cancer TCGA Patients")

Task 2: Dimensionality reduction

Perform Principal Component Analysis (PCA) on the ALL expression dataset and plot the first two principal components. Do you see any similarity with the hierarchical clustering? (Hint: check the package pcaMethods in Bioconductor)
Now perform and visualize Multidimentional Scaling (MDS). Compare this plot with the one obtained through PCA. (Hint: Check the cmdscale function)

Task 2: Answers

In this situation, PCA is grouping samples in 2 main clusters. The same clusters are identified by the hierarchical clustering.

# Loading pcaMethods library
library(pcaMethods)

# Selecting colors based on the complete linkage clustering
groups <- data.frame(Pos = c(hc.complete$order[1:10], hc.complete$order[11:length(hc.complete$order)]),
                     Col = c(rep("darkred", 10), rep("darkgreen", length(hc.complete$order)-10)),
                     stringsAsFactors = F)
color <- groups$Col[order(groups$Pos)]

# PCA
par(oma=c(5,5,2,0), mar=c(1,1,1,1))
slplot(pca(expr), scoresLoadings=c(FALSE, TRUE), lcol = color, lcex=1.3, lann=F)
title("PCA", outer=T, xlab="PC1", ylab = "PC2")

The points have a different spatial location compared to the PCA plot, but again the same two main clusters are identified.

# MDS
mds <- cmdscale(dist(expr.t, method = "euclidean"),eig=TRUE, k=2)

x <- mds$points[,1]
y <- mds$points[,2]

plot(x, y, xlab="Coordinate 1", ylab="Coordinate 2", 
     main="MDS", type="n")
text(x, y, labels = row.names(expr.t), cex=1.3, col=color)

Task 3: Signature analysis

To complete this task you will need to read the paper by Alexandrov et al., “Signatures of mutational processes in human cancer”, Nature, 2013. You can find the paper in Moodle. In this task we are also using the NMF and maftools packages and the Ovarian Cancer TCGA mutation data we analyzed in exercise set 1. Read the allSamplesMaf.maf file you created in exercise set 1 and create a maf object. If you are running on a windows machine, download the precomputed trinucleotide matrix from here to your data folder, then use it for the following task. You can read the file into R using this command:

dataPath <- "~/Desktop/IntroBioinfo2017/Exercise3"

tnm <- read.table(file.path(dataPath, "tnm.csv"),header = T, stringsAsFactors = F, row.names = 1, check.names = F, sep= "\t")

If you want to extract the trinucleotide matrix yourself and you are working on linux machine, you can download the needed hg19 reference fasta file from here and its index from here to your data folder and then extract the files.

You will also need to use the TCGA barcodes of patients having either BRCA1 or BRCA2 mutations, which you have identified in exercise set 1.

brcaMutated <- c("TCGA-04-1331-01", "TCGA-04-1357-01", "TCGA-09-2050-01", "TCGA-13-0730-01", "TCGA-13-0804-01", "TCGA-13-0885-01", "TCGA-13-0890-01", "TCGA-13-1481-01", "TCGA-13-1489-01", "TCGA-23-1026-01", "TCGA-23-1030-01", "TCGA-24-1103-01", "TCGA-24-1555-01", "TCGA-24-2035-01", "TCGA-25-1625-01", "TCGA-25-1630-01", "TCGA-25-1632-01", "TCGA-29-2427-01", "TCGA-13-0726-01", "TCGA-24-1470-01")

Extract mutational signatures from mutation data, how many signatures do you find based on rank estimation and to which validated signatures are they most similar? (Hint: use the prefix = ‘chr’ and add = TRUE when obtaining trinucleotideMatrix if you are extracting it yourself).
Alexandrov et al. found two mutational signatures to be active in ovarian cancer, what is the proposed aetiology for each of these two signatures? If you found a different number of signatures in task (a), repeat the extraction with manually specified rank =2.
One of these two signatures has been associated with BRCA1 and BRCA2 mutations. Compare the contribution of that signature between BRCA1/2 mutated cancers and the rest. A boxplot is a good visualization option here. Is the difference statistically significant? (Hint: The contribution matrix has sample barcodes with “-” changed to “.”, you need to change them back so they are comparable to the sample barcodes in brcaMutated. If you downloaded the precomputed matrix, you don’t have to worry about this).

Task 3: Answers

The rank estimation as implemeted in mafTools is suggesting 3 signatures, and they are most similar to signatures 1, 5 and 3 in Alexandrov et al. Note that this can be different from run to run. It’s normal if you had slightly different results.

library(NMF)
library(maftools)

# If you have a linux computer you can run the following lines
# dataPath <- "~/Desktop/IntroBioinfo2017/Exercise3"
# maf <- read.maf(maf = "~/Desktop/IntroBioinfo2017/Exercise1/allSamplesMaf.maf",
#                 removeSilent = T, useAll = T , verbose = F)

# Let's get the trinucleotide matrix using the downloaded FASTA file
# tnm <- trinucleotideMatrix(maf=maf, ref_genome = file.path(dataPath,"ucsc.hg19.fasta"),
#                            prefix = 'chr', add = TRUE)

set.seed(1234)
sign <- extractSignatures(mat = tnm, nTry = 6)

## Estimating best rank..

##    method   seed rng metric rank sparseness.basis sparseness.coef      rss
## 1: brunet random   1     KL    2        0.3324409       0.3152792 13969.86
## 2: brunet random   1     KL    3        0.3720326       0.3364052 13684.64
## 3: brunet random   2     KL    4        0.4458480       0.3758906 13323.34
## 4: brunet random   3     KL    5        0.4571236       0.4168259 13081.66
## 5: brunet random   4     KL    6        0.5148980       0.4161911 12797.34
##         evar silhouette.coef silhouette.basis residuals niter    cpu
## 1: 0.4747579       1.0000000        1.0000000  12465.50  1030  8.576
## 2: 0.4854819       0.6882349        0.7247449  12089.89  2000  9.684
## 3: 0.4990661       0.5221031        0.6076350  11771.28  2000 11.324
## 4: 0.5081528       0.4433082        0.4436616  11463.45  1840 12.604
## 5: 0.5188427       0.3475752        0.4663420  11205.46  2000  8.160
##    cpu.all nrun cophenetic dispersion silhouette.consensus
## 1:  68.896   10  0.9939517  0.9208308            0.9684131
## 2:  69.764   10  0.8822765  0.5315823            0.5727027
## 3:  82.660   10  0.8164386  0.4946355            0.3306603
## 4:  85.344   10  0.8069262  0.5570654            0.2931540
## 5:  57.840   10  0.7294776  0.5744752            0.1960407

## Using 3 as a best-fit rank based on decreasing cophenetic correlation coefficient.

## Comparing against experimentally validated 30 signatures.. (See http://cancer.sanger.ac.uk/cosmic/signatures for details.)

## Found Signature_1 most similar to validated Signature_3. CoSine-Similarity: 0.81484188719597

## Found Signature_2 most similar to validated Signature_3. CoSine-Similarity: 0.779060293164051

## Found Signature_3 most similar to validated Signature_1. CoSine-Similarity: 0.849041847373269

# As seen from above, the big drop of cophenetic correlation is when increasing rank from 2 to 3. The mafTools implementation is not capturing this well and hence the second task

Alexandrov et al. reported two signatures in ovarian cancer: Signature 1 and Signature 3. Validated Signature 1 was associated with age at diagnosis and is a result of endogenous mutational process initiated by spontaneous deamination of 5-methylcytosine. Validated Signature 3 was associated with failure of DNA double-strand break-repair by homologous recombination.

Let’s run the extraction again with manually set rank as 2.

set.seed(1234)
sign2 <- extractSignatures(mat = tnm, n = 2)

## Comparing against experimentally validated 30 signatures.. (See http://cancer.sanger.ac.uk/cosmic/signatures for details.)

## Found Signature_1 most similar to validated Signature_3. CoSine-Similarity: 0.90626816098457

## Found Signature_2 most similar to validated Signature_1. CoSine-Similarity: 0.83836528166086

plotSignatures(sign2)

Cancers with mutated BRCA1/2 exhibit higher contribution of validated signature 3. Let’s check that.

contribution <- as.data.frame(t(sign2$contributions))
rownames(contribution) <- gsub("\\.","-",rownames(contribution))

contribution$brcaMutated <- ifelse(rownames(contribution)%in% brcaMutated, "Mutated", "WildType")

boxplot(contribution$Signature_1 ~ contribution$brcaMutated, ylab = "Signature_1 contribution")

# t-test is not a good statistical test since the contributions are not normally distributed. Mann-Whitney test  is more suitable since it's not parametric
wilcox.test(contribution$Signature_1 ~ contribution$brcaMutated)

## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  contribution$Signature_1 by contribution$brcaMutated
## W = 3827, p-value = 0.02844
## alternative hypothesis: true location shift is not equal to 0

Task 4: Article

Read the paper by Li et al. “Isolation and transcriptome analyses of human erythroid progenitors: BFU-E and CFU-E” 2014, Blood, and answer thefollowing questions.

What are authors’ main conclusions from PCA? Do you agree with them?
Figure 4 represents hierarchical clustering results. Does it agree with the PCA results? How many genes are there in the dendrogram based on the manuscript? How many genes are upregulated between CD34+ cell and BFU-E based on manuscript and supplementary table 1?
The authors performed SOM analysis. Is the analysis they did unbiased? How many clusters the SOM analysis resulted in? How did authors selected the ones shown in the manuscript? The authors say “Group 1 contained 430 genes and had dramatic downregulation at the CD34 to BFU-E transition.” Are there any other clusters that have similar “dramatic downregulation”? Check the GO results for cluster 13, cluster 9 and cluster 14. What are the differences and similarities?

Task 4: Answers

“the samples were reproducible and exhibit similar transcription profiles”.
Yes, the three replicates are almost identical for CD34, BFU-E and CFU-E while there is some variation (though not big) in Pro. Note, that the genes were selected via statistical analysis so this is not unbiased. There are 1739 genes in the analysis. Manuscript says there are 399 genes upregulated, supplementary table 339.
No, the analysis is not unbiased because they used DEGs. So, this is more visualization or organizing the data rather than clustering. The authors chose cluster 13, but cluster 9 and 14 are quite close. They should have used U-matrix and not consider the nodes as “clusters”. Cluster 9 has basically the same GO terms as 13. It is not at all clear how the authors choose the clusters they show in the ms.

Introduction to Bioinformatics - Exercise 3