Introduction to Bioinformatics

Answers should be returned in PDF format by 9.00am of Wednesday 15th of February to alexandra.lahtinen@helsinki.fi

Task 1: Clustering analysis

Download the table ALL_Expression_Data.csv from Moodle. The file containts Acute Lymphoblastic Leukemia (ALL) differentially expressed genes, obtained through microarray experiments. Load the matrix into R, report the number of rows and columns and transpose the matrix, so that the rows take the place of the columns and vice versa.

Perform hierarchical clustering of the columns to see if there are groups of similar columns. You can use the default euclidean distance when computing the distance metrics. Do two different clusterings using single and complete linkage. Visualize your clustering. You may use png() and dev.off() functions to save your figures automatically. How do clusterings obtained with different linkage differ from each other? (Hint: The package cluster contains several functions to do clustering, or you may use dist and hclust functions).
Make a heatmap plot of your data. Are the data clustered in the same way as in one of your previous clustering plots? What is visualized in addition? (Hint: function heatmap)
Download the table OV_TCGA_Mutation.csv from Moodle. This table contains binary values representing whether or not a certain gene is mutated in a certain patient. The table was obtained using the Ovarian Cancer TCGA data used in exercise set 1 and a subset of 30 patient was selected for this exercise. Perform hierarchical clustering of these binary data.

Task 1: Answers

# Set working directory
setwd("~/Desktop/IntroBioinfo2017/Exercise3/")

# Load data into R
expr <- read.table("ALL_Expression_Data.csv", sep = "\t", header = T,
                   stringsAsFactors = F, row.names = 1)

# Check data frame dimension
dim(expr)

## [1] 165  33

# Transpose the data frame
expr.t <- t(expr)

In both complete and single linkage clustering the main patient division is the same, but patients belonging to the same clusters are grouped differently depending on the clustering method. This imply that depending the clustering method should be choosen carefully and according to the data that you want to explore.

# Perform complete linkage clustering and save your plot
hc.complete <- hclust(dist(expr.t, method = "euclidean"),
                      method="complete")

plot(hc.complete, xlab="ALL patients", main = "Complete linkage clustering of differentially expressed genes", sub="")

# Perform single linkage clustering and save your plot
hc.single <- hclust(dist(expr.t, method = "euclidean"),
                    method="single")
plot(hc.single, xlab="ALL patients", main = "Single linkage clustering of differentially expressed genes", sub="")

The default clustering performed by the heatmap function is identical to the complete linkage method. This is because the default clustering method in the heatmap function is “complete” linkage and the default distance metrics is “euclidian”. You can change those parameter and perform other types of clustering. The advantage of using heatmaps is that you can plot the clustered data and visually inspect the reliabily of the identified clusters.

heatmap(as.matrix(expr), xlab = "ALL Patients", ylab="Genes", main = "Heatmap of differentially expressed genes")

Clustering binary data can be achieved by selecting the appropriate distance metrics. Any hierarchical clustering method can then be applied.

# Load data
mutation.data <- read.table("OV_TCGA_Mutation.csv", sep="\t", row.names = 1,
                            header = T, stringsAsFactors = F)

# Plot binary clustering
plot(hclust(dist(t(mutation.data), method = "binary")),
     main = "Binary clustering of mutation data",
     sub = "", xlab = "Ovarian Cancer TCGA Patients")

Task 2: Dimensionality reduction

Perform Principal Component Analysis (PCA) on the ALL expression dataset and plot the first two principal components. Do you see any similarity with the hierarchical clustering? (Hint: check the package pcaMethods in Bioconductor)
Now perform and visualize Multidimentional Scaling (MDS). Compare this plot with the one obtained through PCA. (Hint: Check the cmdscale function)

Task 2: Answers

In this situation, PCA is grouping samples in 2 main clusters. The same clusters are identified by the hierarchical clustering.

# Loading pcaMethods library
library(pcaMethods)

# Selecting colors based on the complete linkage clustering
groups <- data.frame(Pos = c(hc.complete$order[1:10], hc.complete$order[11:length(hc.complete$order)]),
                     Col = c(rep("darkred", 10), rep("darkgreen", length(hc.complete$order)-10)),
                     stringsAsFactors = F)
color <- groups$Col[order(groups$Pos)]

# PCA
par(oma=c(5,5,2,0), mar=c(1,1,1,1))
slplot(pca(expr), scoresLoadings=c(FALSE, TRUE), lcol = color, lcex=1.3, lann=F)
title("PCA", outer=T, xlab="PC1", ylab = "PC2")

The points have a different spatial location compared to the PCA plot, but again the same two main clusters are identified.

# MDS
mds <- cmdscale(dist(expr.t, method = "euclidean"),eig=TRUE, k=2)

x <- mds$points[,1]
y <- mds$points[,2]

plot(x, y, xlab="Coordinate 1", ylab="Coordinate 2", 
     main="MDS", type="n")
text(x, y, labels = row.names(expr.t), cex=1.3, col=color)

Task 3: Signature analysis

If you are running on a windows machine, download the precomputed trinucleotide matrix from here to your data folder, then use it for the following task. You can read the file into R using this command:

# tnm <- read.table(paste0(dataPath, "tnm.csv"),header = T, stringsAsFactors = F, row.names = 1, check.names = F, sep= "\t")

If you want to extract the trinucleotide matrix yourself and you are working on linux machine, you can download the needed hg19 reference fasta file from here and its index from here to your data folder and then extract the files.

Extract mutational signatures from mutation data, how many signatures do you find based on rank estimation and to which validated signatures are they most similar? Hint: use the prefix = ‘chr’ and add = TRUE when obtaining trinucleotideMatrix if you are extracting it yourself
Compare with the number of signatures found by Alexandrov et al in ovarian cancer, if you have different number of signatures, run the analysis again with manually selected number of signatures based on Alexandrov et al. Plot the signatures.
What is the proposed aetiology for these signatures based on Alexandrov et al., “Signatures of mutational processes in human cancer.” Nature 500.7463 (2013): 415-421?
Compare signature contribution of the relevant signature between BRCA1/2 mutated cancers and the rest, a boxplot is a good visualization option here. Is the difference statistically significant? Hint: The contribution matrix has sample barcodes with “-” changed to “.”, you need to change them back so they are comparable to the sample barcodes in brcaMutated. If you downloaded the precomputed matrix, you don’t have to worry about this.

Task 3: Answers

The rank estimation as implemeted in mafTools is suggesting 3 signatures, and they are most similar to signatures 1, 5 and 3 in Alexandrov et al. Note that this can be different from run to run. It’s normal if you had slightly different results.

# Let's get the trinucleotide matrix using the downloaded FASTA file
# tnm <- trinucleotideMatrix(maf=maf, ref_genome = paste0(dataPath,"ucsc.hg19.fasta"),
#                            prefix = 'chr', add = TRUE)
# set.seed(1234)
# sign <- extractSignatures(mat = tnm, nTry = 6)

# As seen from above, the big drop of cophenetic correlation is when increasing rank from 2 to 3. The mafTools implementation is not capturing this well and hence the second task

Alexandrov et al reported two signatures in ovarian cancer: Signature 1 and Signature 3. Let’s run the extraction again with manually set rank as 2.

# set.seed(1234)
# sign2 <- extractSignatures(mat = tnm, n = 2)
# plotSignatures(sign2)

Validated Signature 1 was associated with age at diagnosis and is a result of endogenous mutational process initiated by spontaneous deamination of 5-methylcytosine. Validated Signature 3 was associated with failure of DNA double-strand break-repair by homologous recombination.
Cancers with mutated BRCA1/2 exhibit higher contribution of validated signature 3. Let’s check that.

# contribution <- as.data.frame(t(sign2$contributions))
# rownames(contribution) <- gsub("\\.","-",rownames(contribution))
# 
# contribution$brcaMutated <- ifelse(rownames(contribution)%in% brcaMutated, "Mutated", "WildType")
# 
# boxplot(contribution$Signature_1 ~ contribution$brcaMutated, ylab = "Signature_1 contribution")
# 
# # t-test is not a good statistical test since the contributions are not normally distributed. Mann-Whitney test is more suitable since it's not parametric
# wilcox.test(contribution$Signature_1 ~ contribution$brcaMutated)

Task 4: Article

Read the paper by Li et al. “Isolation and transcriptome analyses of human erythroid progenitors: BFU-E and CFU-E” 2014, Blood, and answer thefollowing questions.

What are authors’ main conclusions from PCA? Do you agree with them?
Figure 4 represents hierarchical clustering results. Does it agree with the PCA results? How many genes are there in the dendrogram based on the manuscript? How many genes are upregulated between CD34+ cell and BFU-E based on manuscript and supplementary table 1?
The authors performed SOM analysis. Is the analysis they did unbiased? How many clusters the SOM analysis resulted in? How did authors selected the ones shown in the manuscript? The authors say “Group 1 contained 430 genes and had dramatic downregulation at the CD34 to BFU-E transition.” Are there any other clusters that have similar “dramatic downregulation”? Check the GO results for cluster 13, cluster 9 and cluster 14. What are the differences and similarities?

Task 4: Answers

“the samples were reproducible and exhibit similar transcription profiles”.
Yes, the three replicates are almost identical for CD34, BFU-E and CFU-E while there is some variation (though not big) in Pro. Note, that the genes were selected via statistical analysis so this is not unbiased. There are 1739 genes in the analysis. Manuscript says there are 399 genes upregulated, supplementary table 339.
No, the analysis is not unbiased because they used DEGs. So, this is more visualization or organizing the data rather than clustering. The authors chose cluster 13, but cluster 9 and 14 are quite close. They should have used U-matrix and not consider the nodes as “clusters”. Cluster 9 has basically the same GO terms as 13. It is not at all clear how the authors choose the clusters they show in the ms.

Introduction to Bioinformatics - Exercise 3