Introduction to Bioinformatics - Exercise 3

Answers should be returned in PDF format by 9:00 am on Wednesday, 15 February, to alexandra.lahtinen@helsinki.fi

Task 1: Clustering analysis

Download the table ALL_Expression_Data.csv from Moodle. The file contains Acute Lymphoblastic Leukemia (ALL) differentially expressed genes, obtained through microarray experiments. Load the matrix into R, report the number of rows and columns, and transpose the matrix so that the rows take the place of the columns and vice versa.
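The loading and transposing step can be sketched as follows. This is only an illustration: it writes a small random toy table to a temporary file in place of ALL_Expression_Data.csv, and it assumes that the first column of the file holds the row names. The object names toy, expr and expr_t are made up for this example.

```r
# Toy stand-in for ALL_Expression_Data.csv (assumption: first column = row names).
set.seed(1)
toy <- matrix(round(rnorm(20), 2), nrow = 5,
              dimnames = list(paste0("gene", 1:5), paste0("sample", 1:4)))
csv_path <- file.path(tempdir(), "toy_expression.csv")
write.csv(toy, csv_path)

# Read the table and convert the data frame to a numeric matrix.
expr <- as.matrix(read.csv(csv_path, row.names = 1, check.names = FALSE))
cat("rows:", nrow(expr), "columns:", ncol(expr), "\n")

# Transpose: rows become columns and vice versa.
expr_t <- t(expr)
cat("after transposing, rows:", nrow(expr_t), "columns:", ncol(expr_t), "\n")
```

With the real data, replace csv_path with the path to the downloaded file and check that the row names end up where you expect them.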

(a) Perform hierarchical clustering of the columns to see whether there are groups of similar columns. You can use the default Euclidean distance when computing the distance matrix. Do two different clusterings, one with single linkage and one with complete linkage, and visualize both. You may use the png() and dev.off() functions to save your figures automatically. How do the clusterings obtained with the different linkage methods differ from each other? (Hint: the package cluster contains several clustering functions, or you may use the dist and hclust functions.)
(b) Make a heatmap plot of your data. Are the data clustered in the same way as in one of your previous clustering plots? What is visualized in addition? (Hint: function heatmap.)
(c) Download the table OV_TCGA_Mutation.csv from Moodle. This table contains binary values representing whether or not a certain gene is mutated in a certain patient. The table was obtained from the Ovarian Cancer TCGA data used in exercise set 1, and a subset of 30 patients was selected for this exercise. Perform hierarchical clustering of these binary data.
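A minimal sketch of the clustering steps, using small random toy matrices instead of the Moodle tables. All object and file names here are made up for illustration. Note that dist() compares the rows of its input, so a matrix is transposed when its columns are the items to cluster. For the 0/1 data the sketch uses method = "binary" in dist(), which is one possible choice and not necessarily the intended one.

```r
set.seed(1)
# Toy data: 6 features (rows) x 8 items to cluster (columns).
m <- matrix(rnorm(48), nrow = 6,
            dimnames = list(paste0("f", 1:6), paste0("c", 1:8)))

# dist() works on rows, so transpose to cluster the columns.
d <- dist(t(m), method = "euclidean")
hc_single   <- hclust(d, method = "single")
hc_complete <- hclust(d, method = "complete")

# Save both dendrograms side by side into one PNG file.
dendro_file <- file.path(tempdir(), "dendrograms.png")
png(dendro_file, width = 900, height = 450)
par(mfrow = c(1, 2))
plot(hc_single, main = "Single linkage")
plot(hc_complete, main = "Complete linkage")
dev.off()

# heatmap() clusters rows and columns itself
# (by default with dist() and hclust(), and it scales each row).
png(file.path(tempdir(), "heatmap.png"))
heatmap(m)
dev.off()

# Toy binary mutation table: 10 genes x 5 patients.
b <- matrix(rbinom(50, 1, 0.3), nrow = 10,
            dimnames = list(paste0("gene", 1:10), paste0("p", 1:5)))
d_bin  <- dist(t(b), method = "binary")  # distance suited to 0/1 data
hc_bin <- hclust(d_bin, method = "complete")
```

The figures are written to the temporary directory returned by tempdir(); change the file paths to keep them.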

Task 2: Dimensionality reduction

(a) Perform Principal Component Analysis (PCA) on the ALL expression dataset and plot the first two principal components. Do you see any similarity with the hierarchical clustering? (Hint: check the package pcaMethods in Bioconductor.)
(b) Now perform and visualize Multidimensional Scaling (MDS). Compare this plot with the one obtained through PCA. (Hint: check the cmdscale function.)
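The two projections can be sketched on a toy matrix as follows. This example uses base R's prcomp() in place of the pcaMethods package mentioned in the hint, and the toy data and object names are assumptions made for illustration. Samples are placed in rows, because both prcomp() and dist() treat rows as observations.

```r
set.seed(1)
# Toy data: 10 samples (rows) x 6 genes (columns).
x <- matrix(rnorm(60), nrow = 10,
            dimnames = list(paste0("s", 1:10), paste0("g", 1:6)))

# PCA with base R; scores of the samples are stored in pca$x.
pca <- prcomp(x, center = TRUE, scale. = FALSE)

# Classical MDS on Euclidean distances between samples, two dimensions.
mds <- cmdscale(dist(x), k = 2)

# Plot both projections next to each other into a PDF file.
plot_file <- file.path(tempdir(), "pca_mds.pdf")
pdf(plot_file, width = 9, height = 4.5)
par(mfrow = c(1, 2))
plot(pca$x[, 1], pca$x[, 2], xlab = "PC1", ylab = "PC2", main = "PCA")
text(pca$x[, 1], pca$x[, 2], labels = rownames(x), pos = 3)
plot(mds[, 1], mds[, 2], xlab = "Dim 1", ylab = "Dim 2", main = "MDS")
text(mds[, 1], mds[, 2], labels = rownames(x), pos = 3)
dev.off()
```

When comparing the two plots, keep in mind that the sign of each axis is arbitrary in both methods, so one plot may appear mirrored relative to the other.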

Task 3: Signature analysis

To complete this task you will need to read the paper by Alexandrov et al., “Signatures of mutational processes in human cancer”, Nature, 2013. You can find the paper in Moodle. In this task we also use the NMF and maftools packages and the Ovarian Cancer TCGA mutation data we analyzed in exercise set 1. Read the allSamplesMaf.maf file you created in exercise set 1 and create a maf object. If you are running on a Windows machine, download the precomputed trinucleotide matrix from here to your data folder and use it for the following task. You can read the file into R using these commands:

dataPath <- "~/Desktop/IntroBioinfo2017/Exercise3"

tnm <- read.table(file.path(dataPath, "tnm.csv"), header = TRUE, stringsAsFactors = FALSE, row.names = 1, check.names = FALSE, sep = "\t")

If you want to extract the trinucleotide matrix yourself and you are working on a Linux machine, you can download the needed hg19 reference fasta file from here and its index from here to your data folder and then extract the files.

You will also need to use the TCGA barcodes of patients having either BRCA1 or BRCA2 mutations, which you have identified in exercise set 1.

brcaMutated <- c("TCGA-04-1331-01", "TCGA-04-1357-01", "TCGA-09-2050-01", "TCGA-13-0730-01", "TCGA-13-0804-01", "TCGA-13-0885-01", "TCGA-13-0890-01", "TCGA-13-1481-01", "TCGA-13-1489-01", "TCGA-23-1026-01", "TCGA-23-1030-01", "TCGA-24-1103-01", "TCGA-24-1555-01", "TCGA-24-2035-01", "TCGA-25-1625-01", "TCGA-25-1630-01", "TCGA-25-1632-01", "TCGA-29-2427-01", "TCGA-13-0726-01", "TCGA-24-1470-01")

(a) Extract mutational signatures from the mutation data. How many signatures do you find based on rank estimation, and to which validated signatures are they most similar? (Hint: use prefix = 'chr' and add = TRUE when obtaining the trinucleotideMatrix if you are extracting it yourself.)
(b) Alexandrov et al. found two mutational signatures to be active in ovarian cancer. What is the proposed aetiology for each of these two signatures? If you found a different number of signatures in task (a), repeat the extraction with a manually specified rank = 2.
(c) One of these two signatures has been associated with BRCA1 and BRCA2 mutations. Compare the contribution of that signature between BRCA1/2 mutated cancers and the rest. A boxplot is a good visualization option here. Is the difference statistically significant? (Hint: the contribution matrix has sample barcodes with “-” changed to “.”; you need to change them back so that they are comparable to the sample barcodes in brcaMutated. If you downloaded the precomputed matrix, you don’t have to worry about this.)
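The barcode fix and the group comparison might look like the sketch below. Everything in it is illustrative: the contribution matrix is filled with random numbers, the row names Signature_1 and Signature_2 and the three "TCGA-XX-..." barcodes are made-up placeholders, and only the first three entries of brcaMutated are used. The Wilcoxon rank-sum test is one reasonable choice for comparing two groups and is an assumption here, not a prescribed method.

```r
set.seed(1)
# Shortened copy of brcaMutated, for illustration only.
brca_toy <- c("TCGA-04-1331-01", "TCGA-04-1357-01", "TCGA-09-2050-01")
# Made-up placeholder barcodes for samples without BRCA1/2 mutations.
other_toy <- c("TCGA-XX-0001-01", "TCGA-XX-0002-01", "TCGA-XX-0003-01")

# Hypothetical contribution matrix (signatures x samples) with dotted barcodes.
contrib <- matrix(runif(12), nrow = 2,
                  dimnames = list(c("Signature_1", "Signature_2"),
                                  gsub("-", ".", c(brca_toy, other_toy), fixed = TRUE)))

# Change "." back to "-" so the names match the original barcodes.
# fixed = TRUE is needed because "." is a wildcard in regular expressions.
colnames(contrib) <- gsub(".", "-", colnames(contrib), fixed = TRUE)

# Split the samples into two groups.
is_brca <- colnames(contrib) %in% brca_toy
group <- factor(ifelse(is_brca, "BRCA1/2 mutated", "Other"))

# Contribution of one signature (pick the relevant one in the real analysis).
sig <- contrib["Signature_1", ]

# Boxplot of the contribution per group, saved to a PDF file.
pdf(file.path(tempdir(), "signature_boxplot.pdf"))
boxplot(sig ~ group, ylab = "Signature contribution", xlab = "")
dev.off()

# Non-parametric two-group comparison.
wt <- wilcox.test(sig ~ group)
print(wt$p.value)
```

In the real analysis, use the full brcaMutated vector and the contribution matrix returned by the signature extraction.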

Task 4: Article

Read the paper by Li et al., “Isolation and transcriptome analyses of human erythroid progenitors: BFU-E and CFU-E”, Blood, 2014, available in Moodle, and answer the following questions.

(a) What are the authors’ main conclusions from the PCA? Do you agree with them?
(b) Figure 4 represents hierarchical clustering results. Does it agree with the PCA results? How many genes are there in the dendrogram according to the manuscript? How many genes are upregulated between CD34+ cells and BFU-E according to the manuscript and supplementary table 1?
(c) The authors performed a self-organizing map (SOM) analysis. Is the analysis they did unbiased? How many clusters did the SOM analysis result in? How did the authors select the ones shown in the manuscript? The authors say “Group 1 contained 430 genes and had dramatic downregulation at the CD34 to BFU-E transition.” Are there any other clusters that have a similar “dramatic downregulation”? Check the GO results for cluster 13, cluster 9 and cluster 14. What are the differences and similarities?
