The goal of this weeks class is to:
Identify patterns in datasets using scatter plots, parallel plots and heatmaps to assess which variables in samples differentiate them from other samples in the dataset.
Study how computers define similarity among samples in datasets, and group samples into clusters according to similar variables.
See below for a toy dataset in which 3 genes have been measured in 4 patients using RNA-Seq. The benefit of the toy dataset is we can perform some of the calculations ourselves using pen and paper as a proof of concept before moving on to larger datasets later in the lesson.
df=data.frame(
IRX4=c(11,13,2,1),
OCT4=c(10,13,4,3 ),
PAX6=c(1 ,3 ,10,9),
row.names=c("patient1","patient2","patient3","patient4"))
print(df)## IRX4 OCT4 PAX6
## patient1 11 10 1
## patient2 13 13 3
## patient3 2 4 10
## patient4 1 3 9
Let’s make a barplot of the dataset to help visualise the gene expression patterns:
df2=tidyr::gather(cbind(patient=rownames(df),df),key="gene",value="expression",IRX4,PAX6,OCT4)
ggbarplot(df2, x="gene", y="expression", fill="patient", facet.by = "patient")Answer the following questions:
Which genes are up-regulated in patients 1 & 2?
Which gene is up-regulated in patients 3 & 4?
Conceptually, we could place patients 1 & 2 in a group and patients 3 & 4 in their own group based on how similar their genes are.
We can use a scatter plot matrix (SPLOM) to plot each variable (gene) in the dataset. Each plot in the matrix has 2 inputs, a gene on the x-axis and a gene on the y-axis of the plot, producing x and y coordinates for the sample to be plotted.
# create factor vector of each patient (used for colors)
id <- as.factor(rownames(df))
group <- as.factor(c(15, 15, 20, 20))
ann <- data.frame(ID=id,
Group=group)
# Base R plotting
pairs(df, col=id, lower.panel = NULL, cex = 2, pch = 20)
par(xpd = TRUE)
legend(x = 0.05, y = 0.4, cex = 1, legend = ann$ID, fill=unique(ann$ID))par(xpd = NA)Notice how patients 1 & 2 (red and black dots) are close together in 2D space. The same can be said for patients 3 & 4. This plot provides a quick visualisation showing that some samples share similar gene expression profiles.
It is important that you can appreciate that this plot is generated from the toy dataset dataframe previously shown. See below for a guide on how to read how each datapoint was plotted by calculating the input (x,y) coordinates:
(Recall the x-axis is the horizontal, the y-axis is the vertical)
Figure: Interpreting a SPLOM
Now try it yourself:
Patient 3 has the following gene expression values: OCT4: _4_ & PAX6: _10_.
Patient 1 has the following gene expression values: IRX4: ___ & PAX6: ___.
Patient 2 has the following gene expression values: OCT4: ___ & IRX4: ___.
Patient 4 has the following gene expression values: PAX6: ___ & OCT4: ___.
Double check your answers against the original dataframe used to make the plot:
print(df)## IRX4 OCT4 PAX6
## patient1 11 10 1
## patient2 13 13 3
## patient3 2 4 10
## patient4 1 3 9
More importantly, the plot shows which variables (genes) are capable of stratifying the patients. In simpler terms, we can say that PAX6 expression can be used to split the patients into 2 groups - high and low expression. The same can be said for every gene in this toy dataset.
Here is a visualisation of what I mean:
plotfun <- function(x,y,...){
points(x,y,...)
abline(h=7)
abline(v=7)
}
# Base R plotting
pairs(df, col=id, lower.panel = NULL, cex = 2, pch = 20, upper.panel = plotfun, xlim=c(0,14), ylim=c(0,14))
par(xpd = TRUE)
legend(x = 0.05, y = 0.4, cex = 1, legend = as.character(levels(id)), fill=unique(id))par(xpd = NA)I have added a vertical and horizontal line through the center of each plot to reinforce what you should be looking for in a scatter plot matrix - can we draw a line that separates groups in any of the plots?
The answer in this toy dataset is yes - every gene is informative - they can all be used to split the patients into ‘high’ and ‘low’ expression groups.
Parallel plots are used to plot each samples gene expression profile in 1-D space.
What do I mean by 2-D and 1-D space?
1-D, or one dimension, is the same as a numberline we all used to learn how to count. The samples value for a variable will fall somewhere on the line.
2-D or two dimensions, uses as input two variables to produce an x,y coordinate to make a dot - just like the SPLOM.
# Use parcoord() in the MASS package
MASS::parcoord(df, col = id, var.label = TRUE, lwd = 2)
# Add a legend
par(xpd = TRUE)
legend(x = 1.1, y = 1.175, cex = 1,
legend = as.character(levels(id)),
fill = unique(id), horiz = TRUE)In the parallel plot we can see that at each variable (gene), patients 1 & 2 are close together, whilst patients 3 & 4 are further away. If we pretend that patient 1 & 2 belong to Group1 and patients 3 & 4 belong to Group2, we could say that each gene is capable of stratifying each group.
The human eye is extremely efficient at recognising patterns in plots, which I have demonstrated using both scatter plots and parallel plots.
But how does a computer ‘see’ these patterns? It does not have eyes so it must use a different method to define how similar (or dissimilar) samples are!
In this section we will cover the Manhattan Distance - a mathematical method to determine how similar two samples are, based on their input variables (i.e each gene value).
Figure: Manhattan Distance formula
The equation reads:
The manhattan distance between
x and yis the sum of the absolute values ofx - y.
Here, x and y could be patient 1 & patient 2 - or whatever two samples you are comparing.
The sum simply means add up every value
The absolute value means when adding up the sum, do not pay attention to negative signs. i.e -6 becomes +6
See below for an example done by hand:
Figure: Calculate Manhattan Distance by hand
Make an attempt to calculate the manhattan distance between patient 1 & 3, and also patient 1 & 4.
You will notice that when compared to patient 3 and 4, the manhattan distance is larger. This means the samples are dissimilar.
However patient 1 and patient 2 are more similar, as the manhattan distance (7) is smaller compared to the other two manhattan distances.
This is how computers can tell how similar, or dissimilar samples are.
dist() FunctionLuckily in R we can use the dist() function to calculate the manhattan distance for all samples in a matrix. We should never have to perform calculations by hand - but it is good practice to see how the results are generated.
dist(df, method="manhattan", diag = T, upper = T)## patient1 patient2 patient3 patient4
## patient1 0 7 24 25
## patient2 7 0 27 28
## patient3 24 27 0 3
## patient4 25 28 3 0
When samples are compared to itself, (i.e patient 1 vs. patient 1) the manhattan distance is 0 as the samples are identical. This is why we have 0 values along the diagonal of the output. The output is also symmetrical, so we could just print the result like so:
dist(df, method="manhattan", diag=T)## patient1 patient2 patient3 patient4
## patient1 0
## patient2 7 0
## patient3 24 27 0
## patient4 25 28 3 0
Sample heatmaps are used to plot the results of the computed manhattan distances. Plots are more interpretable than a table of digits!
# use Manhattan distance (store in matrix)
d=as.matrix(dist(df, method="manhattan"))
# add patient ID to rows & columns for the heatmap
rownames(d) <- rownames(df)
colnames(d) <- rownames(df)
# Plot the distance matrix using a heatmap
pheatmap(d, cluster_rows = F, cluster_cols = F,
show_rownames = T, show_colnames = T)If how the plot was made is not immediately obvious, let’s add the manhattan distance numbers to the plot so you can see exactly how the colors were generated.
The colors plotted in each cell are created according to the color legend on the right hand side of the plot.
pheatmap(d, cluster_rows = F, cluster_cols = F,
show_rownames = T, show_colnames = T, display_numbers = TRUE)The next step is to use a clustering algorithm called hierarchical clustering to place each sample into a cluster according to how similar samples are. The algorith works by:
Calculating the distance between all samples.
Joining the two ‘closest’ (smallest distance) samples together to form the first cluster.
Re-calculate distances between all samples (and the new cluster) and repeat the process until every sample has been added to a cluster.
See the GIF below for a visual example of this:
Figure: Calculate Manhattan Distance by hand
How can we visualise this GIF in a plot? By using a dendogram which looks like a tree with branches. Each branch represents a cluster, until you work all the way down to the bottom in which case each branch represents a sample:
Figure: Calculate Manhattan Distance by hand
Let’s add a dendogram to our sample heatmap:
# Plot the distance matrix using a heatmap
pheatmap(d, cluster_rows = T, cluster_cols = T,
show_rownames = T, show_colnames = T,
treeheight_row = 100, treeheight_col = 100)Great! We can clearly see that the algorithm thinks there are two clusters in our toy dataset based on the fact the tree has two branches, each containing 2 patients.
Feature heatmaps plot both the samples and variables, allowing us to quickly see patterns in gene expression in each sample. The benefit of feature heatmaps vs. SPLOMS/parallel plots is that we can ask R to perform clustering on the variables, revealing patterns in the data.
clust <- c("Group1", "Group1", "Group2", "Group2")
df <- as.data.frame(t(df))
annotdf = data.frame(row.names = colnames(df),
Group = clust)
pheatmap(df, annotation_col = annotdf,
cluster_cols = TRUE,
cluster_rows = TRUE,
treeheight_row = 100, treeheight_col = 100,
display_numbers = TRUE)The dendogram on the top of the plot clusters our patients (I have added colors to code them according to Group1 or Group 2), but what about the dendogram on the row?
The row-side dendogram shows 2 clusters - 1 containing PAX6 and 1 containing IRX4 and OCT4. This is because IRX4 and OCT4 show the inverse expression to PAX6. That is to say, when patients in Group 1 show high IRX4 and OCT4 expression, their PAX6 expression is low.
We will use the Iris dataset to produce the same plots in the previous section.
The Iris dataset contains measurements of flowers sepal.length, sepal.width, petal.width and petal.length. It also includes information of which flower species the sample belongs to, one of setosa, virginica and versicolor.
The flower measurements are continuous variables - something we can measure. When we look at gene expression datasets later we are measuring how much a gene is expressed. It is important to grasp this concept because the methods we apply here are directly transferable to many different fields of study - we just substitute whatever we are interested in measuring to classify samples.
First things first, lets plot the results of the manhattan distance for all 150 samples in the dataset:
data("iris")
# Calculate distance metrics.
dmat <- dist(iris[,1:4])
# This dataframe links each sample name to its flower species
annotdf <- data.frame(row.names = rownames(as.matrix(dmat)),
Species = iris[,5] )
# Must make colors so we can see which group a sample belongs to
ann_col <- list(Species = c(setosa = "darkorange2",
versicolor = "forestgreen",
virginica = "black"))
# plot heatmap
pheatmap(as.matrix(dmat),
clustering_distance_rows = dmat,
clustering_distance_cols = dmat,
color = greenred(200),
annotation_col = annotdf,
annotation_row = annotdf,
annotation_colors = ann_col,
show_rownames = F, show_colnames = F,
treeheight_row = 100, treeheight_col = 100)Pay close attention to the dendogram (the top and side dendogram are the same in sample heatmaps).
We know that there are 3 flower species, so we want to see how well each sample was placed into the first 3 clusters. We can identify the first 3 clusters by drawing a line across the tree in such a manner that it intersects 3 lines - each representing a cluster:
Figure: Cutting a dendogram tree
Let’s plot what that would look like in our heatmap:
# plot heatmap
pheatmap(as.matrix(dmat),
clustering_distance_rows = dmat,
clustering_distance_cols = dmat,
color = greenred(200),
annotation_col = annotdf,
annotation_row = annotdf,
annotation_colors = ann_col,
show_rownames = F, show_colnames = F,
treeheight_row = 100, treeheight_col = 100,
cutree_cols = 3)The clustering algorithm did a great job of placing every Setosa flower in the correct cluster - it is one solid orange block.
But we can see that the other two clusters contain a mix of both versicolor and virginica samples (green and black species labels).
Why? Think about the inputs used to calculate the manhattan distance (the samples variables i.e sepal.length, sepal.width, petal.width and petal.length). We can say that the virginica and versicolor samples must share similar features (variables) which makes them hard to distinguish.
We already know from the sample heatmap that virginica and versicolor samples are hard to differentiate.
Let’s make a SPLOM to find out why:
# SPLOM plot
# Base R plotting
pairs(iris[,1:4], col=iris[,5], lower.panel = NULL, cex = 2, pch = 20)
par(xpd = TRUE)
legend(x = 0.05, y = 0.4, cex = 1, legend = as.character(levels(iris[,5])), fill=unique(iris[,5]))par(xpd = NA)Instantly obvious is the fact that the Setosa samples (black points) separate from the other two species really well in both petal.width and petal.length.
See the plot below for a visualisation of what I mean:
plotfun <- function(x,y,...){
points(x,y,...)
abline(v=0.75)
}
# SPLOM plot
# Base R plotting
pairs(iris[,1:4], col=iris[,5], lower.panel = NULL, upper.panel = plotfun, cex = 2, pch = 20, verInd = 4)In fact, the petal.width variable could be used on its own to classify each flower species!
Small petal width = Setosa
Medium petal width = Versicolor
Large petal width = Virginica
See the plot below:
plotfun <- function(x,y,...){
points(x,y,...)
abline(v=c(0.75, 1.75))
}
# SPLOM plot
# Base R plotting
pairs(iris[,1:4], col=iris[,5], lower.panel = NULL, upper.panel = plotfun, cex = 2, pch = 20, verInd = 4) ***
There are some overlapping samples, but in my opinion this is the best variable in the dataset that could be used to correctly classify the flower species.
Let’s inspect the same information in 1 dimensional space using a parallel plot:
# Use parcoord() in the MASS package
MASS::parcoord(iris[,1:4], col = iris[,5], var.label = TRUE, lwd = 2)
# Add a legend
par(xpd = TRUE)
legend(x = 1.2, y = 1.175, cex = 1,
legend = as.character(levels(iris[,5])),
fill = unique(iris[,5]), horiz = TRUE)The same information should be obvious:
The variable sepal.length does not do a great job at separating each flower species. We can argue that Setosa species show shorter sepal.lengths than versicolor/virginica samples, so the variable is somewhat useful.
The variable sepal.width does not do a great job at separating each flower species either. We can argue that Setosa species have wider sepal.widths than versicolor/virginica samples, so the variable is somewhat useful.
The variables petal.width and petal.length are pretty good variables - we can see that each species can be divided into small/medium/large values for these two variables.
data("iris")
# Isolate the categorical variable as a vector
species <- iris[,5]
# isolate rownames as a vector
id <- rownames(iris)
# Transpose the continuous variables in the matrix and store 'as.data.frame'
iris <- as.data.frame(t(iris[,1:4]))
# Add back the ID of each sample (columns not rows after transpose)
colnames(iris) <- id
# As before, we are linking the ID's to flower species
annotdf <- data.frame(row.names = colnames(iris),
Species = species)
# Add colors because default colors are terrible
ann_col <- list(Species = c(setosa = "darkorange2",
versicolor = "forestgreen",
virginica = "black"))
# IMPORTANT!!!
# We are using a dendogram generated from euclidean distance
dmat <- dist(as.matrix(t(iris)))
# Make the heatmap
pheatmap(iris,
clustering_distance_cols = dmat, cluster_rows = TRUE,
annotation_col = annotdf, annotation_colors = ann_col,
color = greenred(100), show_colnames = FALSE, show_rownames = TRUE,
scale = "row",
treeheight_row = 100, treeheight_col = 100, cutree_cols = 3)Interpreting the plot is a little more complicated, but it validates the statements we have been previously making:
petal.width and petal.length rows. It is bright red for most of the virginica species (larger value), closer to black for versicolor (medium value) and bright green for setosa (small value). Recall the colors are determined by the color scale on the right hand side of the plot (these values have been scaled to center around 0 - this is beyond the scope of this course).Go to the webpage and answer the questions about the plots in section 3: Worksheet.
The main thing I want you to be able to do is interpret the plots in the same manner as the Iris dataset. Instead of sepal and petal measurements, the dataset consists of 5 genes taken from 3 types of cancer patients: BRCA - breast cancer, LUSC - lung cancer and OV - ovarian cancer.
See for youself how well the clustering algorithm places patients into the correct cluster - and if it does not, examine the gene expression plots in the dataset to find out why.