Lab 5 - Clustering

Name: Virginia Tang

This assignment has 14 questions and a total of 20 points.

RMarkdown

This is an RMarkdown (.Rmd) file. It is an extension of the Markdown format. Like Markdown, it can be “compiled” into a number of other formats like HTML and PDF. RMarkdown also has a special type of code block that causes the code contained within to be run in an R interpreter. The output is then embedded into the document. While it is possible to work with RMarkdown files in the command-line version of R, it is a lot easier to use the RStudio IDE.

The RStudio IDE (integrated development environment) makes it much easier to work with RMarkdown documents. This IDE enables you to compile the RMarkdown document with the click of a button (Knit). I have asked the IT staff to install RStudio on the lab computers. If you already have RStudio installed on your own laptop, then you are welcome to use that. However, this data set is rather large and older computers may have a difficult time with it.

(It is possible to run RStudio on the server and then interact with that instance in a web browser on your local computer! Unfortunately, the user account management system used for this server prevents us from using this approach.)

Open RStudio on your local computer.
- If this is the first time you’ve run RStudio on a GenLab computer, you may get a window titled “Choose R installation”. Select “default 64-bit version of R” and Ok.
- Either opt in or out of Crash Reporting.
Open this Rmd file.
- If this is the first time opening an .Rmd file in this copy of RStudio, you may get a pop-up with the prompt “Install Required Packages”.
- It took about one minute for me to install these packages on the GenLab terminal.

To run the following code block, click on the green “play” button on the right hand side of the block.

1+2  # like this

## [1] 3

If a block contains R code that generates a plot, then the resulting image is also embedded into the document. This provides a very handy way of keeping track of your analysis! Try the code block below.

plot(x=rnorm(10), y=rnorm(10))

Questions are marked with a bold number, i.e., Q0., as usual. If the question requires a text response, write your answer after the > symbol so that it appears as a blockquote.

Gene expression

A gene expression study uses some laboratory method to measure the relative number of mRNA transcripts derived from different genes in a genome. For this exercise, we’re going to use a gene expression data set that was prepared by Michael Love and Rafael Irizarry (https://github.com/genomicsclass/tissuesGeneExpression). In later lectures and labs, you’ll learn more about how to analyze gene expression data properly. For now, we’re just going to take these expression data “as is” without any additional processing.

We are going to work with a binary compressed file that contains multiple R data objects (hence the .rda extension - you will also see files with the extension .RData). The file is pretty large (30 Mb), but it will be easier to visualize the clustering analysis if we run it on our local computers, so we’re going to download it.

Open a terminal on your local computer.
Change to a directory in which you will download this file. In Linux or macOS, I suggest ~/Downloads.
Run sftp <username>@mbilabs-02.schulich.uwo.ca and enter your password.

If you are connecting remotely (outside of Genlab) then you may receive a Duo push notification. You won’t be able to connect via sftp until you confirm your identity in Duo.

Within sftp: * cd /home/shared/data * get tissuesGeneExpression.rda * exit

Next, go back to this document in RStudio and modify the file path in the following command so that it will load the .rda file that you just downloaded:

load("tissuesGeneExpression.rda")

If you are having trouble locating this file, you can use the Files panel in RStudio to browse through your filesystem. Click on the .rda file to load it into the global R workspace.

There should now be three data objects in your workspace: e, tab and tissue. To confirm this, run the command ls().

ls()

## [1] "e"      "tab"    "tissue"

Run this code block to get some information about the first object:

nrow(e)  # number of rows

## [1] 22215

ncol(e)  # number of columns

## [1] 189

str(e)  # detailed summary

##  num [1:22215, 1:189] 10.19 6.04 7.45 12.03 5.27 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : chr [1:22215] "1007_s_at" "1053_at" "117_at" "121_at" ...
##   ..$ : chr [1:189] "GSM11805.CEL.gz" "GSM11814.CEL.gz" "GSM11823.CEL.gz" "GSM11830.CEL.gz" ...

The e object is a numeric matrix. Each column represents a frozen tissue specimen, from which the investigators extracted total RNA. Each row represents a probe set on an Affymetrix Gene Chip. A probe set binds to a specific bit of DNA (oligo). The more DNA that is bound by the probe set, the brighter it gets. Each probe set is labeled with a unique identifier, e.g., 1007_s_at.

A probe set corresponds to a specific mRNA transcript encoded by a gene in the human genome. If you’re curious, the table linking these probe set identifiers to genes can be viewed here: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GPL570

Q1. How many different probes were there on the gene chip used in this experiment? How many specimens? (1 point)

There are 189 specimens and 22215 gene chips

To get a sense of what these expression data look like, let’s convert e into a linear vector and run some basic commands:

vec <- as.numeric(e)  # this command "unravels" the e matrix
summary(vec)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   4.122   5.994   7.177   7.375   8.494  14.532

hist(vec)

Let’s look at what the expression levels are like between the first two specimens:

plot(e[, 1], e[, 2], xlab="Normal 1", ylab="Cancerous 2", 
     pch=19, cex=0.2, col=rgb(0,0,0,0.3))

This plot should drive home the fact that we are dealing with a LOT of data - after all, we are only looking at two specimens out of 189! It is also apparent that expression levels are strongly correlated between specimens. This is generally true for any pair of specimens.

Metadata

The tab object is a data frame that contains metadata about each of the samples. Modify the following code block to get information about the shape and contents of this data frame.

str(tab)

## 'data.frame':    189 obs. of  6 variables:
##  $ filename     : chr  "GSM11805.CEL.gz" "GSM11814.CEL.gz" "GSM11823.CEL.gz" "GSM11830.CEL.gz" ...
##  $ DB_ID        : chr  "GSM11805" "GSM11814" "GSM11823" "GSM11830" ...
##  $ ExperimentID : chr  "GSE781" "GSE781" "GSE781" "GSE781" ...
##  $ Tissue       : chr  "kidney" "kidney" "kidney" "kidney" ...
##  $ SubType      : chr  "normal" "cancer" "normal" "cancer" ...
##  $ ClinicalGroup: chr  NA NA NA NA ...

Q2. Given what you know about the e matrix, what do the rows of the tab data frame correspond to? Explain your reasoning. (1 point)

The rows of the tab data frame correspond to individual specimens. This can be inferred because the number of rows in tab (189) is the same as the number of columns in e, which represent different tissue specimens.

We can use this information to annotate our clusters later on. To access a field (column) in the data frame, use the $ operator. For example, tab$DB_ID will extract all the database identifiers associated with these samples as a character vector.

The table() command in R entabulates a categorical variable. Here, we’re going to use this table command and the $ operator to examine the composition of these data.

table(tab$Tissue)

## 
##  cerebellum       colon endometrium hippocampus      kidney       liver 
##          38          34          15          31          39          26 
##    placenta 
##           6

Modify the following command to summarize the SubType variable:

table(tab$SubType)

## 
##                                                     cancer 
##                                                          8 
##                             endometriosis, broard ligament 
##                                                          7 
## first trimester placental organ explants grown on matrigel 
##                                                          6 
##                                     hepatocellular adenoma 
##                                                          5 
##                                                       ihca 
##                                                          8 
##                                                  incipient 
##                                                          7 
##                                                   moderate 
##                                                          7 
##                                                     normal 
##                                                         85 
##                                      pilocytic astrocytoma 
##                                                          5 
##                                                     severe 
##                                                          7 
##                                                      tumor 
##                                                         34 
##                                             tumor- stage 1 
##                                                          5 
##                                             tumor- stage 2 
##                                                          5

Q3. How many of the tissue specimens are “normal”? What was the next most frequent tissue subtype (i.e., not normal)? (1 point)

85 of them are normal. The most frequent tissue subtype is tumour.

Let’s display the first few rows of the tab data frame.

head(tab)

##            filename    DB_ID ExperimentID Tissue SubType ClinicalGroup
## 288 GSM11805.CEL.gz GSM11805       GSE781 kidney  normal          <NA>
## 289 GSM11814.CEL.gz GSM11814       GSE781 kidney  cancer          <NA>
## 290 GSM11823.CEL.gz GSM11823       GSE781 kidney  normal          <NA>
## 291 GSM11830.CEL.gz GSM11830       GSE781 kidney  cancer          <NA>
## 292 GSM12067.CEL.gz GSM12067       GSE781 kidney  cancer          <NA>
## 293 GSM12075.CEL.gz GSM12075       GSE781 kidney  normal          <NA>

There is a lot of different annotations for SubType, so let’s create a new variable called Cancerous:

tab$Cancerous <- ifelse(tab$SubType=='normal', FALSE, TRUE)

Now let’s cross-tabulate the Tissue and Cancerous categories:

table(tab$Tissue, tab$Cancerous)

##              
##               FALSE TRUE
##   cerebellum     33    5
##   colon           0   34
##   endometrium     8    7
##   hippocampus    10   21
##   kidney         21   18
##   liver          13   13
##   placenta        0    6

Q4. Based on the above table, would it be possible to draw conclusions about associations between gene expression and colon cancer? Why or why not? (1 point)

It would be difficult to draw definitive conclusions because the number of cancerous and normal tissues is imbalanced. For instance, some tissue types might have more cancerous specimens than others, making it challenging to generalize the results. More balanced data would be necessary for more reliable conclusions. In addition, you cannot draw associations between gene expression and colon cancer because the table just shows metadata.

Let’s make sure that the columns of e are ordered the same way as rows in tab, because we’re going to want to link data from one object to the other:

all(names(e) == tab$filename)

## [1] TRUE

Finally, the third data object (tissue) is a vector of strings (a character vector), and it is totally redundant because the same information is contained in tab.

Q5. Use the following code block to prove the above statement. Use the previous code block as reference if you’re not sure how to begin (1 point):

all(tissue == tab$Tissue)

## [1] TRUE

# The statement evaluates to TRUE, proving that tissue is redundant because it contains the same information as the Tissue column in tab.

Calculating distances

To review, we have 189 observations of over 22,000 variables. Each variable represents the expression level associated with a specific probe on the Affymetrix Gene Chip.

To do clustering with these data, we need to calculate a pairwise distance matrix. The function for performing this in R is called dist(). Bring up the help documentation for dist by either prefixing this command name with a ?, or enclosing it in a help() command, and hit enter. (If multiple help pages come up, you want stats::dist.)

(If you are running R on the command line, then you will enter a terminal pager like less. You can scroll through the document using the arrow keys and exit by typing q.)

Q6. What is the default distance measure used by dist? What is another name for this distance? (2 points)

The default distance measure used by dist() is Euclidean distance. Another name for this distance is the 2 norm aka L2.

If we ran the dist function on e, we would get a 22,215-by-22,215 distance matrix in return. (Don’t do this!) Since we are more interested in the relationships among samples, we need to transpose the e matrix using the t() command.

Q7. Combine dist() and t() by nesting one function within the other, and assign the result of processing e to a new variable d. Enter your command below and run it. (1 point)

d <- dist(t(e))

#The distance matrix d was calculated by transposing e and applying the dist() function.

Use the summary and hist functions on d.

summary(d)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0   121.7   151.9   142.7   165.1   241.1

hist(d)

Q8. What is the mean distance between samples? (1 point)

The mean distance between samples is approximately 142.7.

Visualizing the distance matrix

If we pick the first two probe sets, we can make an xy-plot (also known as a “scatterplot”) of the samples:

plot(e[1,], e[2, ], col=ifelse(tab$Cancerous, 'red', 'blue'))

Just looking at two probes seems to reveal a lot about the difference in gene expression patterns between normal and cancerous cells! However, we should be cautious about drawing conclusions from just two probes. If we scale up to all 22,215 variables, we are dealing with a cloud of points in an N-dimensional space where N=22,215.

It is impossible to arrange the 189 data points in two dimensions (i.e., on a screen) in a way that accurately preserves these distances. Put another way, imagine that you have a three-dimensional marshmallow-and-drinking straw sculpture. It will be impossible to “squish” this sculpture onto a tabletop in a way that keeps all marshmallows connected.

So, if we can’t do the impossible, we can at least try to make something that’s “good enough”. Since we’re working with a distance matrix, we can use a method known as multidimensional scaling (MDS). MDS embeds each point into a lower-dimensional space, preserving as much of the distances between points as possible. In other words, it’s a dimensionality reduction method.

There is a function in R called cmdscale that runs MDS on objects of class dist, so let’s use it here:

res <- cmdscale(d)
par(mar=c(5,5,1,1))  # override default figure margins
plot(res, xlab="Coordinate 1", ylab="Coordinate 2")

We can colour this plot with respect to whether the tissue specimen is normal or some cancerous subtype:

plot(res, xlab="Coordinate 1", ylab="Coordinate 2",
     col=ifelse(tab$Cancerous, 'red', 'blue'))

Q9. What does the above MDS plot tell you about the distribution of specimens with respect to gene expression? Do you see clusters? What about the distribution of Cancerous labels among clusters? Provide a brief interpretation of your results: (3 points)

The MDS plot reveals clusters of specimens based on similarities and differences in gene expression patterns. Cancerous specimens (red) tend to form distinct clusters that are largely separate from normal specimens (blue). This suggests that gene expression is likely associated with cancer status, as cancerous and normal tissues show some clear separation in their expression profiles. However, there is some overlap between the two groups, indicating that while gene expression patterns do correlate with cancer status, other factors such as individual variability or technical noise might influence these clusters. Therefore, gene expression alone may not fully distinguish between cancerous and normal specimens.

Let’s colour the plot with respect to Tissue; I’m also going to use point size to represent Cancerous (large if true, small if false):

tab$Tissue <- as.factor(tab$Tissue)
pal <- hcl.colors(n=nlevels(tab$Tissue), palette="Set2", alpha=0.6)
plot(res, xlab="Coordinate 1", ylab="Coordinate 2",
     pch=ifelse(tab$Cancerous, 22, 21), 
     bg=pal[as.integer(tab$Tissue)],
     cex=ifelse(tab$Cancerous, 1.3, 0.7))
legend(x=0, y=130, legend=levels(tab$Tissue), xjust=0.5,
       bty='n', xpd=NA, fill=pal, ncol=4)

Visualizing complex data in two dimensions is very limiting. It is possible to view three dimensions using an interactive animation in R:

require(rgl)  # you may need to run `install.packages("rgl")`

res3 <- cmdscale(d, k=3)

# if an interactive 3D plot does not appear, you may need to run:
# `options(rgl.printRglwidget = TRUE)`
plot3d(res3, col=pal[as.integer(tab$Tissue)], type='s', 
       size=ifelse(tab$Cancerous, 1, 0.5))

Q10. Inspect the interactive 3D plot by click-and-dragging the box. The colour and size scheme is the same as the previous 2D plot. What does adding a third dimension reveal about associations between gene expression, tissue type and cancer state? (2 points)

Adding a third dimension enhances the separation of clusters and helps distinguish tissue types more clearly. The third dimension reveals that the spatial separation between normal and cancerous samples is more distinct when considering the third axis, suggesting deeper associations between gene expression and tissue type.

K-means clustering

Let’s load the cluster package:

require(cluster)  # you may need to run `install.packages("cluster")`

## Loading required package: cluster

Bring up the help text for the command kmeans.

Q11. Use the following code block to run a k-means clustering analysis with 3 centers, and assign the return value to a new variable km. (NOTE kmeans should use the transposed data matrix as input x, not the distance matrix.) (1 point):

set.seed(3)  # to make results comparable among assignments
km <- kmeans(t(e), 3) # complete this command!

You can visualize the k-means clustering results using the following commands:

pal <- hcl.colors(n=nlevels(tab$Tissue), palette="Set2", alpha=0.6)
plot(res, xlab="Coordinate 1", ylab="Coordinate 2",
     pch=ifelse(tab$Cancerous, 22, 21), 
     bg=pal[as.integer(tab$Tissue)],
     cex=ifelse(tab$Cancerous, 1.3, 0.7))
legend(x=0, y=130, legend=levels(tab$Tissue), xjust=0.5,
       bty='n', xpd=NA, fill=pal, ncol=4)

draw.cluster <- function(k, lty=1) {
  x <- res[km$cluster==k, 1]
  y <- res[km$cluster==k, 2]
  ch <- chull(x, y)
  polygon(x[ch], y[ch], lty=lty)
}
draw.cluster(1, lty=1)  # solid line
draw.cluster(2, lty=2)  # dashed line
draw.cluster(3, lty=3)  # dotted line

Q12. Do you think your analysis would benefit from fewer or more centers? Briefly discuss why, with reference to any of the plots that you’ve generated so far. (2 points)

The analysis might benefit from more centers, as there appears to be some separation within the clusters formed by cancerous and normal samples. Increasing the number of clusters might better capture the diversity within each cancerous or normal group. As seen in the 3D plot, there are more than 3 main clusters, thus supporting the idea that there needs to be more than 3 centres.

Hierarchical clustering

Finally, let’s use the hclust function to apply some different hierarchical clustering methods to these data.

Q13. Read the help text for hclust(). What is the default linkage criterion, and what does it mean? (1 point)

The default linkage criterion is complete. The complete criterion finds similar clusters.

Run this code block to view the results of hierarchical clustering under default settings:

hc <- hclust(d, "ward.D2")
plot(hc, cex=0.6, label=paste(tab$Tissue, tab$Cancerous, sep='.'))

Modify the preceding code block to use Ward’s clustering criterion, and then adjust the following code block to “cut” the hclust tree (dendrogram) at a given height (h) to retrieve cluster assignments for the 189 samples:

hclus <- cutree(hc, h=150)  # make sure you set h to some other value!
table(tab$Tissue, hclus, tab$Cancerous)

## , ,  = FALSE
## 
##              hclus
##                1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18
##   cerebellum   0  0  0  0 31  0  0  0  0  0  2  0  0  0  0  0  0  0
##   colon        0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
##   endometrium  0  0  0  0  0  0  0  0  0  0  0  0  7  1  0  0  0  0
##   hippocampus  0  0  0 10  0  0  0  0  0  0  0  0  0  0  0  0  0  0
##   kidney       1 10  0  0  0  0  0  8  0  0  2  0  0  0  0  0  0  0
##   liver        0  0  0  0  0  0  0  0  7  0  0  2  0  0  4  0  0  0
##   placenta     0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
## 
## , ,  = TRUE
## 
##              hclus
##                1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18
##   cerebellum   0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  5  0  0
##   colon        0  0  0  0  0  0 34  0  0  0  0  0  0  0  0  0  0  0
##   endometrium  0  0  0  0  0  0  0  0  0  0  0  0  0  7  0  0  0  0
##   hippocampus  0  0  8 13  0  0  0  0  0  0  0  0  0  0  0  0  0  0
##   kidney       8  0  0  0  0 10  0  0  0  0  0  0  0  0  0  0  0  0
##   liver        0  0  0  0  0  0  0  0  0  5  0  0  0  0  8  0  0  0
##   placenta     0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  2  4

Q14. Write a brief summary of the tabular results from your hierarchical clustering analysis, i.e., how your clusters map to tissue type and normal/cancerous subtypes: (3 points)

The results from the hierarchical clustering analysis show distinct groupings based on tissue types and the cancerous status of the specimens. Upon cutting the dendrogram at a specified height, the clusters reveal the following trends: Cluster Composition: Each cluster contains a mixture of normal and cancerous specimens, indicating that gene expression levels can vary significantly within each tissue type. However, certain clusters appear to predominantly contain either normal or cancerous samples, suggesting possible underlying biological differences. Tissue Types: Specific tissue types are more frequently associated with cancerous specimens, as observed in certain clusters. For example, if a cluster predominantly contains tumor samples, this might reflect a unique gene expression profile associated with that specific tissue type. Cancerous vs. Normal: The analysis indicates that while some cancerous samples cluster together, there is overlap with normal samples, which implies that gene expression profiles of normal and cancerous tissues can be quite similar, potentially complicating the identification of cancerous tissues based solely on gene expression data. Overall, the hierarchical clustering analysis provides insights into the relationships between tissue types and cancer states, highlighting the complexity of gene expression patterns across different specimens.

When you done with your assignment, click the “Knit” button to convert this file into an HTML document. Upload the resulting HTML file as an attachment in Brightspace to submit your assignment.