Name: Virginia Tang
This assignment has 14 questions and a total of 20 points.
This is an RMarkdown (.Rmd) file. It is an extension of
the Markdown format. Like Markdown, it can be “compiled” into a number
of other formats like HTML and PDF. RMarkdown also has a special type of
code block that causes the code contained within to be run in an R
interpreter. The output is then embedded into the document. While it is
possible to work with RMarkdown files in the command-line version of R,
it is a lot easier to use the RStudio IDE.
The RStudio IDE (integrated development environment) makes it much
easier to work with RMarkdown documents. This IDE enables you to compile
the RMarkdown document with the click of a button (Knit). I
have asked the IT staff to install RStudio on the lab computers. If you
already have RStudio installed on your own laptop, then you are welcome
to use that. However, this data set is rather large and older computers
may have a difficult time with it.
(It is possible to run RStudio on the server and then interact with that instance in a web browser on your local computer! Unfortunately, the user account management system used for this server prevents us from using this approach.)
.Rmd file in this
copy of RStudio, you may get a pop-up with the prompt “Install Required
Packages”.To run the following code block, click on the green “play” button on the right hand side of the block.
1+2 # like this
## [1] 3
If a block contains R code that generates a plot, then the resulting image is also embedded into the document. This provides a very handy way of keeping track of your analysis! Try the code block below.
plot(x=rnorm(10), y=rnorm(10))
Questions are marked with a bold number, i.e., Q0.,
as usual. If the question requires a text response, write your answer
after the > symbol so that it appears as a
blockquote.
A gene expression study uses some laboratory method to measure the relative number of mRNA transcripts derived from different genes in a genome. For this exercise, we’re going to use a gene expression data set that was prepared by Michael Love and Rafael Irizarry (https://github.com/genomicsclass/tissuesGeneExpression). In later lectures and labs, you’ll learn more about how to analyze gene expression data properly. For now, we’re just going to take these expression data “as is” without any additional processing.
We are going to work with a binary compressed file that contains
multiple R data objects (hence the .rda extension - you
will also see files with the extension .RData). The file is
pretty large (30 Mb), but it will be easier to visualize the clustering
analysis if we run it on our local computers, so we’re going to download
it.
~/Downloads.sftp <username>@mbilabs-02.schulich.uwo.ca
and enter your password.If you are connecting remotely (outside of Genlab) then you may
receive a Duo push notification. You won’t be able to connect via
sftp until you confirm your identity in Duo.
Within sftp: * cd /home/shared/data *
get tissuesGeneExpression.rda * exit
Next, go back to this document in RStudio and modify the file path in
the following command so that it will load the .rda file
that you just downloaded:
load("tissuesGeneExpression.rda")
If you are having trouble locating this file, you can use the
Files panel in RStudio to browse through your
filesystem. Click on the .rda file to load it into the
global R workspace.
There should now be three data objects in your workspace:
e, tab and tissue. To confirm
this, run the command ls().
ls()
## [1] "e" "tab" "tissue"
Run this code block to get some information about the first object:
nrow(e) # number of rows
## [1] 22215
ncol(e) # number of columns
## [1] 189
str(e) # detailed summary
## num [1:22215, 1:189] 10.19 6.04 7.45 12.03 5.27 ...
## - attr(*, "dimnames")=List of 2
## ..$ : chr [1:22215] "1007_s_at" "1053_at" "117_at" "121_at" ...
## ..$ : chr [1:189] "GSM11805.CEL.gz" "GSM11814.CEL.gz" "GSM11823.CEL.gz" "GSM11830.CEL.gz" ...
The e object is a numeric matrix. Each column represents
a frozen tissue specimen, from which the investigators extracted total
RNA. Each row represents a probe set on an Affymetrix Gene Chip. A probe
set binds to a specific bit of DNA (oligo). The more DNA that is bound
by the probe set, the brighter it gets. Each probe set is labeled with a
unique identifier, e.g., 1007_s_at.
A probe set corresponds to a specific mRNA transcript encoded by a gene in the human genome. If you’re curious, the table linking these probe set identifiers to genes can be viewed here: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GPL570
Q1. How many different probes were there on the gene chip used in this experiment? How many specimens? (1 point)
There are 189 specimens and 22215 gene chips
To get a sense of what these expression data look like, let’s convert
e into a linear vector and run some basic commands:
vec <- as.numeric(e) # this command "unravels" the e matrix
summary(vec)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.122 5.994 7.177 7.375 8.494 14.532
hist(vec)
Let’s look at what the expression levels are like between the first two specimens:
plot(e[, 1], e[, 2], xlab="Normal 1", ylab="Cancerous 2",
pch=19, cex=0.2, col=rgb(0,0,0,0.3))
This plot should drive home the fact that we are dealing with a LOT of data - after all, we are only looking at two specimens out of 189! It is also apparent that expression levels are strongly correlated between specimens. This is generally true for any pair of specimens.
The tab object is a data frame that contains metadata
about each of the samples. Modify the following code block to get
information about the shape and contents of this data frame.
str(tab)
## 'data.frame': 189 obs. of 6 variables:
## $ filename : chr "GSM11805.CEL.gz" "GSM11814.CEL.gz" "GSM11823.CEL.gz" "GSM11830.CEL.gz" ...
## $ DB_ID : chr "GSM11805" "GSM11814" "GSM11823" "GSM11830" ...
## $ ExperimentID : chr "GSE781" "GSE781" "GSE781" "GSE781" ...
## $ Tissue : chr "kidney" "kidney" "kidney" "kidney" ...
## $ SubType : chr "normal" "cancer" "normal" "cancer" ...
## $ ClinicalGroup: chr NA NA NA NA ...
Q2. Given what you know about the e
matrix, what do the rows of the tab data frame correspond
to? Explain your reasoning. (1 point)
The rows of the tab data frame correspond to individual specimens. This can be inferred because the number of rows in tab (189) is the same as the number of columns in e, which represent different tissue specimens.
We can use this information to annotate our clusters later on. To
access a field (column) in the data frame, use the $
operator. For example, tab$DB_ID will extract all the
database identifiers associated with these samples as a character
vector.
The table() command in R entabulates a categorical
variable. Here, we’re going to use this table command and
the $ operator to examine the composition of these
data.
table(tab$Tissue)
##
## cerebellum colon endometrium hippocampus kidney liver
## 38 34 15 31 39 26
## placenta
## 6
Modify the following command to summarize the SubType
variable:
table(tab$SubType)
##
## cancer
## 8
## endometriosis, broard ligament
## 7
## first trimester placental organ explants grown on matrigel
## 6
## hepatocellular adenoma
## 5
## ihca
## 8
## incipient
## 7
## moderate
## 7
## normal
## 85
## pilocytic astrocytoma
## 5
## severe
## 7
## tumor
## 34
## tumor- stage 1
## 5
## tumor- stage 2
## 5
Q3. How many of the tissue specimens are “normal”? What was the next most frequent tissue subtype (i.e., not normal)? (1 point)
85 of them are normal. The most frequent tissue subtype is tumour.
Let’s display the first few rows of the tab data
frame.
head(tab)
## filename DB_ID ExperimentID Tissue SubType ClinicalGroup
## 288 GSM11805.CEL.gz GSM11805 GSE781 kidney normal <NA>
## 289 GSM11814.CEL.gz GSM11814 GSE781 kidney cancer <NA>
## 290 GSM11823.CEL.gz GSM11823 GSE781 kidney normal <NA>
## 291 GSM11830.CEL.gz GSM11830 GSE781 kidney cancer <NA>
## 292 GSM12067.CEL.gz GSM12067 GSE781 kidney cancer <NA>
## 293 GSM12075.CEL.gz GSM12075 GSE781 kidney normal <NA>
There is a lot of different annotations for SubType, so
let’s create a new variable called Cancerous:
tab$Cancerous <- ifelse(tab$SubType=='normal', FALSE, TRUE)
Now let’s cross-tabulate the Tissue and
Cancerous categories:
table(tab$Tissue, tab$Cancerous)
##
## FALSE TRUE
## cerebellum 33 5
## colon 0 34
## endometrium 8 7
## hippocampus 10 21
## kidney 21 18
## liver 13 13
## placenta 0 6
Q4. Based on the above table, would it be possible to draw conclusions about associations between gene expression and colon cancer? Why or why not? (1 point)
It would be difficult to draw definitive conclusions because the number of cancerous and normal tissues is imbalanced. For instance, some tissue types might have more cancerous specimens than others, making it challenging to generalize the results. More balanced data would be necessary for more reliable conclusions. In addition, you cannot draw associations between gene expression and colon cancer because the table just shows metadata.
Let’s make sure that the columns of e are ordered the
same way as rows in tab, because we’re going to want to
link data from one object to the other:
all(names(e) == tab$filename)
## [1] TRUE
Finally, the third data object (tissue) is a vector of
strings (a character vector), and it is totally redundant because the
same information is contained in tab.
Q5. Use the following code block to prove the above statement. Use the previous code block as reference if you’re not sure how to begin (1 point):
all(tissue == tab$Tissue)
## [1] TRUE
# The statement evaluates to TRUE, proving that tissue is redundant because it contains the same information as the Tissue column in tab.
To review, we have 189 observations of over 22,000 variables. Each variable represents the expression level associated with a specific probe on the Affymetrix Gene Chip.
To do clustering with these data, we need to calculate a pairwise
distance matrix. The function for performing this in R is called
dist(). Bring up the help documentation for
dist by either prefixing this command name with a
?, or enclosing it in a help() command, and
hit enter. (If multiple help pages come up, you want
stats::dist.)
(If you are running R on the command line, then you will enter a
terminal pager like less. You can scroll through the
document using the arrow keys and exit by typing q.)
Q6. What is the default distance measure used by
dist? What is another name for this distance? (2
points)
The default distance measure used by dist() is Euclidean distance. Another name for this distance is the 2 norm aka L2.
If we ran the dist function on e, we would
get a 22,215-by-22,215 distance matrix in return. (Don’t do this!) Since
we are more interested in the relationships among samples, we
need to transpose the e matrix using the t()
command.
Q7. Combine dist() and t()
by nesting one function within the other, and assign the result of
processing e to a new variable d. Enter your
command below and run it. (1 point)
d <- dist(t(e))
#The distance matrix d was calculated by transposing e and applying the dist() function.
Use the summary and hist functions on
d.
summary(d)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 121.7 151.9 142.7 165.1 241.1
hist(d)
Q8. What is the mean distance between samples? (1 point)
The mean distance between samples is approximately 142.7.
If we pick the first two probe sets, we can make an xy-plot (also known as a “scatterplot”) of the samples:
plot(e[1,], e[2, ], col=ifelse(tab$Cancerous, 'red', 'blue'))
Just looking at two probes seems to reveal a lot about the difference in gene expression patterns between normal and cancerous cells! However, we should be cautious about drawing conclusions from just two probes. If we scale up to all 22,215 variables, we are dealing with a cloud of points in an N-dimensional space where N=22,215.
It is impossible to arrange the 189 data points in two dimensions (i.e., on a screen) in a way that accurately preserves these distances. Put another way, imagine that you have a three-dimensional marshmallow-and-drinking straw sculpture. It will be impossible to “squish” this sculpture onto a tabletop in a way that keeps all marshmallows connected.
So, if we can’t do the impossible, we can at least try to make something that’s “good enough”. Since we’re working with a distance matrix, we can use a method known as multidimensional scaling (MDS). MDS embeds each point into a lower-dimensional space, preserving as much of the distances between points as possible. In other words, it’s a dimensionality reduction method.
There is a function in R called cmdscale that runs MDS
on objects of class dist, so let’s use it here:
res <- cmdscale(d)
par(mar=c(5,5,1,1)) # override default figure margins
plot(res, xlab="Coordinate 1", ylab="Coordinate 2")
We can colour this plot with respect to whether the tissue specimen is normal or some cancerous subtype:
plot(res, xlab="Coordinate 1", ylab="Coordinate 2",
col=ifelse(tab$Cancerous, 'red', 'blue'))
Q9. What does the above MDS plot tell you about the
distribution of specimens with respect to gene expression? Do you see
clusters? What about the distribution of Cancerous labels
among clusters? Provide a brief interpretation of your results:
(3 points)
The MDS plot reveals clusters of specimens based on similarities and differences in gene expression patterns. Cancerous specimens (red) tend to form distinct clusters that are largely separate from normal specimens (blue). This suggests that gene expression is likely associated with cancer status, as cancerous and normal tissues show some clear separation in their expression profiles. However, there is some overlap between the two groups, indicating that while gene expression patterns do correlate with cancer status, other factors such as individual variability or technical noise might influence these clusters. Therefore, gene expression alone may not fully distinguish between cancerous and normal specimens.
Let’s colour the plot with respect to Tissue; I’m also
going to use point size to represent Cancerous (large if
true, small if false):
tab$Tissue <- as.factor(tab$Tissue)
pal <- hcl.colors(n=nlevels(tab$Tissue), palette="Set2", alpha=0.6)
plot(res, xlab="Coordinate 1", ylab="Coordinate 2",
pch=ifelse(tab$Cancerous, 22, 21),
bg=pal[as.integer(tab$Tissue)],
cex=ifelse(tab$Cancerous, 1.3, 0.7))
legend(x=0, y=130, legend=levels(tab$Tissue), xjust=0.5,
bty='n', xpd=NA, fill=pal, ncol=4)
Visualizing complex data in two dimensions is very limiting. It is possible to view three dimensions using an interactive animation in R:
require(rgl) # you may need to run `install.packages("rgl")`
res3 <- cmdscale(d, k=3)
# if an interactive 3D plot does not appear, you may need to run:
# `options(rgl.printRglwidget = TRUE)`
plot3d(res3, col=pal[as.integer(tab$Tissue)], type='s',
size=ifelse(tab$Cancerous, 1, 0.5))
Q10. Inspect the interactive 3D plot by click-and-dragging the box. The colour and size scheme is the same as the previous 2D plot. What does adding a third dimension reveal about associations between gene expression, tissue type and cancer state? (2 points)
Adding a third dimension enhances the separation of clusters and helps distinguish tissue types more clearly. The third dimension reveals that the spatial separation between normal and cancerous samples is more distinct when considering the third axis, suggesting deeper associations between gene expression and tissue type.
Let’s load the cluster package:
require(cluster) # you may need to run `install.packages("cluster")`
## Loading required package: cluster
Bring up the help text for the command kmeans.
Q11. Use the following code block to run a k-means
clustering analysis with 3 centers, and assign the return value to a new
variable km. (NOTE kmeans
should use the transposed data matrix as input x,
not the distance matrix.) (1 point):
set.seed(3) # to make results comparable among assignments
km <- kmeans(t(e), 3) # complete this command!
You can visualize the k-means clustering results using the following commands:
pal <- hcl.colors(n=nlevels(tab$Tissue), palette="Set2", alpha=0.6)
plot(res, xlab="Coordinate 1", ylab="Coordinate 2",
pch=ifelse(tab$Cancerous, 22, 21),
bg=pal[as.integer(tab$Tissue)],
cex=ifelse(tab$Cancerous, 1.3, 0.7))
legend(x=0, y=130, legend=levels(tab$Tissue), xjust=0.5,
bty='n', xpd=NA, fill=pal, ncol=4)
draw.cluster <- function(k, lty=1) {
x <- res[km$cluster==k, 1]
y <- res[km$cluster==k, 2]
ch <- chull(x, y)
polygon(x[ch], y[ch], lty=lty)
}
draw.cluster(1, lty=1) # solid line
draw.cluster(2, lty=2) # dashed line
draw.cluster(3, lty=3) # dotted line
Q12. Do you think your analysis would benefit from fewer or more centers? Briefly discuss why, with reference to any of the plots that you’ve generated so far. (2 points)
The analysis might benefit from more centers, as there appears to be some separation within the clusters formed by cancerous and normal samples. Increasing the number of clusters might better capture the diversity within each cancerous or normal group. As seen in the 3D plot, there are more than 3 main clusters, thus supporting the idea that there needs to be more than 3 centres.
Finally, let’s use the hclust function to apply some
different hierarchical clustering methods to these data.
Q13. Read the help text for hclust().
What is the default linkage criterion, and what does it mean? (1
point)
The default linkage criterion is complete. The complete criterion finds similar clusters.
Run this code block to view the results of hierarchical clustering under default settings:
hc <- hclust(d, "ward.D2")
plot(hc, cex=0.6, label=paste(tab$Tissue, tab$Cancerous, sep='.'))
Modify the preceding code block to use Ward’s clustering criterion,
and then adjust the following code block to “cut” the hclust tree
(dendrogram) at a given height (h) to retrieve cluster
assignments for the 189 samples:
hclus <- cutree(hc, h=150) # make sure you set h to some other value!
table(tab$Tissue, hclus, tab$Cancerous)
## , , = FALSE
##
## hclus
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
## cerebellum 0 0 0 0 31 0 0 0 0 0 2 0 0 0 0 0 0 0
## colon 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## endometrium 0 0 0 0 0 0 0 0 0 0 0 0 7 1 0 0 0 0
## hippocampus 0 0 0 10 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## kidney 1 10 0 0 0 0 0 8 0 0 2 0 0 0 0 0 0 0
## liver 0 0 0 0 0 0 0 0 7 0 0 2 0 0 4 0 0 0
## placenta 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
##
## , , = TRUE
##
## hclus
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
## cerebellum 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 5 0 0
## colon 0 0 0 0 0 0 34 0 0 0 0 0 0 0 0 0 0 0
## endometrium 0 0 0 0 0 0 0 0 0 0 0 0 0 7 0 0 0 0
## hippocampus 0 0 8 13 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## kidney 8 0 0 0 0 10 0 0 0 0 0 0 0 0 0 0 0 0
## liver 0 0 0 0 0 0 0 0 0 5 0 0 0 0 8 0 0 0
## placenta 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 4
Q14. Write a brief summary of the tabular results from your hierarchical clustering analysis, i.e., how your clusters map to tissue type and normal/cancerous subtypes: (3 points)
The results from the hierarchical clustering analysis show distinct groupings based on tissue types and the cancerous status of the specimens. Upon cutting the dendrogram at a specified height, the clusters reveal the following trends: Cluster Composition: Each cluster contains a mixture of normal and cancerous specimens, indicating that gene expression levels can vary significantly within each tissue type. However, certain clusters appear to predominantly contain either normal or cancerous samples, suggesting possible underlying biological differences. Tissue Types: Specific tissue types are more frequently associated with cancerous specimens, as observed in certain clusters. For example, if a cluster predominantly contains tumor samples, this might reflect a unique gene expression profile associated with that specific tissue type. Cancerous vs. Normal: The analysis indicates that while some cancerous samples cluster together, there is overlap with normal samples, which implies that gene expression profiles of normal and cancerous tissues can be quite similar, potentially complicating the identification of cancerous tissues based solely on gene expression data. Overall, the hierarchical clustering analysis provides insights into the relationships between tissue types and cancer states, highlighting the complexity of gene expression patterns across different specimens.
When you done with your assignment, click the “Knit” button to convert this file into an HTML document. Upload the resulting HTML file as an attachment in Brightspace to submit your assignment.