Only do this if needed
# install.packages("ggplot2")
# install.packages("ggpubr")
library(compbio4all)
library(ggplot2)
library(ggpubr)
using color to display a third dimenion, distance, Euclidean distance, numeric data, binary data, categorical data, dist()
This is a toy analysis to get accross a specific point about distances. The data are real and there are interesting biological questions about these data, but this is not a way explore them.
From https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2651158/table/T1/
# data
GC <- c(42.3,37.5,22.6,19.4)
mb <- c(26.8,23.5,23.1,23.3)
genes <- c(5433,5188,5875,5403)
chromosomes <- c(14,14,14,14)
spp <- c("vivax", "knowlesi","yoelii","falciparum")
# make dataframe
plasmo <- data.frame(spp, mb, GC,genes)
# label rows
row.names(plasmo) <- spp
A third dimenions of the data is added by using size = genes.
ggscatter(y = "mb",
x = "GC",
color = "spp",
size = "genes",
data = plasmo,
label = "spp",
xlim = c(18,42.5))
The firs step in cluster analysis and many related exploratory methods is to select a measure of distance. There are many ways to calculate distances, and the proper choice depends on the type of data and the analysis.
When data are numeric (measured with a scale, calipers, rulers and/or can take on decimal values, and/or can be described by Bell-curve/Normal distribituion) Euclidean distances may be appropriate.
When data are binary (0 or 1, present or absent), categorical (A, T, C, or G) other methods will almost definitely be better.
When data are counts it will depend. When counts are relatively low and can contain zero (number of offspring) you may have to be careful. When counts are large (number of RNA transcripts) it may be just fine.
Each field (ecology, phylogenetics, genomics) has best practices about selecting the proper distance.
It can be tempting to try out different distance methods to see which one gives you the results you like. This is a form of p-hacking: manipulating your data, analysis, etc to get the statistical result or graph you want.
https://bitesizebio.com/31497/guilty-p-hacking/#:~:text=The%20term%20p%2Dhacking%2C%20coined,decisions%20made%20by%20the%20investigator.
Euclidean distance is what we learned in high-school geometry
a^2 + b^2 = c^2
We’ll use this notation below: h^2 + v^2 = d^2
Euclidean distance extend intuitively to three dimensions. It can also be extended to any number of dimensions with no modification to the the math, though its much harder to visualize.
First, calculate the horizontal distance (x-axis; GC)
The data are
plasmo[c(1,2),]
## spp mb GC genes
## vivax vivax 26.8 42.3 5433
## knowlesi knowlesi 23.5 37.5 5188
GC content is on the x-axis. Vivax has a GC content of 42.3. knowlesi has a GC content of 37.5 The horizontal distance is
42.3-37.5
## [1] 4.8
Let’s call this distance h
h <- 42.3-37.5
The horizontal distance between the points can be visualized like this
ggscatter(y = "mb",
x = "GC",
color = "spp",
size = "genes",
data = plasmo,
label = "spp",
xlim = c(18, 42.5),
ylim = c(23, 27)) +
# vivax vs knowlesi
# horiz line
geom_segment(aes(y = 26.8, x = 37.5,
yend = 26.8, xend = 42.3,
colour = "segment")) +
annotate("text",
x = mean(c(42.3,37.5)),
y = 27.1,
label = h)
Genome size in megabases is on the y-axis. Vivax has a genome size of mb = 26.8 and knowlesi is mb = 23.5. The vertical distance is therefore
26.8-23.5
## [1] 3.3
Let’s call this distance v
v <- 26.8-23.5
We can visualize the vertical and horizontal distances like this
ggscatter(y = "mb",
x = "GC",
color = "spp",
size = "genes",
data = plasmo,
label = "spp",
xlim = c(18, 42.5),
ylim = c(23, 27)) +
# vivax vs knowlesi
# horiz line
geom_segment(aes(y = 26.8, x = 37.5,
yend = 26.8, xend = 42.3,
colour = "segment")) +
annotate("text",
x = mean(c(42.3,37.5)),
y = 27.1,
label = h) +
# vert line
geom_segment(aes(y = 23.5, x = 37.5,
yend = 26.8, xend = 37.5,
colour = "segment")) +
annotate("text",
y = mean(c(26.8,23.5)),
x = 35.1,
label = v)
The Euclidean distance is
d2 <- h^2 + v^2
d <- sqrt(d2)
d
## [1] 5.824946
Round it off and call it “d.vk” for “vivax to knowlesi”.
d.vk <- round(d, 3)
We can visualize this
ggscatter(y = "mb",
x = "GC",
color = "spp",
size = "genes",
data = plasmo,
label = "spp",
xlim = c(18, 42.5),
ylim = c(23, 27)) +
# vivax vs knowlesi
# horiz line
geom_segment(aes(y = 26.8, x = 37.5,
yend = 26.8, xend = 42.3,
colour = "segment")) +
annotate("text",
x = mean(c(42.3,37.5)),
y = 27.1,
label = h) +
# vert line
geom_segment(aes(y = 23.5, x = 37.5,
yend = 26.8, xend = 37.5,
colour = "segment")) +
annotate("text",
y = mean(c(26.8,23.5)),
x = 35.1,
label = v) +
# diag line
geom_segment(aes(y = 23.5, x = 37.5,
yend = 26.8, xend = 42.3,
colour = "segment"))+
annotate("text",
y = mean(c(26.8,23.5)),
x = 40.95,
label = d.vk)
If you want to be fancy, you can calculate the Euclidean distance in one step like this
sqrt((42.3-37.5)^2 + (26.8-23.5)^2 )
## [1] 5.824946
These distance calcualtions are carried out pairwise for each pair of species. For example, the data from falciparum and yeoli are
plasmo[c(3,4),]
## spp mb GC genes
## yoelii yoelii 23.1 22.6 5875
## falciparum falciparum 23.3 19.4 5403
You can work the math out step by step yourself In one line its this, saved to d.fy for distance to falciparum to yoeli
d.fy <- sqrt((23.1-23.3)^2 + (22.6 - 19.4)^2)
We can visualized the horizontal, vertical, and diagonal (Euclidean) distances like this
ggscatter(y = "mb",
x = "GC",
color = "spp",
size = "genes",
data = plasmo,
label = "spp") +
# falciparum vs yoelii
# horiz line
geom_segment(aes(y = 23.1, x = 22.6,
yend = 23.1, xend = 19.4,
colour = "segment")) +
# vert line
geom_segment(aes(y = 23.1, x = 19.4,
yend = 23.3, xend = 19.4,
colour = "segment")) +
# diag line
geom_segment(aes(y = 23.1, x = 22.6,
yend = 23.3, xend = 19.4,
colour = "segment"))
This would be repeated for all 6 possible comparisons. For knowlesi to yeolii it is
# knowlesi vs yoelii
plasmo[c(2,3),]
## spp mb GC genes
## knowlesi knowlesi 23.5 37.5 5188
## yoelii yoelii 23.1 22.6 5875
d.ky <- sqrt((23.5-23.1)^2 + (37.5 - 22.6)^2)
Compare my math to the dist() function. See the help file for other distance metrics.
d.vk
## [1] 5.825
d.fy
## [1] 3.206244
d.ky
## [1] 14.90537
dist(plasmo[,c("GC","mb")],method = "euclidean")
## vivax knowlesi yoelii
## knowlesi 5.824946
## yoelii 20.044451 14.905368
## falciparum 23.165923 18.101105 3.206244
Save the distance matrix
d_mat <- dist(plasmo[,c("GC","mb")],method = "euclidean")
(THIS INFORMATION WAS NOT COVERED IN FALL 2021)
The first cluster is formed by combining the two closest items. In this case we are considering taxa, though the data we are working with would never be used to build an acutal phylogenetic tree.
The minimum distance is yeolii to falciparum
d.fy
## [1] 3.206244
The minimum function works on a matrix like this
min(d_mat)
## [1] 3.206244
The distance between the cluster is just this distance. Cluster analysis typically assumes a perfectly bifurcating (branching) tree. This looks like this
d_mat_temp <- dist(plasmo[c(3,4),c("GC","mb")],method = "euclidean")
clust_temp <- as.dendrogram(hclust(d_mat_temp))
plot(clust_temp,type = "triangle", axes = F)