Euclidean distances

Preliminaries

Download packages

Only do this if needed

# install.packages("ggplot2")
# install.packages("ggpubr")

Install packages

library(compbio4all)
library(ggplot2)
library(ggpubr)

Vocab

using color to display a third dimenion, distance, Euclidean distance, numeric data, binary data, categorical data, dist()

Note: This is NOT a real analysis

This is a toy analysis to get accross a specific point about distances. The data are real and there are interesting biological questions about these data, but this is not a way explore them.

Load data

From https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2651158/table/T1/

GC = the G+C% of the genome
mb = the size of the genome in megabase
spp = specifif epithet of the organism
chromosomes
genes = predicted number of genes in the genome

# data
GC <- c(42.3,37.5,22.6,19.4)
mb <- c(26.8,23.5,23.1,23.3)
genes <- c(5433,5188,5875,5403)
chromosomes <- c(14,14,14,14)
spp <- c("vivax", "knowlesi","yoelii","falciparum")

# make dataframe
plasmo <- data.frame(spp, mb, GC,genes)

# label rows
row.names(plasmo) <- spp

Plot the data

A third dimenions of the data is added by using size = genes.

ggscatter(y = "mb",
          x = "GC",
          color = "spp",
          size = "genes",
          data = plasmo,
          label = "spp",
          xlim = c(18,42.5))

Distances and distance matrices

The firs step in cluster analysis and many related exploratory methods is to select a measure of distance. There are many ways to calculate distances, and the proper choice depends on the type of data and the analysis.

When data are numeric (measured with a scale, calipers, rulers and/or can take on decimal values, and/or can be described by Bell-curve/Normal distribituion) Euclidean distances may be appropriate.
When data are binary (0 or 1, present or absent), categorical (A, T, C, or G) other methods will almost definitely be better.
When data are counts it will depend. When counts are relatively low and can contain zero (number of offspring) you may have to be careful. When counts are large (number of RNA transcripts) it may be just fine.
Each field (ecology, phylogenetics, genomics) has best practices about selecting the proper distance.
It can be tempting to try out different distance methods to see which one gives you the results you like. This is a form of p-hacking: manipulating your data, analysis, etc to get the statistical result or graph you want.

https://bitesizebio.com/31497/guilty-p-hacking/#:~:text=The%20term%20p%2Dhacking%2C%20coined,decisions%20made%20by%20the%20investigator.

Calculate Euclidean distance

Euclidean distance is what we learned in high-school geometry

a^2 + b^2 = c^2

We’ll use this notation below: h^2 + v^2 = d^2

h = the horizontal distance between 2 points
v = vertical distance between 2 point
d = the diagonal Euclidean distance between 2 points

Euclidean distance extend intuitively to three dimensions. It can also be extended to any number of dimensions with no modification to the the math, though its much harder to visualize.

Distance from vivax to knowlesi

First, calculate the horizontal distance (x-axis; GC)

The data are

plasmo[c(1,2),]

##               spp   mb   GC genes
## vivax       vivax 26.8 42.3  5433
## knowlesi knowlesi 23.5 37.5  5188

GC content is on the x-axis. Vivax has a GC content of 42.3. knowlesi has a GC content of 37.5 The horizontal distance is

42.3-37.5

## [1] 4.8

Let’s call this distance h

h <- 42.3-37.5

The horizontal distance between the points can be visualized like this

ggscatter(y = "mb",
          x = "GC",
          color = "spp",
          size = "genes",
          data = plasmo,
          label = "spp",
          xlim = c(18, 42.5),
          ylim = c(23, 27)) +
    # vivax vs knowlesi
   # horiz line
   geom_segment(aes(y    = 26.8,    x = 37.5,
                   yend = 26.8, xend = 42.3,
                   colour = "segment")) +
  annotate("text",
           x = mean(c(42.3,37.5)), 
           y = 27.1,
           label = h)

Genome size in megabases is on the y-axis. Vivax has a genome size of mb = 26.8 and knowlesi is mb = 23.5. The vertical distance is therefore

26.8-23.5

## [1] 3.3

Let’s call this distance v

v <- 26.8-23.5

We can visualize the vertical and horizontal distances like this

ggscatter(y = "mb",
          x = "GC",
          color = "spp",
          size = "genes",
          data = plasmo,
          label = "spp",
          xlim = c(18, 42.5),
          ylim = c(23, 27)) +
    # vivax vs knowlesi
   # horiz line
   geom_segment(aes(y    = 26.8,    x = 37.5,
                   yend = 26.8, xend = 42.3,
                   colour = "segment")) +
  annotate("text",
           x = mean(c(42.3,37.5)), 
           y = 27.1,
           label = h) +
  # vert line
   geom_segment(aes(y    = 23.5,    x = 37.5,
                   yend = 26.8, xend = 37.5,
                   colour = "segment")) +
    annotate("text",
           y = mean(c(26.8,23.5)), 
           x = 35.1,
           label = v)

The Euclidean distance is

d2 <- h^2 + v^2
d <- sqrt(d2)
d

## [1] 5.824946

Round it off and call it “d.vk” for “vivax to knowlesi”.

d.vk <- round(d, 3)

We can visualize this

ggscatter(y = "mb",
          x = "GC",
          color = "spp",
          size = "genes",
          data = plasmo,
          label = "spp",
          xlim = c(18, 42.5),
          ylim = c(23, 27)) +
    # vivax vs knowlesi
   # horiz line
   geom_segment(aes(y    = 26.8,    x = 37.5,
                   yend = 26.8, xend = 42.3,
                   colour = "segment")) +
  annotate("text",
           x = mean(c(42.3,37.5)), 
           y = 27.1,
           label = h) +
  # vert line
   geom_segment(aes(y    = 23.5,    x = 37.5,
                   yend = 26.8, xend = 37.5,
                   colour = "segment")) +
    annotate("text",
           y = mean(c(26.8,23.5)), 
           x = 35.1,
           label = v) +
    # diag line
  geom_segment(aes(y    = 23.5,    x = 37.5, 
                   yend = 26.8, xend = 42.3, 
                   colour = "segment"))+
    annotate("text",
           y = mean(c(26.8,23.5)), 
           x = 40.95,
           label = d.vk)

If you want to be fancy, you can calculate the Euclidean distance in one step like this

sqrt((42.3-37.5)^2 + (26.8-23.5)^2 )

## [1] 5.824946

Distance from falciparum to yoelli

These distance calcualtions are carried out pairwise for each pair of species. For example, the data from falciparum and yeoli are

plasmo[c(3,4),]

##                   spp   mb   GC genes
## yoelii         yoelii 23.1 22.6  5875
## falciparum falciparum 23.3 19.4  5403

You can work the math out step by step yourself In one line its this, saved to d.fy for distance to falciparum to yoeli

d.fy <- sqrt((23.1-23.3)^2 + (22.6 - 19.4)^2)

We can visualized the horizontal, vertical, and diagonal (Euclidean) distances like this

ggscatter(y = "mb",
          x = "GC",
          color = "spp",
          size = "genes",
          data = plasmo,
          label = "spp") +
  # falciparum vs yoelii
   # horiz line
   geom_segment(aes(y    = 23.1,    x = 22.6, 
                   yend = 23.1, xend = 19.4, 
                   colour = "segment")) +
  # vert line
   geom_segment(aes(y    = 23.1,    x = 19.4, 
                   yend = 23.3, xend = 19.4, 
                   colour = "segment")) +
  # diag line
  geom_segment(aes(y    = 23.1,    x = 22.6, 
                   yend = 23.3, xend = 19.4, 
                   colour = "segment"))

Distance knowlesi to yoelli

This would be repeated for all 6 possible comparisons. For knowlesi to yeolii it is

# knowlesi vs yoelii
plasmo[c(2,3),]

##               spp   mb   GC genes
## knowlesi knowlesi 23.5 37.5  5188
## yoelii     yoelii 23.1 22.6  5875

d.ky <- sqrt((23.5-23.1)^2 + (37.5 - 22.6)^2)

Distances with the dist() function

Compare my math to the dist() function. See the help file for other distance metrics.

d.vk

## [1] 5.825

d.fy

## [1] 3.206244

d.ky

## [1] 14.90537

dist(plasmo[,c("GC","mb")],method = "euclidean")

##                vivax  knowlesi    yoelii
## knowlesi    5.824946                    
## yoelii     20.044451 14.905368          
## falciparum 23.165923 18.101105  3.206244

Save the distance matrix

d_mat <- dist(plasmo[,c("GC","mb")],method = "euclidean")

Forming the first cluster

(THIS INFORMATION WAS NOT COVERED IN FALL 2021)

The first cluster is formed by combining the two closest items. In this case we are considering taxa, though the data we are working with would never be used to build an acutal phylogenetic tree.

The minimum distance is yeolii to falciparum

d.fy

## [1] 3.206244

The minimum function works on a matrix like this

min(d_mat)

## [1] 3.206244

The distance between the cluster is just this distance. Cluster analysis typically assumes a perfectly bifurcating (branching) tree. This looks like this

d_mat_temp <- dist(plasmo[c(3,4),c("GC","mb")],method = "euclidean")
clust_temp <- as.dendrogram(hclust(d_mat_temp))
plot(clust_temp,type = "triangle", axes = F)