The Euclidean distance between two points \(i\) and \(j\), each described by \(m\) variables, can be
written in scalar notation as follows:
\[ d_{ij} = \sqrt{\sum_{k=1}^m (x_{ik}-x_{jk})^2} \]
Let’s take a table containing 4 rows (observations) and 4 columns (variables):
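# Each c(...) is one observation; data.frame() stores them as columns
df = data.frame(c(1,2,3,4),c(3,4,5,6),c(5,6,7,8),c(7,8,9,10))
# Transpose so that the observations become rows
df2 = t(df)
row.names(df2) = c('a','b','c','d')
colnames(df2) = c('x1','x2','x3','x4')
# Data-frame copy of the matrix, used for display further down
df3 = data.frame(df2)
#DT::datatable(df2, option=list(dom='t'))
knitr::kable(df2)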
|   | x1 | x2 | x3 | x4 |
|---|---|---|---|---|
| a | 1 | 2 | 3 | 4 |
| b | 3 | 4 | 5 | 6 |
| c | 5 | 6 | 7 | 8 |
| d | 7 | 8 | 9 | 10 |
For example, isolate the first two rows, a and b, and calculate the difference between them for each variable:
DT::datatable(df3[1:2,], options = list(dom='t'))
In this case the differences are all equal to 2, so
the sum of their squares is
\[
2^2+2^2+2^2+2^2 = 16
\]
The distance is therefore \(d = \sqrt{16} = 4\).
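The same arithmetic can be reproduced step by step in R. This is only a minimal sketch: it assumes rows a and b are typed in directly as numeric vectors, and the names `row_a`, `row_b`, `sq_diff`, `sum_sq` and `d_ab` are illustrative rather than part of the code above.

```r
# Rows a and b of the table, entered directly (illustrative names)
row_a <- c(1, 2, 3, 4)
row_b <- c(3, 4, 5, 6)

# Squared difference for each of the four variables
sq_diff <- (row_a - row_b)^2

# Sum of the squares, then the square root gives the distance
sum_sq <- sum(sq_diff)
d_ab <- sqrt(sum_sq)

sum_sq  # 16
d_ab    # 4
```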
Similarly, we can calculate the distance between two vectors using the
scalar product:
\[
d_{ij} = \sqrt{{\bf (x_i - x_j)\cdot (x_i - x_j)}^T}
\]
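To see why this agrees with the scalar formula, it may help to expand the product, treating \(\bf x_i\) and \(\bf x_j\) as row vectors with \(m\) components:
\[
{\bf (x_i - x_j)\cdot (x_i - x_j)}^T = \sum_{k=1}^m (x_{ik}-x_{jk})^2
\]
The square root of this product is therefore the same \(d_{ij}\) as in the scalar notation.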
From the table above we again isolate rows a and b, this time
treating them as two vectors \(\bf a\) and \(\bf b\) in a space of
\(m=4\) dimensions:
a = as.numeric(df2[1,])
b = as.numeric(df2[2,])
a
[1] 1 2 3 4
b
[1] 3 4 5 6
z= b-a
Let’s call \(\bf z\) the vector of component-wise differences between \(\bf a\) and \(\bf b\).
z
[1] 2 2 2 2
Now take the dot product of this vector with its transpose: \({\bf z \cdot z}^T\)
z%*%z
[,1]
[1,] 16
I get, as before, the value \(16\), whose square root is exactly \(4\).
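As a cross-check, the result can be compared with base R's `dist()` function, which computes Euclidean distances between the rows of a matrix by default. The sketch below redefines the two vectors so that it runs on its own; `d_vec` and `d_builtin` are illustrative names.

```r
# Same two vectors as in the text, redefined so the snippet is self-contained
a <- c(1, 2, 3, 4)
b <- c(3, 4, 5, 6)
z <- b - a

# z %*% z returns a 1x1 matrix; drop() reduces it to a plain number
d_vec <- sqrt(drop(z %*% z))

# Built-in check: dist() works on the rows of a matrix (Euclidean by default)
d_builtin <- as.numeric(dist(rbind(a, b)))

c(d_vec, d_builtin)  # both equal 4
```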
The Manhattan distance between two vectors, a and b, for \(m\) variables is calculated as:
\[
d_{ab} = \sum_{i=1}^m |a_{i}-b_{i}|
\]
manhattan_dist <- function(a, b){
  # Absolute difference for each variable
  dist <- abs(a-b)
  # Sum of the absolute differences
  dist <- sum(dist)
  return(dist)
}
manhattan_dist(a,b)
[1] 8
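Written out for these two vectors, the sum is \(|1-3|+|2-4|+|3-5|+|4-6| = 2+2+2+2 = 8\). The value can also be checked against base R's `dist()` with `method = "manhattan"`. The snippet below is a small self-contained sketch, with `d_manual` and `d_builtin` as illustrative names.

```r
# Same two vectors as in the text, redefined so the snippet is self-contained
a <- c(1, 2, 3, 4)
b <- c(3, 4, 5, 6)

# Direct computation: sum of the absolute component-wise differences
d_manual <- sum(abs(a - b))

# Built-in check: dist() on the rows of a matrix, Manhattan method
d_builtin <- as.numeric(dist(rbind(a, b), method = "manhattan"))

c(d_manual, d_builtin)  # both equal 8
```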