A distance function, or metric, on the space \(\mathbb{R}^n,\:n\geq 1\), is a function \(d:\mathbb{R}^n\times\mathbb{R}^n\rightarrow \mathbb{R}\) that satisfies the following axioms:
P1. \(d(\mathbf{x},\mathbf{y})= 0\iff \mathbf{x}=\mathbf{y}\) (identity of indiscernibles);
P2. \(d(\mathbf{x},\mathbf{y})= d(\mathbf{y},\mathbf{x})\) (symmetry);
P3. \(d(\mathbf{x},\mathbf{y})+d(\mathbf{y},\mathbf{z})\geq d(\mathbf{x},\mathbf{z})\) (triangle inequality),
where \(\mathbf{x}=(x_1,\cdots,x_n)\), \(\mathbf{y}=(y_1,\cdots,y_n)\) and \(\mathbf{z}=(z_1,\cdots,z_n)\) are any three vectors of \(\mathbb{R}^n\).
Exercise 2: Prove that these three axioms imply the non-negativity condition: \(d(\mathbf{x},\mathbf{y})\geq 0\).
We should use the term dissimilarity, instead of distance, when not all of the distance axioms P1-P3 hold.
Let us recall and introduce some useful metrics:
The Euclidean distance: \[d(\mathbf{x},\mathbf{y})=\sqrt{\sum_{i=1}^n (x_i-y_i)^2}.\]
The Manhattan distance: \[d(\mathbf{x},\mathbf{y}) =\sum_{i=1}^n |x_i-y_i|.\]
The Canberra distance is a weighted version of the Manhattan distance, given by:
\[d(\mathbf{x},\mathbf{y}) =\sum_{i=1}^n \frac{|x_i-y_i|}{|x_i|+|y_i|}.\]
Note that the term \(|x_i-y_i|/(|x_i|+|y_i|)\) is taken to be zero when both \(x_i\) and \(y_i\) are zero, and that the Canberra distance is especially sensitive to small changes near zero.
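As a quick illustration (a minimal R sketch; the example vectors are arbitrary and the helper canberra_dist is ours, not a package function), the Euclidean and Manhattan distances are available through the built-in dist() function, and the Canberra distance can be coded directly from the formula above:

# two example vectors of R^5
x <- c(3, 1, 4, 15, 92)
y <- c(30, 2, 9, 20, 48)

# Euclidean and Manhattan distances via dist()
dist(rbind(x, y), method = "euclidean")   # sqrt(sum((x - y)^2))
dist(rbind(x, y), method = "manhattan")   # sum(abs(x - y))

# Canberra distance coded from the formula, with the 0/0 convention above
canberra_dist <- function(x, y) {
  num <- abs(x - y)
  den <- abs(x) + abs(y)
  sum(ifelse(den == 0, 0, num / den))     # terms with 0/0 are set to zero
}
canberra_dist(x, y)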
Exercise 2: Prove that the Canberra distance is a true distance.
Both the Euclidean and Manhattan distances are special cases of the Minkowski distance, which is defined by:
\[d(\mathbf{x},\mathbf{y}) =\left[\sum_{i=1}^n |x_i-y_i|^{p}\right]^{1/p},\: p\geq 1. \]
Let us also define: \[\|\mathbf{x}\|_p\equiv\left[\sum_{i=1}^n |x_i|^{p}\right]^{1/p},\: p\geq 1,\] where \(\|\cdot\|_p\) is known as the \(p\)-norm or Minkowski norm. Note that the Minkowski distance can then be written as: \[ d(\mathbf{x},\mathbf{y})=\|\mathbf{x}-\mathbf{y}\|_p,\:p\geq 1. \] The proof of the triangle inequality P3 is based on the Minkowski inequality, which states that for any nonnegative real numbers \(a_1,\cdots,a_n\); \(b_1,\cdots,b_n\), we have:
\[ \left[\sum_{i=1}^n (a_i+b_i)^{p}\right]^{1/p}\leq \left[\sum_{i=1}^n a_i^{p}\right]^{1/p} + \left[\sum_{i=1}^n b_i^{p}\right]^{1/p},\:p\geq 1. \] To prove that the Minkowski distance satisfies P3, notice that
\[ \sum_{i=1}^n|x_i-z_i|^{p}= \sum_{i=1}^n|(x_i-y_i)+(y_i-z_i)|^{p}. \] Noticing then that for any reals \(x,y\) we have \(|x+y|\leq |x|+|y|\), and using the fact that \(a\mapsto a^p\) is increasing on \(a\geq 0\), we obtain
\[ \sum_{i=1}^n|x_i-z_i|^{p}\leq \sum_{i=1}^n(|x_i-y_i|+|y_i-z_i|)^{p}. \] Taking the \(p\)-th root of both sides and applying the Minkowski inequality to the right-hand side with \(a_i=|x_i-y_i|\) and \(b_i=|y_i-z_i|\), \(i=1,\cdots,n\), we get \[ \left(\sum_{i=1}^n|x_i-z_i|^{p}\right)^{1/p}\leq \left(\sum_{i=1}^n |x_i-y_i|^{p}\right)^{1/p}+\left(\sum_{i=1}^n |y_i-z_i|^{p}\right)^{1/p}, \] which is exactly P3.
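As a sanity check (a minimal R sketch; the helper minkowski_dist and the vector z are ours, chosen arbitrarily), the triangle inequality can be verified numerically for a few values of \(p\):

# Minkowski distance coded from the definition
minkowski_dist <- function(x, y, p) sum(abs(x - y)^p)^(1 / p)

x <- c(3, 1, 4, 15, 92)
y <- c(30, 2, 9, 20, 48)
z <- c(-5, 7, 0, 11, 60)

# d(x,z) <= d(x,y) + d(y,z) should hold for every p >= 1
for (p in c(1, 1.5, 2, 3)) {
  lhs <- minkowski_dist(x, z, p)
  rhs <- minkowski_dist(x, y, p) + minkowski_dist(y, z, p)
  cat("p =", p, ": triangle inequality holds:", lhs <= rhs, "\n")
}

# the same distance is also built into dist()
dist(rbind(x, y), method = "minkowski", p = 3)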
The proof of the Minkowski inequality itself requires the Hölder inequality which states that for any nonnegative real numbers \(a_1,\cdots,a_n\); \(b_1,\cdots,b_n\), and any \(p,q>1\) with \(1/p+1/q=1\), we have
\[ \sum_{i=1}^n a_ib_i\leq \left[\sum_{i=1}^n a_i^{p}\right]^{1/p} \left[\sum_{i=1}^n b_i^{q}\right]^{1/q}. \] The proof of the above displayed Hölder inequality relies on Young's inequality: for any \(a,b>0\) we have \[ ab\leq \frac{a^p}{p}+\frac{b^q}{q}, \] with equality occurring iff \(a^p=b^q\). To prove Young's inequality, one can use the (strict) convexity of the exponential function, which tells us that for any reals \(x,y\),
\[ e^{\frac{x}{p}+\frac{y}{q} }\leq \frac{e^{x}}{p}+\frac{e^{y}}{q}. \] We then set \(x=p\ln a\) and \(y=q\ln b\) to get Young's inequality. A good reference on inequalities is: Z. Cvetkovski, Inequalities: Theorems, Techniques and Selected Problems, Springer Science & Business Media, 2012.
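A small numerical illustration of Young's and Hölder's inequalities (a minimal R sketch; the values of a, b and p are arbitrary):

# Young's inequality: a*b <= a^p/p + b^q/q, with 1/p + 1/q = 1
p <- 3; q <- p / (p - 1)
a <- 2.5; b <- 0.7
a * b <= a^p / p + b^q / q                         # TRUE

# Hoelder's inequality for nonnegative vectors
a <- c(1, 2, 0.5, 4); b <- c(3, 0.2, 1, 2)
sum(a * b) <= sum(a^p)^(1 / p) * sum(b^q)^(1 / q)  # TRUE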
Note that the triangle inequality for the Minkowski distance implies
\[ \left[\sum_{i=1}^n |x_i|^{p}\right]^{1/p}\leq \sum_{i=1}^n |x_i|,\:p\geq 1, \] that is, \(\|\mathbf{x}\|_p\leq\|\mathbf{x}\|_1\). Indeed, writing \(\mathbf{x}=\sum_{i=1}^n x_i\mathbf{e}_i\) along the canonical basis vectors \(\mathbf{e}_i\) and applying P3 repeatedly gives \(\|\mathbf{x}\|_p\leq\sum_{i=1}^n\|x_i\mathbf{e}_i\|_p=\sum_{i=1}^n|x_i|\).
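A quick numerical check of this norm comparison (a minimal R sketch with an arbitrary vector):

x <- c(3, -1, 4, -15, 92)
p <- 3
sum(abs(x)^p)^(1 / p) <= sum(abs(x))   # TRUE: the p-norm is bounded by the 1-norm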
Note that for \(p=2\) we have \(q=2\), and in that special case the Hölder inequality gives \[ \sum_{i=1}^n|x_iy_i|\leq\sqrt{\sum_{i=1}^n x_i^2}\sqrt{\sum_{i=1}^n y_i^2}. \]
Since the LHS of the above inequality is greater than or equal to \(|\sum_{i=1}^nx_iy_i|\), we get the Cauchy-Schwarz inequality \[ \left|\sum_{i=1}^nx_iy_i\right|\leq\sqrt{\sum_{i=1}^n x_i^2}\sqrt{\sum_{i=1}^n y_i^2}. \] Using the dot product notation (also called scalar product notation) \(\mathbf{x\cdot y}=\sum_{i=1}^nx_iy_i\), and the norm notation \(\|\cdot\|_2\), the Cauchy-Schwarz inequality reads:
\[|\mathbf{x}\cdot \mathbf{y}| \leq \|\mathbf{x}\|_2 \|\mathbf{y}\|_2. \]
The cosine of the angle \(\theta\) between two vectors \(\mathbf{x}\) and \(\mathbf{y}\) is a measure of similarity given by: \[ \cos(\theta)=\frac{\mathbf{x}\cdot \mathbf{y}}{\|\mathbf{x}\|_2\|\mathbf{y}\|_2}=\frac{\sum_{i=1}^n x_i y_i}{{\sqrt{\sum_{i=1}^n x_i^2\sum_{i=1}^n y_i^2}}}. \] Note that the cosine of the angle between the two centred vectors \((x_1-\bar{x},\cdots,x_n-\bar{x})\) and \((y_1-\bar{y},\cdots,y_n-\bar{y})\), where \(\bar{x}=\frac{1}{n}\sum_{i=1}^n x_i\) and \(\bar{y}=\frac{1}{n}\sum_{i=1}^n y_i\), coincides with the Pearson correlation coefficient of \(\mathbf{x}\) and \(\mathbf{y}\).
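This identity can be checked directly in R (a minimal sketch; the helper cosine_sim and the example vectors are ours):

cosine_sim <- function(x, y) sum(x * y) / sqrt(sum(x^2) * sum(y^2))

x <- c(3, 1, 4, 15, 92)
y <- c(30, 2, 9, 20, 48)

cosine_sim(x, y)                        # cosine of the angle between x and y
cosine_sim(x - mean(x), y - mean(y))    # cosine of the centred vectors ...
cor(x, y)                               # ... equals the Pearson correlation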
The cosine correlation distance is defined by: \[ d(\mathbf{x},\mathbf{y})=1-\cos(\theta). \] It shares similar properties with the Pearson correlation distance; in particular, axioms P1 and P3 are not satisfied.
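For instance, P1 fails because any two proportional vectors are at cosine distance zero (a minimal sketch reusing the cosine_sim helper defined above):

x <- c(3, 1, 4, 15, 92)
1 - cosine_sim(x, 2 * x)   # approximately 0 (up to floating-point error), yet x != 2*x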
To calculate the Spearman rank-order correlation, or Spearman correlation coefficient for short, we need to map separately each of the vectors to ranked data values: \(\mathbf{x}\rightarrow \mathbf{x}^r=(x_1^r,\cdots,x_n^r)\). Here, \(x_i^r\) is the rank of \(x_i\) among the set of values of \(\mathbf{x}\). We illustrate this transformation with a simple example. If \(\mathbf{x}=(3, 1, 4, 15, 92)\), then the rank-order vector is \(\mathbf{x}^r=(2,1,3,4,5)\).
# compute the vector of ranks of x
x <- c(3, 1, 4, 15, 92)
xr <- rank(x)   # gives (2, 1, 3, 4, 5)
The Spearman correlation of two numerical variables \(\mathbf{x}\) and \(\mathbf{y}\) is simply the Pearson correlation of the two corresponding rank-order variables \(\mathbf{x}^r\) and \(\mathbf{y}^r\), i.e. \(\rho(\mathbf{x}^r,\mathbf{y}^r)\). This measure is useful because it is more robust against outliers than the Pearson correlation.
If all the \(n\) ranks are distinct, it can be computed using the following formula:
\[ \rho(\mathbf{x}^r,\mathbf{y}^r)=1-\frac{6\sum_{i=1}^n d_i^2}{n(n^2-1)}, \] where \(d_i=x_i^r-y_i^r\).
The Spearman distance is then defined by: \[ d(\mathbf{x},\mathbf{y})=1-\rho(\mathbf{x^r},\mathbf{y^r}). \]
When all the \(n\) ranks are distinct, this reduces to: \[ d(\mathbf{x},\mathbf{y})=\frac{6\sum_{i=1}^n d_i^2}{n(n^2-1)}. \]
As an example, consider these data
x <- c(3, 1, 4, 15, 92)
xr <- rank(x)    # (2, 1, 3, 4, 5)
y <- c(30, 2, 9, 20, 48)
yr <- rank(y)    # (4, 1, 2, 3, 5)
d <- xr - yr     # rank differences
d
cor(xr, yr)      # Pearson correlation of the ranks = Spearman correlation
1 - 6 * sum(d^2) / (5 * (5^2 - 1))   # same value from the closed-form formula
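The same value can also be obtained directly from the raw data with the built-in cor() function by setting its method argument:

cor(x, y, method = "spearman")   # 0.7, identical to cor(xr, yr)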