Cluster Analysis

Distance

A distance function, or metric, on the space \(\mathbb{R}^n,\:n\geq 1\), is a function \(d:\mathbb{R}^n\times\mathbb{R}^n\rightarrow \mathbb{R}\) that satisfies the following axioms.

Axioms

P1. \(d(\mathbf{x},\mathbf{y})= 0\iff \mathbf{x}=\mathbf{y}\) (identity of indiscernibles);

P2. \(d(\mathbf{x},\mathbf{y})= d(\mathbf{y},\mathbf{x})\) (symmetry);

P3. \(d(\mathbf{x},\mathbf{y})+d(\mathbf{y},\mathbf{z})\geq d(\mathbf{x},\mathbf{z})\) (triangle inequality),

where \(\mathbf{x}=(x_1,\cdots,x_n)\), \(\mathbf{y}=(y_1,\cdots,y_n)\) and \(\mathbf{z}=(z_1,\cdots,z_n)\) are any three vectors of \(\mathbb{R}^n\).

Exercise

Prove that the three axioms imply the non-negativity condition: \(d(\mathbf{x},\mathbf{y})\geq 0\).

Distance/Dissimilarity

The term dissimilarity is used instead of distance when not all of the mathematical axioms for distances, that is P1-P3, hold.

Euclidean distance:

\[d(\mathbf{x},\mathbf{y})=\sqrt{\sum_{i=1}^n (x_i-y_i)^2}.\]

Manhattan distance:

\[d(\mathbf{x},\mathbf{y}) =\sum_{i=1}^n |x_i-y_i|.\]
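For concreteness, here is a minimal R sketch, with hypothetical vectors, computing both distances by hand and with the built-in dist() function:

```r
# Hypothetical example vectors
x <- c(1, 2, 3)
y <- c(4, 0, 3)

# Euclidean distance: square root of the sum of squared differences
sqrt(sum((x - y)^2))
## [1] 3.605551

# Manhattan distance: sum of absolute differences
sum(abs(x - y))
## [1] 5

# The same values via the built-in dist() function
dist(rbind(x, y), method = "euclidean")
dist(rbind(x, y), method = "manhattan")
```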

There is also a weighted version of the Manhattan distance, called the Canberra distance.

Canberra distance

\[d(\mathbf{x},\mathbf{y}) =\sum_{i=1}^n \frac{|x_i-y_i|}{|x_i|+|y_i|}.\]

Note that the term \(|x_i-y_i|/(|x_i|+|y_i|)\) is replaced by zero when both \(x_i\) and \(y_i\) are zero, and that the Canberra distance is especially sensitive to small changes near zero.
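The zero-term convention can be made explicit in code; below is a small R sketch (the function name canberra and the example vectors are hypothetical):

```r
# Canberra distance with the zero-term convention described above:
# a term is set to 0 whenever both coordinates are 0
canberra <- function(x, y) {
  num <- abs(x - y)
  den <- abs(x) + abs(y)
  sum(ifelse(den == 0, 0, num / den))
}

canberra(c(0, 1, 2), c(0, 2, 4))
## [1] 0.6666667
```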

Exercise

Prove that the Canberra distance is a true distance.

Minkowski distance

Both the Euclidean and Manhattan distances are special cases of the Minkowski distance, which is now defined.

Minkowski distance (cont.)

\[d(\mathbf{x},\mathbf{y}) =\left[\sum_{i=1}^n |x_i-y_i|^{p}\right]^{1/p},\: p\geq 1. \]

Minkowski norm

Let us also define: \[\|\mathbf{x}\|_p\equiv\left[\sum_{i=1}^n |x_i|^{p}\right]^{1/p},\: p\geq 1,\]

where \(\|\mathbf{\cdot}\|_p\) is known as the p-norm or Minkowski norm.
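As an illustration, here is a minimal R sketch (the helper name pnorm_p and the vectors are hypothetical) computing a p-norm and the corresponding Minkowski distance, which is also available through dist():

```r
# p-norm of a vector (p >= 1); pnorm_p is a hypothetical helper name
pnorm_p <- function(x, p) sum(abs(x)^p)^(1 / p)

x <- c(1, -2, 3)
y <- c(0,  1, 1)
p <- 3

# Minkowski distance of order p as the p-norm of the difference
pnorm_p(x - y, p)

# The same value via dist(); p = 2 recovers the Euclidean distance
dist(rbind(x, y), method = "minkowski", p = p)
```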

Minkowski distance (cont.)

Note that the Minkowski distance can be defined as follows: \[ d(\mathbf{x},\mathbf{y})=\|\mathbf{x}-\mathbf{y}\|_p,\:p\geq 1. \]

Minkowski inequality

The proof of the triangle inequality P3 is based on the Minkowski inequality, which states that for any nonnegative real numbers \(a_1,\cdots,a_n\); \(b_1,\cdots,b_n\), we have:

\[ \left[\sum_{i=1}^n (a_i+b_i)^{p}\right]^{1/p}\leq \left[\sum_{i=1}^n a_i^{p}\right]^{1/p} + \left[\sum_{i=1}^n b_i^{p}\right]^{1/p},\:p\geq 1. \]

Minkowski inequality proof

To prove that the Minkowski distance satisfies P3, notice that

\[ \sum_{i=1}^n|x_i-z_i|^{p}= \sum_{i=1}^n|(x_i-y_i)+(y_i-z_i)|^{p}. \] Noting that \(|x+y|\leq |x|+|y|\) for any reals \(x,y\), and that \(a\mapsto a^p\) is increasing for \(a\geq 0\), we obtain

\[ \sum_{i=1}^n|x_i-z_i|^{p}\leq \sum_{i=1}^n(|x_i-y_i|+|y_i-z_i|)^{p}. \] Taking the \(1/p\) power of both sides and applying the Minkowski inequality with \(a_i=|x_i-y_i|\) and \(b_i=|y_i-z_i|\), \(i=1,\cdots,n\), we get \[ \left(\sum_{i=1}^n|x_i-z_i|^{p}\right)^{1/p}\leq \left(\sum_{i=1}^n |x_i-y_i|^{p}\right)^{1/p}+\left(\sum_{i=1}^n |y_i-z_i|^{p}\right)^{1/p}, \] which is exactly \(d(\mathbf{x},\mathbf{z})\leq d(\mathbf{x},\mathbf{y})+d(\mathbf{y},\mathbf{z})\).

Hölder inequality

The proof of the Minkowski inequality itself requires the Hölder inequality which states that for any nonnegative real numbers \(a_1,\cdots,a_n\); \(b_1,\cdots,b_n\), and any \(p,q>1\) with \(1/p+1/q=1\), we have

\[ \sum_{i=1}^n a_ib_i\leq \left[\sum_{i=1}^n a_i^{p}\right]^{1/p} \left[\sum_{i=1}^n b_i^{q}\right]^{1/q}. \]

Young's inequality

The proof of the Hölder inequality displayed above relies on Young's inequality: for any \(a,b>0\) we have \[ ab\leq \frac{a^p}{p}+\frac{b^q}{q}, \] with equality occurring iff \(a^p=b^q\). To prove Young's inequality, one can use the (strict) convexity of the exponential function, which tells us that for any reals \(x,y\),

\[ e^{\frac{x}{p}+\frac{y}{q} }\leq \frac{e^{x}}{p}+\frac{e^{y}}{q}. \] We then set \(x=p\ln a\) and \(y=q\ln b\), so that \(e^{x/p+y/q}=ab\), \(e^{x}=a^{p}\) and \(e^{y}=b^{q}\), which gives Young's inequality. A good reference on inequalities is: Z. Cvetkovski, Inequalities: Theorems, Techniques and Selected Problems, Springer Science & Business Media, 2012.

Note that the triangle inequality for the Minkowski distance implies

\[ \left[\sum_{i=1}^n |x_i|^{p}\right]^{1/p}\leq \sum_{i=1}^n |x_i| ,\:p\geq 1. \]

Note that for \(p=2\) we have \(q=2\). For this special case, the Hölder inequality gives \[ \sum_{i=1}^n|x_iy_i|\leq\sqrt{\sum_{i=1}^n x_i^2}\sqrt{\sum_{i=1}^n y_i^2}. \]

Since the LHS of the above inequality is at least \(|\sum_{i=1}^nx_iy_i|\), we get the Cauchy-Schwarz inequality \[ \left|\sum_{i=1}^nx_iy_i\right|\leq\sqrt{\sum_{i=1}^n x_i^2}\sqrt{\sum_{i=1}^n y_i^2}. \] Using the dot product notation (also called scalar product notation) \(\mathbf{x\cdot y}=\sum_{i=1}^nx_iy_i\), and the norm notation \(\|\cdot\|_2\), the Cauchy-Schwarz inequality reads:

\[|\mathbf{x\cdot y} | \leq \|\mathbf{x}\|_2 \| \mathbf{y}\|_2. \]
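This inequality is easy to check numerically; here is a small R sketch with hypothetical vectors:

```r
# Numerical check of the Cauchy-Schwarz inequality on hypothetical vectors
x <- c(1, -2, 3, 0.5)
y <- c(2,  1, 0, -1)

abs(sum(x * y)) <= sqrt(sum(x^2)) * sqrt(sum(y^2))
## [1] TRUE
```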

Correlation-based distances

Pearson correlation

The Pearson correlation coefficient is a similarity measure on \(\mathbb{R}^n\) defined by: \[ \rho(\mathbf{x},\mathbf{y})= \frac{\sum_{i=1}^n (x_i-\bar{\mathbf{x}})(y_i-\bar{\mathbf{y}})}{{\sqrt{\sum_{i=1}^n (x_i-\bar{\mathbf{x}})^2\sum_{i=1}^n (y_i-\bar{\mathbf{y}})^2}}}, \] where \(\bar{\mathbf{x}}\) is the mean of the vector \(\mathbf{x}\) defined by: \[\bar{\mathbf{x}}=\frac{1}{n}\sum_{i=1}^n x_i,\] and similarly for \(\bar{\mathbf{y}}\).
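A minimal R sketch (with hypothetical data) comparing the formula above with the built-in cor():

```r
# Pearson correlation: the formula above vs the built-in cor()
x <- c(3, 1, 4, 15, 92)   # hypothetical data
y <- c(2, 7, 1, 8, 3)

sum((x - mean(x)) * (y - mean(y))) /
  sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))
cor(x, y)   # same value
```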

Pearson correlation (cont.)

Note that the Pearson correlation coefficient satisfies P2 and is invariant to positive linear transformations; in particular, \(\rho(\alpha\mathbf{x},\mathbf{y})=\rho(\mathbf{x},\mathbf{y})\) for any \(\alpha>0\).

Pearson distance

The Pearson distance (or correlation distance) is defined by: \[ d(\mathbf{x},\mathbf{y})=1-\rho(\mathbf{x},\mathbf{y}).\] Note that the Pearson distance does not satisfy \(P1\): since \(d(\mathbf{x},\alpha\mathbf{x})=0\) for any \(\alpha>0\), \(d(\mathbf{x},\mathbf{y})=0\) does not imply \(\mathbf{x}=\mathbf{y}\). It does not satisfy the triangle inequality either. However, the symmetry property is fulfilled.
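In clustering practice one often needs a Pearson-distance matrix between all the observations of a data matrix; below is a sketch of how it can be obtained in R (the matrix X and the helper name pearson_dist are hypothetical):

```r
# Pearson distance between two vectors
pearson_dist <- function(x, y) 1 - cor(x, y)

# Pearson-distance matrix between the rows of a (hypothetical) data matrix X;
# cor() works column-wise, hence the transpose
X <- matrix(rnorm(20), nrow = 4)
D <- as.dist(1 - cor(t(X)))
D
```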

Cosine of two vectors

The cosine of the angle \(\theta\) between two vectors \(\mathbf{x}\) and \(\mathbf{y}\) is a measure of similarity given by: \[ \cos(\theta)=\frac{\mathbf{x}\cdot \mathbf{y}}{\|\mathbf{x}\|_2\|\mathbf{y}\|_2}=\frac{\sum_{i=1}^n x_i y_i}{{\sqrt{\sum_{i=1}^n x_i^2\sum_{i=1}^n y_i^2}}}. \]

Cosine of two vectors (cont.)

Note that the cosine of the angle between the two centred vectors \((x_1-\bar{\mathbf{x}},\cdots,x_n-\bar{\mathbf{x}})\) and \((y_1-\bar{\mathbf{y}},\cdots,y_n-\bar{\mathbf{y}})\) coincides with the Pearson correlation coefficient of \(\mathbf{x}\) and \(\mathbf{y}\).
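Both the definition and this remark can be illustrated in R; here is a minimal sketch (the helper name cosine_sim and the data are hypothetical):

```r
# Cosine similarity; cosine_sim is a hypothetical helper name
cosine_sim <- function(x, y) sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))

x <- c(3, 1, 4, 15, 92)   # hypothetical data
y <- c(2, 7, 1, 8, 3)

cosine_sim(x, y)

# The cosine of the centred vectors equals the Pearson correlation
cosine_sim(x - mean(x), y - mean(y))
cor(x, y)
```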

Cosine correlation distance

The cosine correlation distance is defined by: \[ d(\mathbf{x},\mathbf{y})=1-\cos(\theta). \] It shares similar properties with the Pearson correlation distance; likewise, axioms P1 and P3 are not satisfied.

Spearman correlation

To calculate the Spearman rank-order correlation, or Spearman correlation coefficient for short, we need to map each of the vectors separately to ranked data values: \(\mathbf{x}\rightarrow \mathbf{x}^r=(x_1^r,\cdots,x_n^r)\). Here, \(x_i^r\) is the rank of \(x_i\) among the set of values of \(\mathbf{x}\).

Spearman correlation (cont.)

We illustrate this transformation with a simple example. If \(\mathbf{x}=(3, 1, 4, 15, 92)\), then the rank-order vector is \(\mathbf{x}^r=(2,1,3,4,5)\).
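In R this mapping is provided by the base function rank(); for the example above:

```r
# Rank transformation used by the Spearman correlation
rank(c(3, 1, 4, 15, 92))
## [1] 2 1 3 4 5
```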

Spearman correlation (cont.)

The Spearman correlation of two numerical variables \(\mathbf{x}\) and \(\mathbf{y}\) is simply the Pearson correlation of the two corresponding rank-order variables \(\mathbf{x}^r\) and \(\mathbf{y}^r\), i.e. \(\rho(\mathbf{x}^r,\mathbf{y}^r)\). This measure is useful because it is more robust against outliers than the Pearson correlation.

Spearman correlation (cont.)

If all the \(n\) ranks are distinct, it can be computed using the following formula:

\[ \rho(\mathbf{x}^r,\mathbf{y}^r)=1-\frac{6\sum_{i=1}^n d_i^2}{n(n^2-1)}, \] where \(d_i=x_i^r-y_i^r,\:i=1,\cdots,n\).

Spearman distance

The Spearman distance is then defined by: \[ d(\mathbf{x},\mathbf{y})=1-\rho(\mathbf{x^r},\mathbf{y^r}). \] It can easily be shown that it is not a proper distance.

If all the \(n\) ranks are distinct, we get: \[ d(\mathbf{x},\mathbf{y})=\frac{6\sum_{i=1}^n d_i^2}{n(n^2-1)}. \]

Example

As an example, consider data whose vector of rank differences \(d_i=x_i^r-y_i^r\) is printed below. The Spearman correlation is then computed twice, once with the formula above and once directly; both give 0.7:

## [1] -2  0  1  1  0
## [1] 0.7
## [1] 0.7
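The code producing this output is not shown; the following R sketch uses hypothetical vectors (x reuses the earlier example) whose ranks reproduce exactly the printed values:

```r
x <- c(3, 1, 4, 15, 92)   # hypothetical data reusing the earlier example
y <- c(10, 1, 3, 5, 20)   # hypothetical second vector

d <- rank(x) - rank(y)    # differences of ranks
d
## [1] -2  0  1  1  0

n <- length(x)
1 - 6 * sum(d^2) / (n * (n^2 - 1))   # Spearman via the formula
## [1] 0.7

cor(x, y, method = "spearman")       # built-in Spearman correlation
## [1] 0.7
```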

Kendall correlation

The Kendall rank correlation coefficient is calculated from the number of concordant and discordant pairs between the rankings of \(\mathbf{x}\) and the rankings of \(\mathbf{y}\).

There are \({n \choose 2} =n(n-1)/2\) pairs in total. For any pair \(i,j=1,\cdots,n,\:i<j\), the observations \((x_{i},x_{j})\) and \((y_{i},y_{j})\) are said to be concordant if the signs of \(x_{i}-x_{j}\) and \(y_{i}-y_{j}\) are the same. Otherwise they are said to be discordant. Denoting by \(n_c\) the number of concordant pairs and by \(n_d\) the number of discordant pairs, the Kendall \(\tau\) coefficient is defined as: \[\tau =\frac {n_c-n_d}{n(n-1)/2}.\]
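A small R sketch (the helper name kendall_tau and the data are hypothetical) counting concordant and discordant pairs and comparing with the built-in cor():

```r
# Kendall's tau by counting concordant and discordant pairs;
# kendall_tau is a hypothetical helper name
kendall_tau <- function(x, y) {
  n <- length(x)
  nc <- 0
  nd <- 0
  for (i in 1:(n - 1)) {
    for (j in (i + 1):n) {
      s <- sign(x[i] - x[j]) * sign(y[i] - y[j])
      if (s > 0) nc <- nc + 1 else if (s < 0) nd <- nd + 1
    }
  }
  (nc - nd) / (n * (n - 1) / 2)
}

x <- c(3, 1, 4, 15, 92)   # hypothetical data
y <- c(10, 1, 3, 5, 20)

kendall_tau(x, y)
cor(x, y, method = "kendall")   # agrees in the absence of ties
```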