knitr::opts_chunk$set(echo = TRUE)

Definition of a distance

Exercise 1

Euclidean distance

\[d(\mathbf{x},\mathbf{y})=\sqrt{\sum_{i=1}^n (x_i-y_i)^2}.\]

* A1 and A2 are obvious. The proof of A3 is provided below.

Manhattan distance

\[d(\mathbf{x},\mathbf{y}) =\sum_{i=1}^n |x_i-y_i|. \]

Manhattan distance vs Euclidean distance Graph

x = c(0, 0)
y = c(6, 6)
dist(rbind(x, y), method = "euclidean")  # Euclidean distance between x and y
6*sqrt(2)                                # closed-form check
dist(rbind(x, y), method = "manhattan")  # Manhattan distance: 6 + 6 = 12

Canberra distance

\[d(\mathbf{x},\mathbf{y}) =\sum_{i=1}^n \frac{|x_i-y_i|}{|x_i|+|y_i|}.\]

x = c(0, 0)
y = c(6,6)
dist(rbind(x, y), method = "canberra")
6/6+6/6

Exercise 2

Minkowski distance

\[ d(\mathbf{x},\mathbf{y})= \left[\sum_{i=1}^n |x_i-y_i|^{p}\right]^{1/p}. \]

* For \(p=1\), we get the Manhattan distance.
* For \(p=2\), we get the Euclidean distance.
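The two special cases can be checked numerically. The sketch below is only an illustration: it reuses the points from the earlier example and compares `dist()` with `method = "minkowski"` against the Manhattan and Euclidean methods. The variable names (`pts`, `d_p1`, and so on) are chosen here and are not part of the exercise.

```r
x <- c(0, 0)
y <- c(6, 6)
pts <- rbind(x, y)

# Minkowski distance of order p = 1 and p = 2
d_p1 <- dist(pts, method = "minkowski", p = 1)
d_p2 <- dist(pts, method = "minkowski", p = 2)

# Reference distances
d_man <- dist(pts, method = "manhattan")
d_euc <- dist(pts, method = "euclidean")

# Both comparisons should return TRUE
all.equal(as.numeric(d_p1), as.numeric(d_man))
all.equal(as.numeric(d_p2), as.numeric(d_euc))
```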

Chebyshev distance

\[ d(\mathbf{x},\mathbf{y})=\max_{i=1,\cdots,n}(|x_i-y_i|)=\lim_{p\rightarrow\infty} \left[\sum_{i=1}^n |x_i-y_i|^{p}\right]^{1/p}. \]
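To see the limit at work, one can evaluate the Minkowski distance for increasing values of \(p\) and compare it with the maximum coordinate difference. This is a minimal sketch: the vectors and the helper `minkowski()` are hypothetical and introduced only for illustration, while `method = "maximum"` is the name `dist()` uses for the Chebyshev distance.

```r
x <- c(1, 5, 2)
y <- c(4, 1, 9)

# Hypothetical helper: Minkowski distance of order p
minkowski <- function(a, b, p) sum(abs(a - b)^p)^(1 / p)

# Chebyshev distance: largest absolute coordinate difference (here 7)
chebyshev <- max(abs(x - y))
chebyshev

# The Minkowski distance decreases towards the Chebyshev distance as p grows
sapply(c(1, 2, 10, 50), function(p) minkowski(x, y, p))

# Built-in equivalent
dist(rbind(x, y), method = "maximum")
```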

Minkowski inequality

\[ \left[\sum_{i=1}^n (a_i+b_i)^{p}\right]^{1/p}\leq \left[\sum_{i=1}^n a_i^{p}\right]^{1/p} + \left[\sum_{i=1}^n b_i^{p}\right]^{1/p}. \]

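\[ \sum_{i=1}^n|x_i-z_i|^{p}= \sum_{i=1}^n|(x_i-y_i)+(y_i-z_i)|^{p}. \]

* Since for any reals \(x,y\), we have \(|x+y|\leq |x|+|y|\), and using the fact that \(x^p\) is increasing in \(x>0\), we obtain:

\[ \sum_{i=1}^n|x_i-z_i|^{p}\leq \sum_{i=1}^n(|x_i-y_i|+|y_i-z_i|)^{p}. \]

* Taking the power \(1/p\) of both sides and applying the Minkowski inequality with \(a_i=|x_i-y_i|\) and \(b_i=|y_i-z_i|\) gives the triangle inequality \(d(\mathbf{x},\mathbf{z})\leq d(\mathbf{x},\mathbf{y})+d(\mathbf{y},\mathbf{z})\).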
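The triangle inequality for the Minkowski distance can also be checked numerically. The following is a minimal sketch, not a proof: the helper `minkowski()` is hypothetical, and the random vectors and the values of \(p\) are arbitrary choices.

```r
# Hypothetical helper: Minkowski distance of order p
minkowski <- function(a, b, p) sum(abs(a - b)^p)^(1 / p)

set.seed(42)
x <- rnorm(4)
y <- rnorm(4)
z <- rnorm(4)

# Compare d(x, z) with d(x, y) + d(y, z) for several orders p >= 1
for (p in c(1, 1.5, 2, 3)) {
  lhs <- minkowski(x, z, p)
  rhs <- minkowski(x, y, p) + minkowski(y, z, p)
  cat("p =", p, ": d(x,z) =", round(lhs, 3), " d(x,y) + d(y,z) =", round(rhs, 3), "\n")
}
```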

Hölder inequality

\[ \sum_{i=1}^n a_ib_i\leq \left[\sum_{i=1}^n a_i^{p}\right]^{1/p} \left[\sum_{i=1}^n b_i^{q}\right]^{1/q}, \]

where \(p,q>1\) and \(\frac{1}{p}+\frac{1}{q}=1\).

* The proof of the Hölder inequality relies on the Young inequality.
* For any \(a,b>0\), we have

\[ ab\leq \frac{a^p}{p}+\frac{b^q}{q}, \]

with equality occurring iff \(a^p=b^q\).

* To prove the Young inequality, one can use the (strict) convexity of the exponential function.
* For any reals \(x,y\),

\[ e^{\frac{x}{p}+\frac{y}{q} }\leq \frac{e^{x}}{p}+\frac{e^{y}}{q}. \]

* We then set \(x=p\ln a\) and \(y=q\ln b\) to get the Young inequality.
* A good reference on inequalities is: Z. Cvetkovski, Inequalities: Theorems, Techniques and Selected Problems, 2012, Springer Science & Business Media.

Cauchy-Schwarz inequality

* Note that the triangle inequality for the Minkowski distance implies

\[ \left[\sum_{i=1}^n |x_i|^{p}\right]^{1/p}\leq \sum_{i=1}^n |x_i|. \]

* Note that for \(p=2\), we have \(q=2\). The Hölder inequality implies for that special case

\[ \sum_{i=1}^n|x_iy_i|\leq\sqrt{\sum_{i=1}^n x_i^2}\sqrt{\sum_{i=1}^n y_i^2}. \]

* Since the LHS of the above inequality is greater than or equal to \(|\sum_{i=1}^nx_iy_i|\), we get the Cauchy-Schwarz inequality

\[ \left|\sum_{i=1}^nx_iy_i\right|\leq\sqrt{\sum_{i=1}^n x_i^2}\sqrt{\sum_{i=1}^n y_i^2}. \]

* Using the dot product (also called scalar product) notation \(\mathbf{x\cdot y}=\sum_{i=1}^nx_iy_i\), and the norm notation \(\|\mathbf{\cdot}\|_2\), the Cauchy-Schwarz inequality is:

\[|\mathbf{x\cdot y} | \leq \|\mathbf{x}\|_2 \| \mathbf{y}\|_2.\]
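Both inequalities are easy to sanity-check on numbers. The sketch below assumes the usual conjugate-exponent condition \(1/p+1/q=1\) for the Hölder inequality. The value \(p=3\) and the random positive vectors `a` and `b` are arbitrary, and the Cauchy-Schwarz check reuses the vectors from the correlation examples.

```r
set.seed(1)
a <- runif(5)   # arbitrary positive values
b <- runif(5)

p <- 3
q <- p / (p - 1)   # conjugate exponent, so that 1/p + 1/q = 1

# Hölder inequality: sum(a_i b_i) <= ||a||_p * ||b||_q
holder_lhs <- sum(a * b)
holder_rhs <- sum(a^p)^(1 / p) * sum(b^q)^(1 / q)
holder_lhs <= holder_rhs

# Cauchy-Schwarz inequality: |x . y| <= ||x||_2 * ||y||_2
x <- c(3, 1, 4, 15, 92)
y <- c(30, 2, 9, 20, 48)
cs_lhs <- abs(sum(x * y))
cs_rhs <- sqrt(sum(x^2)) * sqrt(sum(y^2))
cs_lhs <= cs_rhs
```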

Pearson correlation distance

Cosine correlation distance

Spearman correlation distance

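x = c(3, 1, 4, 15, 92)
xr = rank(x)   # ranks of x
xr
y = c(30, 2, 9, 20, 48)
yr = rank(y)   # ranks of y
d = xr - yr    # rank differences
d
cor(xr, yr)    # Pearson correlation of the ranks
1 - 6*sum(d^2)/(5*(5^2 - 1))   # Spearman formula with n = 5 (valid when there are no ties)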
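The rank-based calculation can be cross-checked against the built-in Spearman option of `cor()`. This is a small sketch. Taking `1 - rho` as the associated distance is one common convention and is an assumption here, since the text does not state the definition.

```r
x <- c(3, 1, 4, 15, 92)
y <- c(30, 2, 9, 20, 48)

# Spearman correlation computed directly by cor()
rho <- cor(x, y, method = "spearman")
rho

# Assumed convention for the correlation distance: 1 - rho
d_spearman <- 1 - rho
d_spearman
```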

Kendall tau distance

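x = c(3, 1, 4, 15, 92)
y = c(30, 2, 9, 20, 48)
n = length(x)
tau = 0
# Sum sign(x_j - x_i) * sign(y_j - y_i) over all ordered pairs (i, j)
for (i in 1:n) {
  tau = tau + sum(sign(x - x[i]) * sign(y - y[i]))
}
tau = tau/(n*(n - 1))   # normalize by the number of ordered pairs
tau
cor(x, y, method = "kendall")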

Standardization of variables

x=c(3, 1, 4, 15, 92)
y=c(30,2 , 9, 20, 48)
(x-mean(x))/sd(x)
scale(x)
(y-mean(y))/sd(y)
scale(y)
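As a quick check, a standardized variable should have mean 0 and standard deviation 1, and `scale()` should agree with the manual formula. A minimal sketch, where the name `z` is introduced only for illustration:

```r
x <- c(3, 1, 4, 15, 92)

# scale() returns a one-column matrix; convert it to a plain vector
z <- as.numeric(scale(x))

mean(z)   # numerically zero
sd(z)     # equal to 1

# Same result as the manual standardization
all.equal(z, (x - mean(x)) / sd(x))
```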

Distance matrix computation

if (!requireNamespace("FactoMineR", quietly = TRUE)) install.packages("FactoMineR") # Install only if missing
library("FactoMineR")
data("USArrests") # Loading
head(USArrests, 3) # Print the first 3 rows
set.seed(123)
ss <- sample(1:50, 15) # Take 15 random rows
df <- USArrests[ss, ] # Subset the 15 rows
df.scaled <- scale(df) # Standardize the variables
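A natural next step is to compute the distance matrix of the standardized data with `dist()`. The sketch below is one possible continuation and not necessarily the one intended here: it rebuilds the scaled subset so that it runs on its own, and the name `dist.eucl` is hypothetical.

```r
data("USArrests")
set.seed(123)
ss <- sample(1:50, 15)                # Take 15 random rows
df.scaled <- scale(USArrests[ss, ])   # Standardize the variables

# Euclidean distances between the 15 standardized observations
dist.eucl <- dist(df.scaled, method = "euclidean")

# Show part of the symmetric 15 x 15 distance matrix
round(as.matrix(dist.eucl)[1:3, 1:3], 2)
```

Other methods accepted by `dist()`, such as `"manhattan"` or `"canberra"`, can be substituted in the `method` argument.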
