Complete all exercises and submit your answers to VtopBeta.

Introduction

Datasets

This exercise uses the iris dataset to compare several distance measures used in clustering: Euclidean distance, Manhattan distance, the Jaccard coefficient, cosine distance, and edit distance.

Iris dataset used for the distance matrices (first five rows; petal columns and species)
Petal.Length  Petal.Width  Species
1.4           0.2          setosa
1.4           0.2          setosa
1.3           0.2          setosa
1.5           0.2          setosa
1.4           0.2          setosa

Dist function

The dist() function computes and returns the distance matrix obtained by applying the specified distance measure to the rows of a data matrix.

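# Euclidean distance between rows, using Petal.Length and Petal.Width (columns 3 and 4)
# upper and diag only control how a dist object prints; as.matrix() returns the full symmetric matrix
euclid_dist <- as.matrix(dist(iris[, 3:4],
                              method = "euclidean",
                              upper = TRUE,
                              diag = TRUE))
# Show the first 10 rows and 7 columns
euclid_dist[1:10, 1:7]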
##            1         2         3         4         5         6         7
## 1  0.0000000 0.0000000 0.1000000 0.1000000 0.0000000 0.3605551 0.1000000
## 2  0.0000000 0.0000000 0.1000000 0.1000000 0.0000000 0.3605551 0.1000000
## 3  0.1000000 0.1000000 0.0000000 0.2000000 0.1000000 0.4472136 0.1414214
## 4  0.1000000 0.1000000 0.2000000 0.0000000 0.1000000 0.2828427 0.1414214
## 5  0.0000000 0.0000000 0.1000000 0.1000000 0.0000000 0.3605551 0.1000000
## 6  0.3605551 0.3605551 0.4472136 0.2828427 0.3605551 0.0000000 0.3162278
## 7  0.1000000 0.1000000 0.1414214 0.1414214 0.1000000 0.3162278 0.0000000
## 8  0.1000000 0.1000000 0.2000000 0.0000000 0.1000000 0.2828427 0.1414214
## 9  0.0000000 0.0000000 0.1000000 0.1000000 0.0000000 0.3605551 0.1000000
## 10 0.1414214 0.1414214 0.2236068 0.1000000 0.1414214 0.3605551 0.2236068
# Manhattan distance between rows, using the same two petal columns
man_dist <- as.matrix(dist(iris[, 3:4],
                           method = "manhattan",
                           upper = TRUE,
                           diag = TRUE))
# Show the first 10 rows and 10 columns
man_dist[1:10, 1:10]
##      1   2   3   4   5   6   7   8   9  10
## 1  0.0 0.0 0.1 0.1 0.0 0.5 0.1 0.1 0.0 0.2
## 2  0.0 0.0 0.1 0.1 0.0 0.5 0.1 0.1 0.0 0.2
## 3  0.1 0.1 0.0 0.2 0.1 0.6 0.2 0.2 0.1 0.3
## 4  0.1 0.1 0.2 0.0 0.1 0.4 0.2 0.0 0.1 0.1
## 5  0.0 0.0 0.1 0.1 0.0 0.5 0.1 0.1 0.0 0.2
## 6  0.5 0.5 0.6 0.4 0.5 0.0 0.4 0.4 0.5 0.5
## 7  0.1 0.1 0.2 0.2 0.1 0.4 0.0 0.2 0.1 0.3
## 8  0.1 0.1 0.2 0.0 0.1 0.4 0.2 0.0 0.1 0.1
## 9  0.0 0.0 0.1 0.1 0.0 0.5 0.1 0.1 0.0 0.2
## 10 0.2 0.2 0.3 0.1 0.2 0.5 0.3 0.1 0.2 0.0
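
The entries in the two matrices can be checked by hand. The following is a minimal sketch, assuming the usual definitions of Euclidean distance (square root of the summed squared differences) and Manhattan distance (sum of the absolute differences). The names p, q, euclid_pq and man_pq are chosen here for illustration and do not appear in the original exercise.

```r
# Petal.Length and Petal.Width for rows 1 and 6 of iris
p <- unlist(iris[1, 3:4])   # 1.4, 0.2
q <- unlist(iris[6, 3:4])   # 1.7, 0.4

# Euclidean distance: square root of the sum of squared differences
euclid_pq <- sqrt(sum((p - q)^2))

# Manhattan distance: sum of absolute differences
man_pq <- sum(abs(p - q))

round(euclid_pq, 7)   # 0.3605551, the [1, 6] entry of the Euclidean matrix
round(man_pq, 7)      # 0.5, the [1, 6] entry of the Manhattan matrix
```

Both values agree with row 1, column 6 of the printed matrices.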

Distance function

The distance() function in the philentropy package can compute 46 different distance and similarity measures between probability density functions (see ?philentropy::distance for details).

library(philentropy)

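# Cosine similarity: method = "cosine" returns a similarity, so identical rows give 1
# (a cosine distance can be obtained as 1 - cos_sim)
cos_sim <- as.matrix(distance(iris[, 3:4],
                              method = "cosine"))
cos_sim[1:6, 1:7]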
##           v1        v2        v3        v4        v5        v6        v7
## v1 1.0000000 1.0000000 0.9999422 0.9999563 1.0000000 0.9960249 0.9976069
## v2 1.0000000 1.0000000 0.9999422 0.9999563 1.0000000 0.9960249 0.9976069
## v3 0.9999422 0.9999422 1.0000000 0.9997980 0.9999422 0.9969251 0.9982926
## v4 0.9999563 0.9999563 0.9997980 1.0000000 0.9999563 0.9951489 0.9969172
## v5 1.0000000 1.0000000 0.9999422 0.9999563 1.0000000 0.9960249 0.9976069
## v6 0.9960249 0.9960249 0.9969251 0.9951489 0.9960249 1.0000000 0.9998001
# Jaccard distance: method = "jaccard" returns a distance, so identical rows give 0
jac_dist <- as.matrix(distance(iris[, 3:4],
                               method = "jaccard"))
jac_dist[1:6, 1:6]
##             v1          v2          v3          v4          v5         v6
## v1 0.000000000 0.000000000 0.005347594 0.004651163 0.000000000 0.05019305
## v2 0.000000000 0.000000000 0.005347594 0.004651163 0.000000000 0.05019305
## v3 0.005347594 0.005347594 0.000000000 0.019704433 0.005347594 0.08032129
## v4 0.004651163 0.004651163 0.019704433 0.000000000 0.004651163 0.02952030
## v5 0.000000000 0.000000000 0.005347594 0.004651163 0.000000000 0.05019305
## v6 0.050193050 0.050193050 0.080321285 0.029520295 0.050193050 0.00000000
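
Both outputs above can be reproduced in base R for a single pair of rows. This is a sketch under stated assumptions: the cosine formula is the standard one, while the Jaccard formula used here, sum((p - q)^2) / (sum(p^2) + sum(q^2) - sum(p * q)), is the real-valued form that philentropy appears to use. It is included because it reproduces the printed entry, and it should be checked against ?philentropy::distance. The names p, q, cos_sim_pq, cos_dist_pq and jac_dist_pq are illustrative only.

```r
# Petal.Length and Petal.Width for rows 1 and 3 of iris
p <- unlist(iris[1, 3:4])   # 1.4, 0.2
q <- unlist(iris[3, 3:4])   # 1.3, 0.2

# Cosine similarity: dot product divided by the product of the vector norms
cos_sim_pq <- sum(p * q) / (sqrt(sum(p^2)) * sqrt(sum(q^2)))

# Cosine distance: one minus the similarity
cos_dist_pq <- 1 - cos_sim_pq

# Real-valued Jaccard distance (assumed form, see the note above)
jac_dist_pq <- sum((p - q)^2) / (sum(p^2) + sum(q^2) - sum(p * q))

round(cos_sim_pq, 7)    # 0.9999422, the [v1, v3] entry of the cosine output
round(jac_dist_pq, 9)   # 0.005347594, the [v1, v3] entry of the Jaccard output
```

The cosine values are all close to 1 because every petal vector points in nearly the same direction, so this measure separates the flowers far less than the Euclidean or Manhattan distances do.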

Approximate String Distances

The adist() function computes the approximate string distance between character vectors. The distance is a generalized Levenshtein (edit) distance: the minimal, possibly weighted, number of insertions, deletions, and substitutions needed to transform one string into another.

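# Edit distance between the species labels (column 5 of iris)
# y = NULL (the default) means x is compared with itself
edit_dist <- as.matrix(adist(iris[, 5]))
# Rows 48-50 are setosa and rows 51-60 are versicolor; columns 1-12 are all setosa
edit_dist[48:60, 1:12]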
##       [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12]
##  [1,]    0    0    0    0    0    0    0    0    0     0     0     0
##  [2,]    0    0    0    0    0    0    0    0    0     0     0     0
##  [3,]    0    0    0    0    0    0    0    0    0     0     0     0
##  [4,]    8    8    8    8    8    8    8    8    8     8     8     8
##  [5,]    8    8    8    8    8    8    8    8    8     8     8     8
##  [6,]    8    8    8    8    8    8    8    8    8     8     8     8
##  [7,]    8    8    8    8    8    8    8    8    8     8     8     8
##  [8,]    8    8    8    8    8    8    8    8    8     8     8     8
##  [9,]    8    8    8    8    8    8    8    8    8     8     8     8
## [10,]    8    8    8    8    8    8    8    8    8     8     8     8
## [11,]    8    8    8    8    8    8    8    8    8     8     8     8
## [12,]    8    8    8    8    8    8    8    8    8     8     8     8
## [13,]    8    8    8    8    8    8    8    8    8     8     8     8
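
Because iris contains only three distinct species labels, the 150 x 150 matrix above repeats the same few values. A more compact view, sketched here as one possible alternative rather than as part of the original exercise, computes the edit distance on the unique labels only. The names species, species_dist and sv are illustrative.

```r
# The three distinct species labels as plain character strings
species <- unique(as.character(iris$Species))

# Pairwise generalized Levenshtein distances between the labels
species_dist <- adist(species)

# Label rows and columns so the matrix is readable
dimnames(species_dist) <- list(species, species)
species_dist

# Single pair, matching the value printed in the large matrix
sv <- adist("setosa", "versicolor")[1, 1]
sv   # 8
```

The value 8 for setosa against versicolor matches rows 51-60 of the output above, and the zero diagonal corresponds to the setosa against setosa entries in rows 48-50.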