Complete all exercises and submit your answers to VtopBeta.
Use the iris dataset to compare the distance measures commonly used in clustering: Euclidean distance, Manhattan distance, Jaccard coefficient, cosine distance, and edit distance.
| Petal.Length | Petal.Width | Species |
|---|---|---|
| 1.4 | 0.2 | setosa |
| 1.4 | 0.2 | setosa |
| 1.3 | 0.2 | setosa |
| 1.5 | 0.2 | setosa |
| 1.4 | 0.2 | setosa |
The dist function returns the distance matrix obtained by applying the specified distance measure to every pair of rows of a data matrix.
# Euclidean Distance
euclid_dist <- as.matrix(dist(iris[, 3:4],
                              method = "euclidean",
                              upper = TRUE,
                              diag = TRUE))
euclid_dist[1:10, 1:7]
## 1 2 3 4 5 6 7
## 1 0.0000000 0.0000000 0.1000000 0.1000000 0.0000000 0.3605551 0.1000000
## 2 0.0000000 0.0000000 0.1000000 0.1000000 0.0000000 0.3605551 0.1000000
## 3 0.1000000 0.1000000 0.0000000 0.2000000 0.1000000 0.4472136 0.1414214
## 4 0.1000000 0.1000000 0.2000000 0.0000000 0.1000000 0.2828427 0.1414214
## 5 0.0000000 0.0000000 0.1000000 0.1000000 0.0000000 0.3605551 0.1000000
## 6 0.3605551 0.3605551 0.4472136 0.2828427 0.3605551 0.0000000 0.3162278
## 7 0.1000000 0.1000000 0.1414214 0.1414214 0.1000000 0.3162278 0.0000000
## 8 0.1000000 0.1000000 0.2000000 0.0000000 0.1000000 0.2828427 0.1414214
## 9 0.0000000 0.0000000 0.1000000 0.1000000 0.0000000 0.3605551 0.1000000
## 10 0.1414214 0.1414214 0.2236068 0.1000000 0.1414214 0.3605551 0.2236068
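As a sanity check on the matrix above, the entry for flowers 1 and 6 can be reproduced by hand. This is a minimal sketch using only base R; the names `a`, `b`, `manual_euclid`, and `from_dist` are illustrative and do not appear in the exercise.

```r
# Petal length and width for flowers 1 and 6 of the built-in iris data
a <- unlist(iris[1, 3:4])
b <- unlist(iris[6, 3:4])

# Euclidean distance: square root of the summed squared differences
manual_euclid <- sqrt(sum((a - b)^2))

# The same pair taken from the matrix that dist() builds
from_dist <- as.matrix(dist(iris[, 3:4], method = "euclidean"))[1, 6]

print(manual_euclid)                                # 0.3605551
print(isTRUE(all.equal(manual_euclid, from_dist)))  # TRUE
```

The differences are 0.3 in petal length and 0.2 in petal width, so the result is sqrt(0.09 + 0.04), which agrees with row 1, column 6 of the output.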
# Manhattan Distance
man_dist <- as.matrix(dist(iris[, 3:4],
                           method = "manhattan",
                           upper = TRUE,
                           diag = TRUE))
man_dist[1:10, 1:10]
## 1 2 3 4 5 6 7 8 9 10
## 1 0.0 0.0 0.1 0.1 0.0 0.5 0.1 0.1 0.0 0.2
## 2 0.0 0.0 0.1 0.1 0.0 0.5 0.1 0.1 0.0 0.2
## 3 0.1 0.1 0.0 0.2 0.1 0.6 0.2 0.2 0.1 0.3
## 4 0.1 0.1 0.2 0.0 0.1 0.4 0.2 0.0 0.1 0.1
## 5 0.0 0.0 0.1 0.1 0.0 0.5 0.1 0.1 0.0 0.2
## 6 0.5 0.5 0.6 0.4 0.5 0.0 0.4 0.4 0.5 0.5
## 7 0.1 0.1 0.2 0.2 0.1 0.4 0.0 0.2 0.1 0.3
## 8 0.1 0.1 0.2 0.0 0.1 0.4 0.2 0.0 0.1 0.1
## 9 0.0 0.0 0.1 0.1 0.0 0.5 0.1 0.1 0.0 0.2
## 10 0.2 0.2 0.3 0.1 0.2 0.5 0.3 0.1 0.2 0.0
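The Manhattan distance sums the absolute differences instead of squaring them, so it can never be smaller than the Euclidean distance for the same pair. The sketch below checks both points in base R; the variable names are illustrative only.

```r
# Petal length and width for flowers 1 and 6
a <- unlist(iris[1, 3:4])
b <- unlist(iris[6, 3:4])

# Manhattan distance: sum of absolute coordinate differences (0.3 + 0.2)
manual_manhattan <- sum(abs(a - b))

# Full matrices for both measures, built with dist()
manhattan_all <- as.matrix(dist(iris[, 3:4], method = "manhattan"))
euclidean_all <- as.matrix(dist(iris[, 3:4], method = "euclidean"))

print(isTRUE(all.equal(manual_manhattan, manhattan_all[1, 6])))  # TRUE

# Manhattan >= Euclidean for every pair (small tolerance for rounding)
print(all(manhattan_all >= euclidean_all - 1e-12))               # TRUE
```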
The distance() function in the philentropy package can compute 46 different distance and similarity measures between probability density functions (see ?philentropy::distance for details). Each row of the input is treated as one vector.
library(philentropy)
# Cosine similarity (identical rows score 1; use 1 - similarity for a distance)
cos_sim <- as.matrix(distance(iris[, 3:4],
                              method = "cosine"))
cos_sim[1:6, 1:7]
## v1 v2 v3 v4 v5 v6 v7
## v1 1.0000000 1.0000000 0.9999422 0.9999563 1.0000000 0.9960249 0.9976069
## v2 1.0000000 1.0000000 0.9999422 0.9999563 1.0000000 0.9960249 0.9976069
## v3 0.9999422 0.9999422 1.0000000 0.9997980 0.9999422 0.9969251 0.9982926
## v4 0.9999563 0.9999563 0.9997980 1.0000000 0.9999563 0.9951489 0.9969172
## v5 1.0000000 1.0000000 0.9999422 0.9999563 1.0000000 0.9960249 0.9976069
## v6 0.9960249 0.9960249 0.9969251 0.9951489 0.9960249 1.0000000 0.9998001
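The diagonal of 1s shows that these values are cosine similarities rather than distances. A minimal base R sketch, with illustrative variable names, reproduces the entry for flowers 1 and 3 and converts it to a distance using the common convention of 1 minus the similarity.

```r
# Petal length and width for flowers 1 and 3
a <- unlist(iris[1, 3:4])
b <- unlist(iris[3, 3:4])

# Cosine similarity: dot product divided by the product of the vector norms
manual_cos_sim <- sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

# One common way to turn the similarity into a distance
manual_cos_dist <- 1 - manual_cos_sim

print(manual_cos_sim)   # about 0.9999422, matching v1/v3 above
print(manual_cos_dist)
```

Because every petal measurement is positive and the two features scale together, all rows point in nearly the same direction, which is why the similarities sit so close to 1.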
# Jaccard distance (identical rows score 0, as the diagonal shows)
jac_dist <- as.matrix(distance(iris[, 3:4],
                               method = "jaccard"))
jac_dist[1:6, 1:6]
## v1 v2 v3 v4 v5 v6
## v1 0.000000000 0.000000000 0.005347594 0.004651163 0.000000000 0.05019305
## v2 0.000000000 0.000000000 0.005347594 0.004651163 0.000000000 0.05019305
## v3 0.005347594 0.005347594 0.000000000 0.019704433 0.005347594 0.08032129
## v4 0.004651163 0.004651163 0.019704433 0.000000000 0.004651163 0.02952030
## v5 0.000000000 0.000000000 0.005347594 0.004651163 0.000000000 0.05019305
## v6 0.050193050 0.050193050 0.080321285 0.029520295 0.050193050 0.00000000
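The zeros on the diagonal indicate that the values are distances, not similarities. The sketch below assumes the real-valued Jaccard form that philentropy documents for this method, 1 - sum(xy) / (sum(x^2) + sum(y^2) - sum(xy)), and reproduces the entry for flowers 1 and 3 in base R. The variable names are illustrative.

```r
# Petal length and width for flowers 1 and 3
a <- unlist(iris[1, 3:4])
b <- unlist(iris[3, 3:4])

# Assumed formula: Jaccard distance for real-valued vectors
dot_ab <- sum(a * b)
manual_jaccard <- 1 - dot_ab / (sum(a^2) + sum(b^2) - dot_ab)

print(manual_jaccard)   # about 0.005347594, matching v1/v3 above

# A vector compared with itself gives distance 0
self_jaccard <- 1 - sum(a * a) / (sum(a^2) + sum(a^2) - sum(a * a))
print(self_jaccard)     # 0
```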
The adist function computes the approximate string distance between character vectors. The distance is a generalized Levenshtein (edit) distance: the minimal, possibly weighted, number of insertions, deletions, and substitutions needed to transform one string into another.
# Edit Distance
# Species is a factor, so convert it to character explicitly.
# y = NULL (the default) compares x with itself.
edit_dist <- as.matrix(adist(as.character(iris[, 5])))
edit_dist[48:60, 1:12]
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12]
## [1,] 0 0 0 0 0 0 0 0 0 0 0 0
## [2,] 0 0 0 0 0 0 0 0 0 0 0 0
## [3,] 0 0 0 0 0 0 0 0 0 0 0 0
## [4,] 8 8 8 8 8 8 8 8 8 8 8 8
## [5,] 8 8 8 8 8 8 8 8 8 8 8 8
## [6,] 8 8 8 8 8 8 8 8 8 8 8 8
## [7,] 8 8 8 8 8 8 8 8 8 8 8 8
## [8,] 8 8 8 8 8 8 8 8 8 8 8 8
## [9,] 8 8 8 8 8 8 8 8 8 8 8 8
## [10,] 8 8 8 8 8 8 8 8 8 8 8 8
## [11,] 8 8 8 8 8 8 8 8 8 8 8 8
## [12,] 8 8 8 8 8 8 8 8 8 8 8 8
## [13,] 8 8 8 8 8 8 8 8 8 8 8 8
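The Species column holds only three distinct strings, so the 150 x 150 matrix above repeats a handful of values. A smaller sketch on the three species names makes the pattern easier to read; `species` and `species_dist` are illustrative names.

```r
# The three distinct labels in iris$Species
species <- c("setosa", "versicolor", "virginica")

# Pairwise generalized Levenshtein distances with the default unit costs
species_dist <- adist(species)
dimnames(species_dist) <- list(species, species)

print(species_dist)

# setosa vs. versicolor is the value 8 seen in rows 51 to 60 above
print(species_dist["setosa", "versicolor"])   # 8
```

Rows 48 to 50 of iris are setosa and rows 51 to 60 are versicolor, which explains the switch from 0 to 8 in the output above when compared against the first twelve flowers, all of them setosa.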