1. Euclidean distance

This is the geometric (straight-line) distance between two points in Euclidean space. Because it is intuitive and easy to interpret, it is widely used to express the dissimilarity between two elements.

Euclidean distance generally requires all features to be numeric!

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
a <- iris[,-5]
dist(a) %>% head() # distances are computed between the rows
#> [1] 0.5385165 0.5099020 0.6480741 0.1414214 0.6164414 0.5196152
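
As a quick check using base R only, the first value printed above can be reproduced by hand: it is the distance between rows 1 and 2 of the data.

# Euclidean distance between rows 1 and 2: square root of the sum of squared differences
sqrt(sum((a[1, ] - a[2, ])^2))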

2. Chebyshev distance

In chess, the king can move vertically, horizontally, or diagonally, so in one move it can reach any of the 8 neighbouring squares. What is the minimum number of moves the king needs to go from square (x1, y1) to square (x2, y2)? That number is the Chebyshev distance.

Formula:

d(x, y) = max_i |x_i - y_i|

a <- iris[,-5]
dist(a,method = "maximum") %>% head() 
#> [1] 0.5 0.4 0.5 0.1 0.4 0.5
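
This can also be reproduced by hand: the Chebyshev distance is simply the largest absolute coordinate difference, exactly the king's minimum number of moves.

# Chebyshev distance between rows 1 and 2: maximum absolute coordinate difference
max(abs(a[1, ] - a[2, ]))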

3. Manhattan distance

As the name suggests, when driving from one intersection to another in Manhattan, the driving distance is obviously not the straight-line distance between the two points. That actual driving distance is the "Manhattan distance", also known as the "city block distance" (City Block distance).

a <- iris[,-5]
dist(a,method = "manhattan") %>% head() 
#> [1] 0.7 0.8 1.0 0.2 1.2 0.7

4. Canberra distance

The Canberra distance, d(x, y) = sum_i |x_i - y_i| / (|x_i| + |y_i|), is insensitive to the measurement scale of the data because every term is a relative difference. It does, however, assume that the variables are independent and ignores correlations between them.

a <- iris[,-5]
dist(a,method = "canberra") %>% head() 
#> [1] 0.09692308 0.12262948 0.14663521 0.02398550 0.51273301 0.26603915
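
To illustrate the scale-insensitivity claim, rescale every feature by a constant: since each term of the Canberra sum is a relative difference, the distances are unchanged, while the Euclidean distances scale with the data. A minimal sketch:

# A change of measurement unit: multiply every feature by 1000
b <- a * 1000
dist(a, method = "canberra") %>% head(3)  # unchanged by the rescaling
dist(b, method = "canberra") %>% head(3)
dist(b) %>% head(3)                       # Euclidean distances are 1000 times larger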

5. Minkowski distance

The Minkowski distance is not a single distance but a whole family of distances. For two n-dimensional vectors a = (a1, a2, …, an) and b = (b1, b2, …, bn) it is defined as:

d(a, b) = (sum_i |a_i - b_i|^p)^(1/p)

where p is a tunable parameter: p = 1 gives the Manhattan distance, p = 2 the Euclidean distance, and p → ∞ the Chebyshev distance.

a <- iris[,-5]
dist(a,method = "minkowski") %>% head() 
#> [1] 0.5385165 0.5099020 0.6480741 0.1414214 0.6164414 0.5196152
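
dist() also accepts the p parameter: p = 1 reproduces the Manhattan distances, and p = 2 (the default, used above) reproduces the Euclidean distances.

dist(a, method = "minkowski", p = 1) %>% head()  # same as method = "manhattan"
dist(a, method = "minkowski", p = 2) %>% head()  # same as method = "euclidean"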

6. Binary variables

With method = "binary", dist() treats non-zero entries as "on", ignores the positions where both vectors are off, and returns the proportion of the remaining positions in which exactly one of the two is on (the asymmetric binary, i.e. Jaccard, distance).

x <- c(0, 0, 1, 1, 1, 1)
y <- c(1, 0, 1, 1, 0, 1)
dist(rbind(x, y), method = "binary")
#>     x
#> y 0.4
## answer 0.4 = 2/5
dist(rbind(x, y), method = "canberra")
#>     x
#> y 2.4
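
Reproducing 0.4 = 2/5 by hand: count the positions where at least one vector is non-zero, and among those the positions where the two disagree.

active   <- (x != 0) | (y != 0)   # 5 positions where at least one is "on"
disagree <- (x != 0) != (y != 0)  # 2 positions where exactly one is "on"
sum(disagree) / sum(active)       # 2/5 = 0.4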

7. The daisy function in the cluster package

With the Gower metric, daisy() can handle data containing both continuous and categorical variables.

library(cluster)

 
as.matrix(daisy(iris[1:5,], metric = "gower")) # Gower handles mixed numeric and factor columns (Species)
#>            1         2         3         4          5
#> 1 0.00000000 0.2466667 0.3600000 0.4333333 0.07333333
#> 2 0.24666667 0.0000000 0.2466667 0.2533333 0.24000000
#> 3 0.36000000 0.2466667 0.0000000 0.2733333 0.35333333
#> 4 0.43333333 0.2533333 0.2733333 0.0000000 0.42666667
#> 5 0.07333333 0.2400000 0.3533333 0.4266667 0.00000000

The package also provides many clustering algorithms (a minimal usage sketch follows the list below):

  1. Agglomerative Nesting (hierarchical clustering) - agnes
  2. Clustering Large Applications - clara
  3. Gap Statistic for Estimating the Number of Clusters - clusGap
  4. DIvisive ANAlysis Clustering - diana
  5. Fuzzy Analysis Clustering - fanny
  6. Partitioning of the data into k clusters "around medoids", a more robust version of k-means - pam
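
A minimal sketch combining the two functions, assuming (as documented in cluster) that pam() accepts a dissimilarity object: compute the Gower dissimilarity on the four numeric columns with daisy(), partition around medoids with k = 3, and compare the clusters with the known species.

d_gower <- daisy(iris[, -5], metric = "gower")
fit <- pam(d_gower, k = 3)          # partitioning around medoids on the dissimilarity
table(fit$clustering, iris$Species) # cross-tabulate clusters against species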

Methods for determining the number of clusters:

  1. Gap Statistic for Estimating the Number of Clusters - maxSE

8. The SimilarityMeasures package

This package implements trajectory similarity measures.

Trajectory similarity is an important quantity in the analysis of moving objects, and how to measure it is the central question. As with ordinary data, the similarity between two trajectories is usually measured by some aggregation of the distances between their points.

Trajectory data are usually stored as a two-column matrix (one row per point).

1. Calculate the Square Distance Between Two Points

It returns the squared distance between two points.

library(SimilarityMeasures)
point1 <- c(0, 2, 4, 6)
point2 <- c(0, 1, 2, 3)

# Calculating the square distance between the two points 
# in 4 dimensions.
DistanceSq(point1, point2, 4)
#> [1] 14
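
DistanceSq() is just the sum of squared coordinate differences, which is easy to verify in base R:

sum((point1 - point2)^2)  # 0 + 1 + 4 + 9 = 14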

2. Run the Edit Distance Algorithm on Two Trajectories

path1 <- matrix(c(0, 1, 2, 3, 0, 1, 2, 3), 4)
path2 <- matrix(c(0, 1, 2, 3, 4, 5, 6, 7), 4)

# Running the edit distance algorithm with point distance 
# set to 2.
EditDist(path1, path2, 2)
#> [1] 4

3. Run the Frechet Calculation Algorithm on Two Trajectories

# Creating two trajectories.
path1 <- matrix(c(0, 1, 2, 3, 0, 1, 2, 3), 4)
path2 <- matrix(c(0, 1, 2, 3, 4, 5, 6, 7), 4)

# Running the Frechet distance algorithm.
Frechet(path1, path2)
#> [1] 4

4. Run the LCSS Algorithm on Two Trajectories Allowing Translations

# Creating two trajectories.
path1 <- matrix(c(0, 1, 2, 3, 0, 1, 2, 3), 4)
path2 <- matrix(c(0, 1, 2, 3, 4, 5, 6, 7), 4)

# Running the LCSS algorithm on the trajectories.
LCSS(path1, path2, 2, 2, 0.5)
#> [1] 4

9. The proxy package

List the available measures:

library(proxy)
#> 
#> Attaching package: 'proxy'
#> The following objects are masked from 'package:stats':
#> 
#>     as.dist, dist
#> The following object is masked from 'package:base':
#> 
#>     as.matrix
summary(pr_DB)
#> * Similarity measures:
#> angular, Braun-Blanquet, Chi-squared, correlation, cosine, Cramer,
#> Dice, eDice, eJaccard, Fager, Faith, Gower, Hamman, Jaccard,
#> Kulczynski1, Kulczynski2, Michael, Mountford, Mozley, Ochiai, Pearson,
#> Phi, Phi-squared, Russel, simple matching, Simpson, Stiles, Tanimoto,
#> Tschuprow, Yule, Yule2
#> 
#> * Distance measures:
#> Bhjattacharyya, Bray, Canberra, Chord, divergence, Euclidean, fJaccard,
#> Geodesic, Hellinger, Kullback, Levenshtein, Mahalanobis, Manhattan,
#> Minkowski, Podani, Soergel, supremum, Wave, Whittaker

Get more information about a particular measure:

### get more information about a particular one
pr_DB$get_entry("Minkowski")
#>       names Minkowski, Lp
#>         FUN R_minkowski_dist
#>    distance TRUE
#>      PREFUN pr_Minkowski_prefun
#>     POSTFUN NA
#>     convert pr_dist2simil
#>        type metric
#>        loop FALSE
#>       C_FUN TRUE
#>     PACKAGE proxy
#>        abcd FALSE
#>     formula (sum_i (x_i - y_i)^p)^(1/p)
#>   reference Cox, T.F., and Cox, M.A.A. (2001. Multidimensional Scaling.
#>             Chapmann and Hall.
#> description The Minkowski Distance (C implementation with compensation
#>             for excluded components)
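
Any measure registered in pr_DB can then be used by name as the method argument of proxy's dist() or simil(). A minimal sketch, assuming the Canberra entry shown in the list above:

dist(iris[1:3, -5], method = "Canberra")  # proxy::dist, which masks stats::dist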

Compute similarities:

simil(iris[,-5],method = "angular") %>% as.matrix() %>% head(2)
#>           1         2         3         4         5         6         7
#> 1        NA 0.9830297 0.9983988 0.9864988 0.9929924 0.9857816 0.9863416
#> 2 0.9830297        NA 0.9843491 0.9843651 0.9763486 0.9750367 0.9725351
#>           8        9        10        11        12        13        14
#> 1 0.9933168 0.986761 0.9841755 0.9989573 0.9855685 0.9846035 0.9930908
#> 2 0.9851632 0.987761 0.9916033 0.9834635 0.9777169 0.9938800 0.9782626
#>          15        16        17        18        19        20        21
#> 1 0.9828032 0.9803846 0.9851096 0.9949888 0.9914763 0.9867918 0.9828618
#> 2 0.9718794 0.9642312 0.9710386 0.9825171 0.9876802 0.9715024 0.9921042
#>          22        23        24        25        26        27        28
#> 1 0.9869477 0.9741827 0.9764484 0.9710550 0.9762942 0.9845258 0.9951870
#> 2 0.9741813 0.9580035 0.9804458 0.9689326 0.9904312 0.9813561 0.9865919
#>          29        30        31        32        33        34        35
#> 1 0.9930478 0.9834912 0.9817639 0.9843587 0.9787663 0.9825811 0.9848879
#> 2 0.9892400 0.9813546 0.9868892 0.9901833 0.9625006 0.9656629 0.9936099
#>          36        37        38        39        40        41        42
#> 1 0.9875766 0.9864303 0.9888018 0.9943369 0.9930673 0.9927684 0.9582555
#> 2 0.9859731 0.9854883 0.9730000 0.9848185 0.9880384 0.9783597 0.9747957
#>          43        44        45        46        47        48        49
#> 1 0.9906166 0.9769957 0.9737340 0.9840032 0.9857631 0.9925909 0.9969415
#> 2 0.9753911 0.9723538 0.9665975 0.9936207 0.9717843 0.9818579 0.9810418
#>          50        51        52        53        54        55        56
#> 1 0.9939616 0.8787985 0.8767640 0.8699271 0.8611813 0.8665980 0.8618849
#> 2 0.9890510 0.8891779 0.8857675 0.8800797 0.8717046 0.8770543 0.8706692
#>          57        58        59        60        61        62        63
#> 1 0.8710546 0.8828167 0.8720593 0.8686078 0.8652064 0.8747268 0.8669261
#> 2 0.8792161 0.8924188 0.8826212 0.8767809 0.8764698 0.8833765 0.8792391
#>          64        65        66        67        68        69        70
#> 1 0.8628367 0.8893923 0.8814207 0.8621151 0.8757166 0.8509198 0.8732403
#> 2 0.8720305 0.8982969 0.8917362 0.8697672 0.8857771 0.8624840 0.8836673
#>          71        72        73        74        75        76        77
#> 1 0.8583561 0.8807051 0.8500239 0.8626615 0.8778953 0.8782948 0.8657454
#> 2 0.8656949 0.8910803 0.8605736 0.8723120 0.8883378 0.8886894 0.8767163
#>          78        79        80        81        82        83        84
#> 1 0.8611908 0.8656952 0.8902768 0.8724049 0.8766608 0.8793496 0.8446785
#> 2 0.8709422 0.8747360 0.9013573 0.8830589 0.8875703 0.8894959 0.8538043
#>          85        86        87        88        89        90        91
#> 1 0.8588690 0.8741404 0.8727953 0.8592435 0.8764762 0.8664927 0.8584279
#> 2 0.8659967 0.8812316 0.8826930 0.8710732 0.8845189 0.8763204 0.8675625
#>          92        93        94        95        96        97        98
#> 1 0.8676405 0.8739236 0.8811596 0.8668719 0.8761385 0.8727477 0.8758999
#> 2 0.8765825 0.8843065 0.8915685 0.8760260 0.8844786 0.8814127 0.8858268
#>          99       100       101       102       103       104       105
#> 1 0.8964148 0.8734263 0.8295873 0.8381176 0.8417902 0.8390350 0.8350159
#> 2 0.9065453 0.8825743 0.8367046 0.8466236 0.8513442 0.8477548 0.8435036
#>         106       107       108       109       110       111       112
#> 1 0.8339144 0.8364510 0.8368224 0.8316990 0.8451486 0.8567023 0.8420301
#> 2 0.8438151 0.8440218 0.8468239 0.8419688 0.8530544 0.8651199 0.8516031
#>         113       114       115       116       117       118       119
#> 1 0.8464943 0.8332577 0.8329866 0.8467428 0.8460511 0.8458024 0.8205222
#> 2 0.8557646 0.8421141 0.8408265 0.8546356 0.8549517 0.8540089 0.8310123
#>         120       121       122       123       124       125       126
#> 1 0.8364867 0.8449269 0.8404612 0.8302110 0.8511738 0.8465187 0.8475273
#> 2 0.8471114 0.8536103 0.8482617 0.8406293 0.8609328 0.8546967 0.8568925
#>         127       128       129       130       131       132       133
#> 1 0.8545021 0.8549023 0.8351637 0.8496409 0.8391304 0.8556415 0.8339724
#> 2 0.8638349 0.8633284 0.8441457 0.8597485 0.8496029 0.8644329 0.8428858
#>         134       135       136       137       138       139       140
#> 1 0.8519413 0.8335717 0.8424384 0.8408979 0.8466156 0.8559859 0.8517638
#> 2 0.8614765 0.8428927 0.8527065 0.8478738 0.8550402 0.8642394 0.8610249
#>         141       142       143       144       145       146       147
#> 1 0.8412926 0.8553618 0.8381176 0.8395422 0.8415345 0.8489861 0.8430618
#> 2 0.8498406 0.8646212 0.8466236 0.8479531 0.8494488 0.8580733 0.8531901
#>         148       149       150
#> 1 0.8505477 0.8450978 0.8470100
#> 2 0.8594950 0.8519706 0.8548892

10. String distance

Levenshtein distance (edit distance) describes how much two strings differ.

Edit distance is the number of single-character edits (insertions, deletions, substitutions) needed to transform one string into another. In R it is computed with adist():

adist("exactly the same","exactly the same") # edit distance 0 
#>      [,1]
#> [1,]    0

adist("exactly the same","totally different") # edit distance 14
#>      [,1]
#> [1,]   14

The stringdist() function in the stringdist package can do the same:

library(stringdist)

#calculate Levenshtein distance between two strings
stringdist("exactly the same", "totally different", method = "lv")
#> [1] 14

The stringdist package also provides many other string-processing tools, such as approximate string matching and fuzzy search.
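
For example, approximate ("fuzzy") matching against a table of strings can be done with amatch() and ain() from the same package. A minimal sketch; the toy strings and the maxDist cut-off of 2 edits are arbitrary choices for illustration:

# Index of the closest table entry within at most 2 edits (NA if none),
# and a logical version of the same lookup
amatch("colour", c("color", "colors", "column"), method = "lv", maxDist = 2)
ain("colour", c("color", "colors", "column"), method = "lv", maxDist = 2)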

The package provides the following distances, each described in detail in its documentation; a short comparison sketch follows the list:

  1. osa: Optimal string alignment (restricted Damerau-Levenshtein distance). Like the Levenshtein distance, but transpositions of adjacent characters are also allowed; each substring may be edited only once.
  2. lv: Levenshtein (edit) distance. Counts the deletions, insertions, and substitutions needed to turn b into a.
  3. dl: Full Damerau-Levenshtein distance. Like osa, but a substring may be edited more than once.
  4. hamming: Hamming distance (a and b must have the same number of characters). Counts the character substitutions needed to turn b into a; returns Inf when the strings differ in length.
  5. lcs: Longest common substring distance. The longest common substring is the longest string obtained by pairing characters from a and b while keeping their order; the lcs distance is the number of unpaired characters. It is equivalent to an edit distance that allows only deletions and insertions, each with weight 1.
  6. qgram: q-gram distance. A q-gram profile is a bag-of-words style representation: the string is first decomposed into all substrings of length q, and the distance counts the q-grams in which the two profiles differ. See https://riino.site/2021/02/26/Q-Gram.html
  7. cosine: Cosine distance between q-gram profiles; an implementation of cosine distance particularly suited to fuzzy text search.
  8. jaccard: Jaccard distance between q-gram profiles.
  9. jw: Jaro, or Jaro-Winkler, distance.
  10. soundex: Distance based on Soundex encoding; the strings are first converted to a Soundex phonetic code and the distance is then computed between the codes.
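
A short comparison sketch on toy strings; the comments only restate behaviour described in the list above:

stringdist("ca", "abc", method = "osa")      # one transposition per substring allowed
stringdist("ca", "abc", method = "dl")       # may edit a transposed substring again, so never larger than osa
stringdist("ab", "abc", method = "hamming")  # unequal lengths, so Inf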

Once a distance matrix has been computed, any clustering algorithm can be applied to it.
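
A minimal sketch: build a Levenshtein distance matrix over a small set of strings with stringdistmatrix() (assuming, as documented, that the two-argument call returns a plain numeric matrix) and pass it to ordinary hierarchical clustering.

words <- c("apple", "apples", "applet", "banana", "bananas", "cherry")
m <- stringdistmatrix(words, words, method = "lv")  # full pairwise distance matrix
rownames(m) <- colnames(m) <- words
hc <- hclust(as.dist(m))  # hierarchical clustering on the string distances
plot(hc)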

For the details of each method, consult the original papers!