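The Euclidean distance is the geometric distance between two elements in Euclidean space. Because it is intuitive and easy to interpret, it is widely used to measure the dissimilarity between two elements described by scalar features.
The Euclidean distance generally requires all features to be numeric.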
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
a <- iris[,-5]
dist(a) %>% head() # distances are computed between rows
#> [1] 0.5385165 0.5099020 0.6480741 0.1414214 0.6164414 0.5196152
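To make the first value of the output above concrete, here is a minimal sketch that recomputes the Euclidean distance between rows 1 and 2 of iris by hand and compares it with dist(). The names x, y, manual and builtin are illustrative only.

```r
# Numeric features only: drop the Species factor column.
a <- iris[, -5]

# Rows 1 and 2 as plain numeric vectors.
x <- unlist(a[1, ])
y <- unlist(a[2, ])

# Square root of the sum of squared coordinate differences.
manual <- sqrt(sum((x - y)^2))
manual

# The same pair taken from the full distance matrix.
builtin <- as.matrix(dist(a))[1, 2]
all.equal(manual, builtin)
```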
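In chess, the king can move horizontally, vertically, or diagonally, so in one move it can reach any of the eight adjacent squares. What is the minimum number of moves the king needs to get from square (x1, y1) to square (x2, y2)? That number is the Chebyshev distance.
Formula: $d = \max(|x_1 - x_2|, |y_1 - y_2|)$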
a <- iris[,-5]
dist(a,method = "maximum") %>% head()
#> [1] 0.5 0.4 0.5 0.1 0.4 0.5
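The king example can be sketched in a few lines. The helper chebyshev() is a hypothetical name introduced here, not part of any package; it takes the largest absolute coordinate difference, which is what method = "maximum" computes for each pair of rows.

```r
# Hypothetical helper: largest absolute coordinate difference.
chebyshev <- function(p, q) max(abs(p - q))

# King moving from square (1, 1) to square (4, 6):
# 3 files and 5 ranks apart, so 5 moves are needed.
king_moves <- chebyshev(c(1, 1), c(4, 6))
king_moves

# Rows 1 and 2 of iris, matching the first value printed by
# dist(a, method = "maximum").
a <- iris[, -5]
iris_cheb <- chebyshev(unlist(a[1, ]), unlist(a[2, ]))
iris_cheb
```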
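As the name suggests, when you drive from one intersection to another in Manhattan, the driving distance is clearly not the straight-line distance between the two points. The actual driving distance is the "Manhattan distance", also known as the "city block distance".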
a <- iris[,-5]
dist(a,method = "manhattan") %>% head()
#> [1] 0.7 0.8 1.0 0.2 1.2 0.7
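As a small illustration, the sketch below computes the Manhattan distance as the sum of absolute coordinate differences. The helper manhattan() is a hypothetical name used only for this example.

```r
# Hypothetical helper: sum of absolute coordinate differences.
manhattan <- function(p, q) sum(abs(p - q))

# Driving from intersection (1, 1) to intersection (4, 6):
# 3 blocks in one direction plus 5 in the other.
blocks <- manhattan(c(1, 1), c(4, 6))
blocks

# Rows 1 and 2 of iris, matching the first value printed by
# dist(a, method = "manhattan").
a <- iris[, -5]
iris_manh <- manhattan(unlist(a[1, ]), unlist(a[2, ]))
iris_manh
```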
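The Lance-Williams distance, computed in R as the Canberra distance, is insensitive to the units of the data. However, it assumes that the variables are independent of one another and does not take correlations between them into account.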
a <- iris[,-5]
dist(a,method = "canberra") %>% head()
#> [1] 0.09692308 0.12262948 0.14663521 0.02398550 0.51273301 0.26603915
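The unit-insensitivity claim can be checked directly. The sketch below assumes the usual Canberra form, the sum of |x_i - y_i| / |x_i + y_i|, and rescales one column as if it were converted from centimetres to millimetres. The helper canberra() and the copy b are illustrative names.

```r
# Hypothetical helper: each term is a ratio, so a common scale
# factor on one variable cancels out.
canberra <- function(p, q) sum(abs(p - q) / abs(p + q))

a <- iris[, -5]
manual <- canberra(unlist(a[1, ]), unlist(a[2, ]))
manual

# Change the unit of a single column (cm to mm).
b <- a
b$Sepal.Length <- b$Sepal.Length * 10

# Canberra distances are unchanged by the rescaling ...
same_canberra <- isTRUE(all.equal(as.vector(dist(a, method = "canberra")),
                                  as.vector(dist(b, method = "canberra"))))
# ... while Euclidean distances are not.
same_euclid <- isTRUE(all.equal(as.vector(dist(a)), as.vector(dist(b))))
c(canberra = same_canberra, euclidean = same_euclid)
```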
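The Minkowski distance is not a single distance but a family of distances. For two n-dimensional variables a = (a1, a2, …, an) and b = (b1, b2, …, bn), it is defined as:
$$d(a, b) = \left( \sum_{i=1}^{n} |a_i - b_i|^p \right)^{1/p}$$
where p is a variable parameter:
when p = 1, it is the Manhattan distance;
when p = 2, it is the Euclidean distance;
as p → ∞, it becomes the Chebyshev distance.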
a <- iris[,-5]
dist(a,method = "minkowski") %>% head()
#> [1] 0.5385165 0.5099020 0.6480741 0.1414214 0.6164414 0.5196152
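The output above equals the Euclidean result because dist() uses p = 2 by default. A short sketch of the three special cases follows, using only the first five rows to keep it small; p = 200 stands in for "p → ∞" and is an arbitrary choice.

```r
a <- iris[1:5, -5]

d1   <- dist(a, method = "minkowski", p = 1)
d2   <- dist(a, method = "minkowski", p = 2)
dbig <- dist(a, method = "minkowski", p = 200)

# p = 1 reproduces the Manhattan distance.
p1_ok <- isTRUE(all.equal(as.vector(d1),
                          as.vector(dist(a, method = "manhattan"))))
# p = 2 reproduces the Euclidean distance.
p2_ok <- isTRUE(all.equal(as.vector(d2),
                          as.vector(dist(a, method = "euclidean"))))
# A large p approaches the Chebyshev (maximum) distance.
gap <- max(abs(as.vector(dbig) - as.vector(dist(a, method = "maximum"))))

c(p1 = p1_ok, p2 = p2_ok)
gap
```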
x <- c(0, 0, 1, 1, 1, 1)
y <- c(1, 0, 1, 1, 0, 1)
dist(rbind(x, y), method = "binary")
#> x
#> y 0.4
## answer 0.4 = 2/5
dist(rbind(x, y), method = "canberra")
#> x
#> y 2.4
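Both results above can be reproduced by hand. The sketch below follows the behaviour described in the dist() documentation: the binary distance is the share of positions where exactly one vector is non-zero among positions where at least one is non-zero, and the Canberra distance drops 0/0 terms and scales the sum up by the number of columns over the number of terms used. Variable names are illustrative.

```r
x <- c(0, 0, 1, 1, 1, 1)
y <- c(1, 0, 1, 1, 0, 1)

# Binary distance: positions 1, 3, 4, 5, 6 have at least one "on" bit (5),
# and positions 1 and 5 have exactly one (2), giving 2 / 5.
either_on <- (x != 0) | (y != 0)
only_one  <- xor(x != 0, y != 0)
binary_dist <- sum(only_one) / sum(either_on)
binary_dist

# Canberra distance: position 2 is 0/0 and is dropped, the remaining
# terms sum to 2, and the sum is rescaled by 6 / 5.
terms <- abs(x - y) / abs(x + y)
valid <- !is.nan(terms)
canberra_dist <- sum(terms[valid]) * length(terms) / sum(valid)
canberra_dist
```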
The Gower distance can handle both continuous and discrete variables.
library(cluster)
as.matrix(daisy(iris[1:5,], metric = "gower"))
#>            1         2         3         4          5
#> 1 0.00000000 0.2466667 0.3600000 0.4333333 0.07333333
#> 2 0.24666667 0.0000000 0.2466667 0.2533333 0.24000000
#> 3 0.36000000 0.2466667 0.0000000 0.2733333 0.35333333
#> 4 0.43333333 0.2533333 0.2733333 0.0000000 0.42666667
#> 5 0.07333333 0.2400000 0.3533333 0.4266667 0.00000000
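To show how mixed column types enter the result, here is a rough base-R sketch of the Gower computation for one pair of rows. It assumes equal weights, range-scaled absolute differences for numeric columns, a 0/1 mismatch for factors, and a zero contribution from constant columns. These assumptions reproduce the values printed above, but gower_pair() is a hypothetical helper and not the daisy() implementation.

```r
d <- iris[1:5, ]
is_num <- sapply(d, is.numeric)

# Hypothetical helper: average per-column dissimilarity between rows i and j.
gower_pair <- function(i, j) {
  contrib <- numeric(ncol(d))
  for (k in seq_along(d)) {
    if (is_num[k]) {
      r <- diff(range(d[[k]]))
      # Numeric column: absolute difference scaled by the column range.
      contrib[k] <- if (r > 0) abs(d[i, k] - d[j, k]) / r else 0
    } else {
      # Categorical column: 0 if the levels match, 1 otherwise.
      contrib[k] <- as.numeric(d[i, k] != d[j, k])
    }
  }
  mean(contrib)
}

gower_pair(1, 2)
gower_pair(1, 5)
```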
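The cluster package also provides many clustering algorithms.
Methods for determining the number of clusters
The SimilarityMeasures package provides trajectory similarity measures.
Trajectory similarity is an important metric for the analysis of moving objects, and how to measure it is the central question. As with ordinary data, the similarity between two trajectories is usually measured by some aggregate of the distances between their points.
Trajectory data is usually a two-column matrix, with one row per point.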
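As a rough illustration of aggregating point distances, the sketch below builds two four-point trajectories as two-column matrices and summarises the distances between corresponding points. Pairing points by row index and using the maximum or mean are simplifying assumptions made for this example, not what any particular package function does.

```r
# Two trajectories, one row per point, one column per coordinate.
path1 <- matrix(c(0, 1, 2, 3, 0, 1, 2, 3), 4)
path2 <- matrix(c(0, 1, 2, 3, 4, 5, 6, 7), 4)

# Euclidean distance between the i-th point of each trajectory.
pointwise <- sqrt(rowSums((path1 - path2)^2))
pointwise

# Two simple ways to aggregate the point distances into one number.
max(pointwise)
mean(pointwise)
```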
DistanceSq returns the squared distance between two points.
library(SimilarityMeasures)
point1 <- c(0, 2, 4, 6)
point2 <- c(0, 1, 2, 3)
# Calculating the square distance between the two points
# in 4 dimensions.
DistanceSq(point1, point2, 4)
#> [1] 14
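The value 14 can be checked in base R, assuming the squared distance is the plain sum of squared coordinate differences (which agrees with the output above).

```r
point1 <- c(0, 2, 4, 6)
point2 <- c(0, 1, 2, 3)

# 0^2 + 1^2 + 2^2 + 3^2
sq_dist <- sum((point1 - point2)^2)
sq_dist
```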
path1 <- matrix(c(0, 1, 2, 3, 0, 1, 2, 3), 4)
path2 <- matrix(c(0, 1, 2, 3, 4, 5, 6, 7), 4)
# Running the edit distance algorithm with point distance
# set to 2.
EditDist(path1, path2, 2)
#> [1] 4
# Creating two trajectories.
path1 <- matrix(c(0, 1, 2, 3, 0, 1, 2, 3), 4)
path2 <- matrix(c(0, 1, 2, 3, 4, 5, 6, 7), 4)
# Running the Frechet distance algorithm.
Frechet(path1, path2)
#> [1] 4
# Creating two trajectories.
path1 <- matrix(c(0, 1, 2, 3, 0, 1, 2, 3), 4)
path2 <- matrix(c(0, 1, 2, 3, 4, 5, 6, 7), 4)
# Running the LCSS algorithm on the trajectories.
LCSS(path1, path2, 2, 2, 0.5)
#> [1] 4
List the available methods:
library(proxy)
#>
#> Attaching package: 'proxy'
#> The following objects are masked from 'package:stats':
#>
#> as.dist, dist
#> The following object is masked from 'package:base':
#>
#> as.matrix
summary(pr_DB)
#> * Similarity measures:
#> angular, Braun-Blanquet, Chi-squared, correlation, cosine, Cramer,
#> Dice, eDice, eJaccard, Fager, Faith, Gower, Hamman, Jaccard,
#> Kulczynski1, Kulczynski2, Michael, Mountford, Mozley, Ochiai, Pearson,
#> Phi, Phi-squared, Russel, simple matching, Simpson, Stiles, Tanimoto,
#> Tschuprow, Yule, Yule2
#>
#> * Distance measures:
#> Bhjattacharyya, Bray, Canberra, Chord, divergence, Euclidean, fJaccard,
#> Geodesic, Hellinger, Kullback, Levenshtein, Mahalanobis, Manhattan,
#> Minkowski, Podani, Soergel, supremum, Wave, Whittaker
Get more information:
### get more information about a particular one
pr_DB$get_entry("Minkowski")
#> names Minkowski, Lp
#> FUN R_minkowski_dist
#> distance TRUE
#> PREFUN pr_Minkowski_prefun
#> POSTFUN NA
#> convert pr_dist2simil
#> type metric
#> loop FALSE
#> C_FUN TRUE
#> PACKAGE proxy
#> abcd FALSE
#> formula (sum_i (x_i - y_i)^p)^(1/p)
#> reference Cox, T.F., and Cox, M.A.A. (2001. Multidimensional Scaling.
#> Chapmann and Hall.
#> description The Minkowski Distance (C implementation with compensation
#> for excluded components)
Compute similarities:
simil(iris[,-5],method = "angular") %>% as.matrix() %>% head(2)
#> 1 2 3 4 5 6 7
#> 1 NA 0.9830297 0.9983988 0.9864988 0.9929924 0.9857816 0.9863416
#> 2 0.9830297 NA 0.9843491 0.9843651 0.9763486 0.9750367 0.9725351
#> 8 9 10 11 12 13 14
#> 1 0.9933168 0.986761 0.9841755 0.9989573 0.9855685 0.9846035 0.9930908
#> 2 0.9851632 0.987761 0.9916033 0.9834635 0.9777169 0.9938800 0.9782626
#> 15 16 17 18 19 20 21
#> 1 0.9828032 0.9803846 0.9851096 0.9949888 0.9914763 0.9867918 0.9828618
#> 2 0.9718794 0.9642312 0.9710386 0.9825171 0.9876802 0.9715024 0.9921042
#> 22 23 24 25 26 27 28
#> 1 0.9869477 0.9741827 0.9764484 0.9710550 0.9762942 0.9845258 0.9951870
#> 2 0.9741813 0.9580035 0.9804458 0.9689326 0.9904312 0.9813561 0.9865919
#> 29 30 31 32 33 34 35
#> 1 0.9930478 0.9834912 0.9817639 0.9843587 0.9787663 0.9825811 0.9848879
#> 2 0.9892400 0.9813546 0.9868892 0.9901833 0.9625006 0.9656629 0.9936099
#> 36 37 38 39 40 41 42
#> 1 0.9875766 0.9864303 0.9888018 0.9943369 0.9930673 0.9927684 0.9582555
#> 2 0.9859731 0.9854883 0.9730000 0.9848185 0.9880384 0.9783597 0.9747957
#> 43 44 45 46 47 48 49
#> 1 0.9906166 0.9769957 0.9737340 0.9840032 0.9857631 0.9925909 0.9969415
#> 2 0.9753911 0.9723538 0.9665975 0.9936207 0.9717843 0.9818579 0.9810418
#> 50 51 52 53 54 55 56
#> 1 0.9939616 0.8787985 0.8767640 0.8699271 0.8611813 0.8665980 0.8618849
#> 2 0.9890510 0.8891779 0.8857675 0.8800797 0.8717046 0.8770543 0.8706692
#> 57 58 59 60 61 62 63
#> 1 0.8710546 0.8828167 0.8720593 0.8686078 0.8652064 0.8747268 0.8669261
#> 2 0.8792161 0.8924188 0.8826212 0.8767809 0.8764698 0.8833765 0.8792391
#> 64 65 66 67 68 69 70
#> 1 0.8628367 0.8893923 0.8814207 0.8621151 0.8757166 0.8509198 0.8732403
#> 2 0.8720305 0.8982969 0.8917362 0.8697672 0.8857771 0.8624840 0.8836673
#> 71 72 73 74 75 76 77
#> 1 0.8583561 0.8807051 0.8500239 0.8626615 0.8778953 0.8782948 0.8657454
#> 2 0.8656949 0.8910803 0.8605736 0.8723120 0.8883378 0.8886894 0.8767163
#> 78 79 80 81 82 83 84
#> 1 0.8611908 0.8656952 0.8902768 0.8724049 0.8766608 0.8793496 0.8446785
#> 2 0.8709422 0.8747360 0.9013573 0.8830589 0.8875703 0.8894959 0.8538043
#> 85 86 87 88 89 90 91
#> 1 0.8588690 0.8741404 0.8727953 0.8592435 0.8764762 0.8664927 0.8584279
#> 2 0.8659967 0.8812316 0.8826930 0.8710732 0.8845189 0.8763204 0.8675625
#> 92 93 94 95 96 97 98
#> 1 0.8676405 0.8739236 0.8811596 0.8668719 0.8761385 0.8727477 0.8758999
#> 2 0.8765825 0.8843065 0.8915685 0.8760260 0.8844786 0.8814127 0.8858268
#> 99 100 101 102 103 104 105
#> 1 0.8964148 0.8734263 0.8295873 0.8381176 0.8417902 0.8390350 0.8350159
#> 2 0.9065453 0.8825743 0.8367046 0.8466236 0.8513442 0.8477548 0.8435036
#> 106 107 108 109 110 111 112
#> 1 0.8339144 0.8364510 0.8368224 0.8316990 0.8451486 0.8567023 0.8420301
#> 2 0.8438151 0.8440218 0.8468239 0.8419688 0.8530544 0.8651199 0.8516031
#> 113 114 115 116 117 118 119
#> 1 0.8464943 0.8332577 0.8329866 0.8467428 0.8460511 0.8458024 0.8205222
#> 2 0.8557646 0.8421141 0.8408265 0.8546356 0.8549517 0.8540089 0.8310123
#> 120 121 122 123 124 125 126
#> 1 0.8364867 0.8449269 0.8404612 0.8302110 0.8511738 0.8465187 0.8475273
#> 2 0.8471114 0.8536103 0.8482617 0.8406293 0.8609328 0.8546967 0.8568925
#> 127 128 129 130 131 132 133
#> 1 0.8545021 0.8549023 0.8351637 0.8496409 0.8391304 0.8556415 0.8339724
#> 2 0.8638349 0.8633284 0.8441457 0.8597485 0.8496029 0.8644329 0.8428858
#> 134 135 136 137 138 139 140
#> 1 0.8519413 0.8335717 0.8424384 0.8408979 0.8466156 0.8559859 0.8517638
#> 2 0.8614765 0.8428927 0.8527065 0.8478738 0.8550402 0.8642394 0.8610249
#> 141 142 143 144 145 146 147
#> 1 0.8412926 0.8553618 0.8381176 0.8395422 0.8415345 0.8489861 0.8430618
#> 2 0.8498406 0.8646212 0.8466236 0.8479531 0.8494488 0.8580733 0.8531901
#> 148 149 150
#> 1 0.8505477 0.8450978 0.8470100
#> 2 0.8594950 0.8519706 0.8548892
The Levenshtein distance, or edit distance, describes how two strings differ.
It is the minimum number of elementary edits (insertions, deletions, and substitutions of single characters) needed to transform one string into another.
The distance between the two strings above is 2.
In R, the adist() function computes it:
adist("exactly the same","exactly the same") # edit distance 0
#> [,1]
#> [1,] 0
adist("exactly the same","totally different") # edit distance 14
#> [,1]
#> [1,] 14
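For readers who want to see how the count is obtained, here is a minimal sketch of the standard dynamic-programming approach to the Levenshtein distance. The function levenshtein() is written for illustration only; it is not how adist() is implemented internally.

```r
# Hypothetical helper: classic dynamic-programming edit distance.
levenshtein <- function(s, t) {
  a <- strsplit(s, "")[[1]]
  b <- strsplit(t, "")[[1]]
  n <- length(a)
  m <- length(b)
  # d[i + 1, j + 1] holds the distance between the first i characters
  # of s and the first j characters of t.
  d <- matrix(0L, n + 1, m + 1)
  d[, 1] <- 0:n
  d[1, ] <- 0:m
  for (i in seq_len(n)) {
    for (j in seq_len(m)) {
      cost <- if (a[i] == b[j]) 0L else 1L
      d[i + 1, j + 1] <- min(d[i, j + 1] + 1L,  # deletion
                             d[i + 1, j] + 1L,  # insertion
                             d[i, j] + cost)    # substitution or match
    }
  }
  d[n + 1, m + 1]
}

levenshtein("kitten", "sitting")
levenshtein("exactly the same", "totally different")
```

The second call should agree with the adist() result shown above.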
The stringdist() function in the stringdist package can also compute it.
library(stringdist)
#calculate Levenshtein distance between two strings
stringdist("exactly the same", "totally different", method = "lv")
#> [1] 14
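The stringdist package also provides many other string-processing tools, such as string matching and fuzzy search.
The distances this package provides, each with a detailed description, include:
Once the distance matrix has been computed, any clustering algorithm can be applied to it.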
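As one possible illustration, the sketch below feeds a Euclidean distance matrix into hierarchical clustering from base R. Average linkage and k = 3 are arbitrary choices made for this example; any other distance from this post or any other clustering method that accepts a dist object could be substituted.

```r
# Distance matrix on the numeric iris features.
d <- dist(iris[, -5])

# Hierarchical clustering on the distance matrix (average linkage).
hc <- hclust(d, method = "average")

# Cut the tree into three groups and compare with the known species.
clusters <- cutree(hc, k = 3)
table(clusters, iris$Species)
```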
For the details, see the paper.