Cluster Analysis:

Clustering is an unsupervised learning technique that groups observations based on their similarity. There is no outcome to predict; the algorithm simply looks for patterns in the data.

Why do cluster analysis? To split samples into groups by similarity, so that members of the same group resemble one another while members of different groups differ.

Example applications:

  1. Using consumer behavior data to identify distinct segments within a market.
  2. Identifying distinct groups of stocks that follow similar trading patterns.

Market segmentation and pattern grouping are both good examples where clustering is appropriate.

Market segmentation is the activity of dividing a broad consumer or business market, normally consisting of existing and potential customers, into sub-groups of consumers based on some type of shared characteristics.

Measuring distance, similarity, and dissimilarity

Euclidean distance formula: \[d(x, y) = \sqrt{(x_1-y_1)^2+(x_2-y_2)^2}\] For example, to compute the spatial distance between two players on a field, use the function dist():

```r
# Two players, "red" at (0, 0) and "blue" at (3, 4)
players <- matrix(rbind(c(0, 0), c(3, 4)), nrow = 2,
                  dimnames = list(c("red", "blue"),
                                  c("x", "y")))
players

# Euclidean distance between the two players: sqrt(3^2 + 4^2) = 5
dist(players)
```

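More generally, we may want the distance between two observations across all of their features (not only the x and y coordinates in space, but also other features such as speed, age, or goals scored). The general formula for Euclidean distance is:

\[d(x, y) = \sqrt{(x_1-y_1)^2+(x_2-y_2)^2+\cdots+(x_n-y_n)^2}\] The dist() function makes life easier when working with many dimensions and observations: it returns the full distance matrix.

Example: ...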
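As a minimal sketch of the general formula, the block below runs dist() on three observations with four features each. The data frame, its column names, and its values are made up for illustration and do not come from the original data.

```r
# Hypothetical player data: three observations, four features each
players_demo <- data.frame(
  x     = c(0, 3, 6),
  y     = c(0, 4, 8),
  speed = c(5, 5, 5),
  goals = c(1, 1, 1),
  row.names = c("red", "blue", "green")
)

# dist() computes the Euclidean distance between every pair of rows
# and prints the lower triangle of the distance matrix
d_demo <- dist(players_demo, method = "euclidean")
d_demo

# Convert to a full square matrix to look up an individual pair
d_mat <- as.matrix(d_demo)
d_mat["red", "green"]
```

Because speed and goals are identical for all three hypothetical players, only the x and y differences contribute to the distances here.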

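Units of measurement and standardizing variables

Height and weight, or centimetres versus metres: how can distances between variables measured on such different scales be meaningful? Standardize!

\[X_{scaled}=\frac{X-\mathrm{mean}(X)}{\mathrm{sd}(X)}\] After this, every variable has mean 0 and standard deviation 1.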
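The standardization formula can be sketched in R as follows. The measurements are hypothetical; scale() is the base R function that applies the same centering and scaling to every column.

```r
# Hypothetical measurements: height in centimetres, weight in kilograms
athletes <- data.frame(
  height_cm = c(160, 170, 180, 190),
  weight_kg = c(55, 65, 80, 100)
)

# Manual standardization of one column: (X - mean(X)) / sd(X)
height_scaled <- (athletes$height_cm - mean(athletes$height_cm)) /
  sd(athletes$height_cm)

# scale() applies the same transformation to every column
athletes_scaled <- scale(athletes)

colMeans(athletes_scaled)      # approximately 0 for each column
apply(athletes_scaled, 2, sd)  # 1 for each column

# Distances are now computed on comparable, unit-free scales
dist(athletes_scaled)
```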

Hierarchical Clustering

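Key concept: linkage, the rule for measuring the distance between an observation and an existing group. Imagine players on a field where the two closest players have already been paired up. Among the remaining players, who is closest to that pair? Find that player and add them to the group, then repeat, drawing in more players layer by layer, like ripples spreading out from a stone thrown into water.

There are three common ways to define the distance between a third player and the group formed by the first two: the maximum (complete linkage), the minimum (single linkage), or the average (average linkage) of the distances from the third player to each of the two. Different methods can produce different clusterings.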
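To make the three linkage rules concrete, here is a small sketch with three hypothetical players A, B, and C (the positions are invented). It computes the three linkage distances by hand and compares them with the merge heights reported by hclust().

```r
# Hypothetical positions of three players
players_abc <- matrix(
  c(0, 0,
    1, 0,
    4, 4),
  ncol = 2, byrow = TRUE,
  dimnames = list(c("A", "B", "C"), c("x", "y"))
)
d_abc <- as.matrix(dist(players_abc))

# A and B are the closest pair, so they form the first group.
# Distances from the third player C to each member of that group:
d_to_group <- d_abc["C", c("A", "B")]

link_complete <- max(d_to_group)   # complete linkage: largest distance
link_single   <- min(d_to_group)   # single linkage: smallest distance
link_average  <- mean(d_to_group)  # average linkage: mean distance

# hclust() records the height of each merge; the second merge height
# is the linkage distance between C and the group {A, B}
h_complete <- hclust(dist(players_abc), method = "complete")$height[2]
h_single   <- hclust(dist(players_abc), method = "single")$height[2]
h_average  <- hclust(dist(players_abc), method = "average")$height[2]

c(complete = h_complete, single = h_single, average = h_average)
```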

Calculating linkage

install.packages("mclust")

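Use hclust() to compute the linkage and cutree() to extract the cluster assignments. The lineup data has already been loaded in the background: it holds the starting positions on the field of two teams of six players each. We will assign the players to teams based on their pairwise distances and see how our assignment (the clustering) compares with the actual teams.

We know there are two teams on the field, so k = 2.

Usage of hclust(): hclust(distance_matrix, method = "complete"). The method can also be "single" or "average".

As the call above shows, we also need a distance_matrix. As discussed earlier, it is the matrix made up of the distance from every player to every other player.

Now let's put this into practice:

  1. Compute the distance matrix between the players on the field and name it dist_players.

  2. Run hierarchical clustering with hclust() using the complete method and name the result hc_players.

  3. Cut that result into k = 2 groups with cutree(), name the result clusters_k2, and append it to the original lineup data to form a new data frame, lineup_k2_complete.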

```r
library(mclust)
library(dplyr)

# Load the players' starting positions
lineup <- read.delim("https://www.dropbox.com/s/5olsxha9uam13uz/lineup.txt?dl=1")

# Calculate the distance matrix
dist_players <- dist(lineup)

# Perform the hierarchical clustering using complete linkage
hc_players <- hclust(dist_players, method = "complete")

# Calculate the assignment vector with a k of 2
clusters_k2 <- cutree(hc_players, k = 2)

# Create a new data frame storing these results
lineup_k2_complete <- mutate(lineup, cluster = clusters_k2)
```

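Checking our clustering result

  1. Count how many players fall in each cluster.

  2. Plot the players' positions, using color to distinguish the two clusters.

  3. Check whether our result agrees with the actual teams.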

```r
library(ggplot2)
library(mclust)
library(dplyr)

# Rebuild the clustering result from the previous exercise
lineup <- read.delim("https://www.dropbox.com/s/5olsxha9uam13uz/lineup.txt?dl=1")
dist_players <- dist(lineup)
hc_players <- hclust(dist_players, method = "complete")
clusters_k2 <- cutree(hc_players, k = 2)
lineup_k2_complete <- mutate(lineup, cluster = clusters_k2)

# Count the cluster assignments
# (count() takes the data frame first, then the column to count by)
count(lineup_k2_complete, cluster)

# Plot the positions of the players and color them using their cluster
ggplot(lineup_k2_complete, aes(x = x, y = y, color = factor(cluster))) +
  geom_point()
```

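Is the number of groups correct? Is the number of players in each group correct?

Dendrogram: visualizing our clustering result!

Note that different linkage methods may produce different results.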

```r
library(ggplot2)
library(mclust)
library(dplyr)

lineup <- read.delim("https://www.dropbox.com/s/5olsxha9uam13uz/lineup.txt?dl=1")

# Calculate the distance matrix
dist_players <- dist(lineup)

# Perform the hierarchical clustering using complete linkage
hc_players <- hclust(dist_players, method = "complete")

# Draw the dendrogram
plot(hc_players)
```
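One way to check how much the linkage method matters is to cut each tree at the same k and compare the assignments side by side. The sketch below uses randomly generated positions as a stand-in for real data, so pos_demo and its values are assumptions.

```r
set.seed(1)

# Hypothetical 2-D positions standing in for real data
pos_demo <- data.frame(x = runif(12, 0, 50), y = runif(12, 0, 50))
d_pos <- dist(pos_demo)

# Cluster with each linkage method and cut every tree into k = 2 groups
methods <- c("complete", "single", "average")
assignments <- sapply(methods, function(m) {
  cutree(hclust(d_pos, method = m), k = 2)
})

# One column of cluster labels per linkage method
assignments

# Cross-tabulate two methods to see where their assignments disagree
table(complete = assignments[, "complete"], single = assignments[, "single"])
```

Cluster labels are arbitrary, so a disagreement shows up as a table whose counts do not line up in one cell per row and column.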

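Cutting the tree

Cut the dendrogram at a chosen height to determine how many groups we get.

First install this package: install.packages("dendextend")

Here we make use of the dendrogram: pick a maximum height (h) and cut there. Below the cut, the observations fall into separate groups.

We will use the color_branches() function from the dendextend package to color the groups.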

```r
lineup <- read.delim("https://www.dropbox.com/s/5olsxha9uam13uz/lineup.txt?dl=1")
library(ggplot2)
library(mclust)
library(dplyr)
library(dendextend)

dist_players <- dist(lineup, method = "euclidean")
hc_players <- hclust(dist_players, method = "complete")

# Create a dendrogram object from the hclust result
dend_players <- as.dendrogram(hc_players)

# Plot the dendrogram
plot(dend_players)

# Color branches by the clusters formed from a cut at a height of 20
dend_20 <- color_branches(dend_players, h = 20)

# Plot the dendrogram with clusters colored below height 20
plot(dend_20)

# Color branches by the clusters formed from a cut at a height of 40
dend_40 <- color_branches(dend_players, h = 40)

# Plot the dendrogram with clusters colored below height 40
plot(dend_40)
```
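The relationship between the cut height and the number of groups can also be seen without any plotting. This is a small sketch on five made-up points on a line; cutree() accepts either a height h or a number of groups k.

```r
# Hypothetical points on a line at 0, 1, 10, 11 and 30
pts_demo <- data.frame(x = c(0, 1, 10, 11, 30), y = c(0, 0, 0, 0, 0))
hc_demo <- hclust(dist(pts_demo), method = "complete")

# Heights at which successive merges happen
hc_demo$height

# Cutting low keeps more groups; cutting high keeps fewer
cut_h5  <- cutree(hc_demo, h = 5)   # cut below the merge at height 11
cut_h15 <- cutree(hc_demo, h = 15)  # cut between the merges at 11 and 30
cut_k2  <- cutree(hc_demo, k = 2)   # ask for two groups directly

cut_h5
cut_h15
cut_k2
```

In this toy case a cut at h = 15 and a request for k = 2 give the same grouping, because exactly two branches cross that height.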

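Time to raise the challenge: clustering wholesale customers!

In the player example, each position was described by just two values, x and y on the field. In the wholesale customers data, each customer has more than two features, so instead of plotting the clusters we characterize them with descriptive statistics, as we will see below.

The steps are the same as in the player example, except that here we cut the tree at height = 15,000 to determine the number of groups, and we also look at the average characteristics of the customers in each group.

Let's get to work!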

```r
library(ggplot2)
library(mclust)
library(dplyr)
library(dendextend)

# Load the wholesale customer spending data
customers_spend <- read.delim("https://www.dropbox.com/s/llrp0vxuteqkrmy/ws_customers.txt?dl=1")

dist_customers <- dist(customers_spend)
hc_customers <- hclust(dist_customers)
clust_customers <- cutree(hc_customers, h = 15000)
segment_customers <- mutate(customers_spend, cluster = clust_customers)

# Count the number of customers that fall into each cluster
count(segment_customers, cluster)

# Color the dendrogram based on the height cutoff
dend_customers <- as.dendrogram(hc_customers)
dend_colored <- color_branches(dend_customers, h = 15000)

# Plot the colored dendrogram
plot(dend_colored)

# Calculate the mean of each spending category per cluster
segment_customers %>%
  group_by(cluster) %>%
  summarise_all(mean)
```

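Cluster analysis with k-means

As before, we use the lineup data, which holds the positions on the field of the players of two teams before kick-off. Because we know this is a match between two teams, k = 2.

Our goal is to assign every player back to the right team.

In the kmeans() function, we set the centers argument (the value of k) to 2.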

```r
library(dplyr)
library(ggplot2)

lineup <- read.delim("https://www.dropbox.com/s/5olsxha9uam13uz/lineup.txt?dl=1")

# Build a kmeans model
model_km2 <- kmeans(lineup, centers = 2)

# Extract the cluster assignment vector from the kmeans model
clust_km2 <- model_km2$cluster

# Create a new data frame appending the cluster assignment
lineup_km2 <- mutate(lineup, cluster = clust_km2)

# Plot the positions of the players and color them using their cluster
ggplot(lineup_km2, aes(x = x, y = y, color = factor(cluster))) +
  geom_point()
```

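Note: k can take any value. As long as it is smaller than the number of observations, the function will run and return a result, but be careful about whether that result is actually meaningful.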
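A short sketch of that caution: the data below are simulated as two clearly separated groups (an assumption made for illustration), yet kmeans() will return four clusters when asked for four. The nstart argument reruns the algorithm from several random starts and keeps the best solution.

```r
set.seed(42)

# Hypothetical data: two tight, well-separated groups of six points
two_groups <- data.frame(
  x = c(rnorm(6, mean = 10, sd = 1), rnorm(6, mean = 40, sd = 1)),
  y = c(rnorm(6, mean = 20, sd = 1), rnorm(6, mean = 20, sd = 1))
)

# k = 2 matches how the data were generated
model_k2 <- kmeans(two_groups, centers = 2, nstart = 20)

# k = 4 also runs, but splits the two real groups into smaller pieces
model_k4 <- kmeans(two_groups, centers = 4, nstart = 20)

table(model_k2$cluster)
table(model_k4$cluster)

# A larger k lowers the total within-cluster sum of squares,
# which by itself does not show that the extra clusters are meaningful
model_k2$tot.withinss
model_k4$tot.withinss
```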

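Elbow rule

In the previous example we knew the value of k in advance. What if we don't? Use the elbow rule: fit models for increasing values of k and compute the total within-cluster sum of squares for each one. This quantity keeps shrinking as the number of clusters grows, so we pick the k at the "elbow" of the curve, the point where the sharp drop ends and adding more clusters brings only small improvements.

Note that we use map_dbl() from the purrr package to run kmeans() for each k from 1 to 10.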

INSTRUCTIONS

  1. Use map_dbl() to run kmeans() using the lineup data for k values ranging from 1 to 10 and extract the total within-cluster sum of squares value from each model: model$tot.withinss
  2. Store the resulting vector as tot_withinss
  3. Build a new dataframe elbow_df containing the values of k and the vector of total within-cluster sum of squares
  4. Use the values in elbow_df to plot a line plot showing the relationship between k and total within-cluster sum of squares.

HINT: map_dbl() applies a function to each element of a vector or list and returns a numeric vector. In this case the vector is 1:10 and the function runs kmeans(). For the elbow_df data frame, make sure each tot_withinss value is paired with its value of k.
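The instructions above can be sketched as follows. To keep the block self-contained it uses simulated positions in place of the real lineup data; lineup_demo and its values are assumptions, and the same calls should apply to lineup itself.

```r
library(purrr)
library(ggplot2)

set.seed(42)

# Stand-in for lineup: two hypothetical teams of six players with x/y positions
lineup_demo <- data.frame(
  x = c(rnorm(6, mean = 10, sd = 3), rnorm(6, mean = 40, sd = 3)),
  y = c(rnorm(6, mean = 20, sd = 5), rnorm(6, mean = 20, sd = 5))
)

# Run kmeans() for k = 1..10 and keep the total within-cluster sum of squares
tot_withinss <- map_dbl(1:10, function(k) {
  model <- kmeans(x = lineup_demo, centers = k)
  model$tot.withinss
})

# Pair every k with its total within-cluster sum of squares
elbow_df <- data.frame(
  k = 1:10,
  tot_withinss = tot_withinss
)

# Line plot of k against total within-cluster sum of squares
ggplot(elbow_df, aes(x = k, y = tot_withinss)) +
  geom_line() +
  scale_x_continuous(breaks = 1:10)
```

The elbow is read off the plot by eye, so two readers may reasonably pick neighbouring values of k.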

The exercise embedded at this point runs the same kind of loop for average silhouette width, using pam() on the customers_spend data (its steps are listed under Silhouette analysis):

```r
library(dplyr)
library(purrr)
library(ggplot2)
library(cluster)  # provides pam()

customers_spend <- read.delim("https://www.dropbox.com/s/llrp0vxuteqkrmy/ws_customers.txt?dl=1")

# Use map_dbl to run many models with varying values of k
sil_width <- map_dbl(2:10, function(k) {
  model <- pam(x = customers_spend, k = k)
  model$silinfo$avg.width
})

# Generate a data frame containing both k and sil_width
sil_df <- data.frame(
  k = 2:10,
  sil_width = sil_width
)

# Plot the relationship between k and sil_width
ggplot(sil_df, aes(x = k, y = sil_width)) +
  geom_line() +
  scale_x_continuous(breaks = 2:10)
```

Silhouette analysis

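Another way to choose the value of k!

Here is how it works:

Silhouette analysis computes, for each observation, how similar it is to its own cluster and how similar it is to the other clusters, compares the two, and produces a value in [-1, 1]. A value of 1 means the observation fits well in its assigned cluster; 0 means it sits on the border and could belong either to its current cluster or to another one; -1 means the observation is badly assigned and really belongs in a different cluster.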
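For reference, the usual definition of the silhouette width can be written as below. The symbols are notation introduced here for illustration: a(i) is the mean distance from observation i to the other members of its own cluster, and b(i) is the smallest mean distance from i to the members of any other cluster.

```latex
\[
s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}, \qquad -1 \le s(i) \le 1
\]
```

When a(i) is much smaller than b(i), s(i) is close to 1; when the two are equal, s(i) = 0; when a(i) is much larger than b(i), s(i) approaches -1.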

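In the exercise below, we use pam() and silhouette() from the cluster package to run a silhouette analysis, then compare the models for k = 2 and k = 3. Look closely at the final plot for k = 3: does every observation sit comfortably in the cluster we assigned it to?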

```r
library(dplyr)
library(cluster)

lineup <- read.delim("https://www.dropbox.com/s/5olsxha9uam13uz/lineup.txt?dl=1")

# Fit a PAM (partitioning around medoids) model with k = 2
pam_k2 <- pam(lineup, k = 2)

# Plot the silhouette visual for the pam_k2 model
plot(silhouette(pam_k2))

# Fit a PAM model with k = 3
pam_k3 <- pam(lineup, k = 3)

# Plot the silhouette visual for the pam_k3 model
plot(silhouette(pam_k3))
```

You have probably guessed what comes next: trying our new method for choosing k on the ws_customers data!

The steps are as follows:

  1. Use map_dbl() to run pam() using the customers_spend data for k values ranging from 2 to 10 and extract the average silhouette width value from each model: model$silinfo$avg.width. Store the resulting vector as sil_width.
  2. Build a new dataframe sil_df containing the values of k and the vector of average silhouette widths
  3. Use the values in sil_df to plot a line plot showing the relationship between k and average silhouette width

HINT: If you are unsure about any of these steps I recommend you to revisit Exercise 5 of this chapter.

Good: we have now found that k = 2 is the most appropriate choice, so let's use k = 2 to take a closer look at our customers!

  1. Build a k-means model called model_customers for the customers_spend data using the kmeans() function with centers = 2.
  2. Extract the vector of cluster assignments from the model model_customers$cluster and store this in the variable clust_customers.
  3. Append the cluster assignments as a column cluster to the customers_spend data frame and save the results to a new dataframe called segment_customers.
  4. Calculate the size of each cluster using count().

```r
customers_spend <- read.delim("https://www.dropbox.com/s/llrp0vxuteqkrmy/ws_customers.txt?dl=1")
library(purrr)
library(cluster)
library(dplyr)
library(ggplot2)
set.seed(42)

# Build a k-means model for customers_spend with a k of 2
model_customers <- kmeans(customers_spend, centers = 2)

# Extract the vector of cluster assignments from the model
clust_customers <- model_customers$cluster

# Build the segment_customers data frame
segment_customers <- mutate(customers_spend, cluster = clust_customers)

# Calculate the size of each cluster
count(segment_customers, cluster)

# Calculate the mean of each spending category per cluster
segment_customers %>%
  group_by(cluster) %>%
  summarise_all(mean)
```

Both of these results are valid, but deciding which one is appropriate would require more subject matter expertise. Remember: generating clusters is a science, but interpreting them is an art.

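Choosing the "best" value of k:

  1. There is no single best value in isolation.
  2. It depends on the question you are studying.
  3. It depends on how well you know the data.
  4. It depends on your domain expertise.

What we learned:

  1. What distance is
  2. Why standardization matters
  3. How linkage works
  4. How a dendrogram is drawn
  5. How to analyze your clusters
  6. How k-means works
  7. How to estimate k
  8. How to assess how well an observation fits a cluster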