Cluster Analysis:

Clustering is an unsupervised learning technique that groups observations by their similarity. There is no outcome to predict; the algorithm only looks for structure in the data.

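Why do cluster analysis?

Clustering divides samples into groups by similarity, so that members of the same group resemble each other and members of different groups differ.

Example applications: 1. Using consumer behavior data to identify distinct segments within a market. 2. Identifying distinct groups of stocks that follow similar trading patterns.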

Market segmentation and pattern grouping are both good examples where clustering is appropriate.

Market segmentation is the activity of dividing a broad consumer or business market, normally consisting of existing and potential customers, into sub-groups of consumers based on shared characteristics.

Similarity is measured by distance.

Euclidean distance formula (for two points in the plane):

\[d(x, y) = \sqrt{(x_1-y_1)^2+(x_2-y_2)^2}\]

For example, given two players, we can compute the spatial distance between them with the dist() function:
eyJsYW5ndWFnZSI6InIiLCJzYW1wbGUiOiJwbGF5ZXJzPC1tYXRyaXgocmJpbmQoYygwLDApLCBjKDMsNCkpLCBucm93ID0gMixcbiAgICAgICAgICAgICAgICBkaW1uYW1lcyA9IGxpc3QoYyhcInJlZFwiLCBcImJsdWVcIiksXG4gICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgYyhcInhcIiwgXCJ5XCIpKSlcbnBsYXllcnNcblxuZGlzdChwbGF5ZXJzKSJ9

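More generally, to compute the distance between two observations across all of their features (not only position on the x and y axes, but also other features such as speed, age, or goals scored), use the general form of the Euclidean distance: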

\[d(x, y) = \sqrt{(x_1-y_1)^2+(x_2-y_2)^2+...+(x_n-y_n)^2}\]
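As a minimal sketch of the general formula, the snippet below computes the distance between two observations by hand and with dist(). The feature vectors `a` and `b` (position, speed, age) and the helper `euclid()` are made-up illustrations, not part of the lineup data.

```r
# Hypothetical feature vectors for two players: position (x, y), speed, age
a <- c(x = 0, y = 0, speed = 5, age = 20)
b <- c(x = 3, y = 4, speed = 5, age = 20)

# Euclidean distance written out directly from the formula
euclid <- function(p, q) sqrt(sum((p - q)^2))
d_manual <- euclid(a, b)

# dist() computes the same quantity between the rows of a matrix
d_dist <- as.numeric(dist(rbind(a, b)))

print(d_manual)  # 5
print(d_dist)    # 5
```

Only x and y differ here, so the result reduces to the familiar 3-4-5 triangle.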

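Units of measurement and standardizing variables:

Height and weight are measured in different units, and the same height can be recorded in centimeters or in meters. How can distances computed from such variables be meaningful? Standardize each variable first:

\[X_{scaled}=\frac{X-mean(X)}{sd(X)}\]

After this, every variable has mean 0 and standard deviation 1.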
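The standardization can be sketched in R as follows. The `people` data frame is invented for illustration; scale() is the base R function that applies the formula to every column.

```r
# Hypothetical measurements: height in cm, weight in kg
people <- data.frame(height = c(160, 170, 180, 190),
                     weight = c(55, 65, 80, 90))

# Standardize one column by hand: subtract the mean, divide by the sd
height_manual <- (people$height - mean(people$height)) / sd(people$height)

# scale() does the same for every column at once
scaled <- scale(people)

col_means <- colMeans(scaled)        # approximately 0 for each column
col_sds   <- apply(scaled, 2, sd)    # 1 for each column

# Distances are now computed on comparable, unit-free variables
dist_scaled <- dist(scaled)
```

Without this step, whichever variable has the largest numeric range would dominate the distance.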

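Hierarchical Clustering

Key concept: linkage, the rule for computing the distance between a sample and an existing group. Picture a group of players on the field, where two players are known to be closest to each other. Of the remaining players, who is closest to those two? Find that player! Continuing this way, more and more players are drawn in layer by layer, like ripples spreading from a stone thrown into water, or like a pyramid built from the bottom up.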

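How Hierarchical Clustering Works

  1. Find the two samples closest to each other and put them in one group.

  2. Compute the distance from the other samples to that group, and merge the nearest sample into it (linkage).

  3. Keep going until all samples form one large group (cluster).

  4. Decide how many groups to use, based on domain expertise.

  5. Assign each sample to a group according to the model from the previous steps.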

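Hierarchical Clustering: Practical Steps

  1. Prepare the data and use the dist() function to get distance_matrix.

  2. Pass distance_matrix to hclust(distance_matrix, method = "complete"), choosing the linkage method (complete, single, or average).

  3. Extract the group assignments with cutree() and merge them with the original data.

  4. Analyze the clustering result.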
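The workflow above might look like the following sketch. The `positions` data frame is a made-up stand-in for real data (two obvious groups of three points); only base R functions are used.

```r
# Hypothetical positions: two well-separated groups of three players
positions <- data.frame(x = c(1, 2, 1, 20, 21, 22),
                        y = c(1, 2, 3, 20, 22, 21))

# Distance matrix between all pairs of rows
dist_matrix <- dist(positions)

# Hierarchical clustering with complete linkage
hc <- hclust(dist_matrix, method = "complete")

# Cut the tree into k = 2 groups and attach the labels to the data
clusters <- cutree(hc, k = 2)
result <- cbind(positions, cluster = clusters)

# Inspect the result: group sizes and per-group mean position
table(result$cluster)
aggregate(cbind(x, y) ~ cluster, data = result, FUN = mean)
```

cutree() numbers clusters in order of first appearance, so the first row always lands in cluster 1.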

Hierarchical Clustering: Hands-On

install.packages("mclust")

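We use hclust() to compute the linkage and cutree() to extract the predicted cluster assignments. The lineup data has already been loaded in the background: it holds the starting positions on the field of 6 players from each of two teams. We will split the players into teams based on the distances between them, and see how our split (clustering) differs from the actual teams.

We know there are two teams on the field, so k = 2.

Usage of hclust(): hclust(distance_matrix, method = "complete"). The method can also be "single" or "average".

As the call above shows, we also need distance_matrix. We discussed distance earlier: the distances from each player to every other player, collected into a matrix, form the distance_matrix.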
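To make the distance matrix and the effect of the linkage method concrete, here is a small sketch on four invented points named A to D (these are not the lineup data). It prints the full distance matrix and the height of the final merge under each linkage.

```r
# Hypothetical positions of four players standing on a line
pts <- data.frame(x = c(0, 1, 3, 7), y = c(0, 0, 0, 0))
rownames(pts) <- c("A", "B", "C", "D")

d <- dist(pts)
as.matrix(d)   # symmetric matrix of pairwise distances, zeros on the diagonal

# Height at which the last two groups are merged, for each linkage method
final_heights <- sapply(c("complete", "single", "average"),
                        function(m) max(hclust(d, method = m)$height))
print(final_heights)  # complete = 7, single = 4, average = 17/3
```

Complete linkage uses the largest pairwise distance between two groups, single linkage the smallest, and average linkage the mean, which is why the three final heights differ on the same data.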

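Now let's do it:

  1. Compute the distance matrix between the players on the field and name it dist_players.

  2. Run hierarchical clustering with the complete method of hclust() and name the result hc_players.

  3. Cut the result above at k = 2, that is, into two groups, using cutree(), and name the result clusters_k2. Then add it to the original lineup data to form a new data frame, lineup_k2_complete.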

eyJsYW5ndWFnZSI6InIiLCJwcmVfZXhlcmNpc2VfY29kZSI6ImxpbmV1cCA8LSByZWFkLmRlbGltKFwiaHR0cHM6Ly93d3cuZHJvcGJveC5jb20vcy81b2xzeGhhOXVhbTEzdXovbGluZXVwLnR4dD9kbD0xXCIpIiwic2FtcGxlIjoibGlicmFyeShkcGx5cilcblxuIyBDYWxjdWxhdGUgdGhlIERpc3RhbmNlXG5kaXN0X3BsYXllcnMgPC0gX19fXG5cbiMgUGVyZm9ybSB0aGUgaGllcmFyY2hpY2FsIGNsdXN0ZXJpbmcgdXNpbmcgdGhlIGNvbXBsZXRlIGxpbmthZ2VcbmhjX3BsYXllcnMgPC0gX19fXG5cbiMgQ2FsY3VsYXRlIHRoZSBhc3NpZ25tZW50IHZlY3RvciB3aXRoIGEgayBvZiAyXG5jbHVzdGVyc19rMiA8LSBfX19cblxuIyBDcmVhdGUgYSBuZXcgZGF0YWZyYW1lIHN0b3JpbmcgdGhlc2UgcmVzdWx0c1xubGluZXVwX2syX2NvbXBsZXRlIDwtIG11dGF0ZShsaW5ldXAsIGNsdXN0ZXIgPSBfX18pIiwic29sdXRpb24iOiJsaWJyYXJ5KGRwbHlyKVxuXG4jIENhbGN1bGF0ZSB0aGUgRGlzdGFuY2VcbmRpc3RfcGxheWVycyA8LSBkaXN0KGxpbmV1cClcblxuIyBQZXJmb3JtIHRoZSBoaWVyYXJjaGljYWwgY2x1c3RlcmluZyB1c2luZyB0aGUgY29tcGxldGUgbGlua2FnZVxuaGNfcGxheWVycyA8LSBoY2x1c3QoZGlzdF9wbGF5ZXJzLCBtZXRob2QgPSBcImNvbXBsZXRlXCIpXG5cbiMgQ2FsY3VsYXRlIHRoZSBhc3NpZ25tZW50IHZlY3RvciB3aXRoIGEgayBvZiAyXG5jbHVzdGVyc19rMiA8LSBjdXRyZWUoaGNfcGxheWVycywgayA9IDIpXG5cbiMgQ3JlYXRlIGEgbmV3IGRhdGFmcmFtZSBzdG9yaW5nIHRoZXNlIHJlc3VsdHNcbmxpbmV1cF9rMl9jb21wbGV0ZSA8LSBtdXRhdGUobGluZXVwLCBjbHVzdGVyID0gY2x1c3RlcnNfazIpIn0=

Checking our clustering result

  1. Count how many players are in each cluster.

  2. Plot the players' positions and use color to distinguish the two clusters.

eyJsYW5ndWFnZSI6InIiLCJwcmVfZXhlcmNpc2VfY29kZSI6ImxpbmV1cCA8LSByZWFkLmRlbGltKFwiaHR0cHM6Ly93d3cuZHJvcGJveC5jb20vcy81b2xzeGhhOXVhbTEzdXovbGluZXVwLnR4dD9kbD0xXCIpXG5saWJyYXJ5KGdncGxvdDIpXG5saWJyYXJ5KGRwbHlyKVxuXG4jIENhbGN1bGF0ZSB0aGUgRGlzdGFuY2VcbmRpc3RfcGxheWVycyA8LSBkaXN0KGxpbmV1cClcblxuIyBQZXJmb3JtIHRoZSBoaWVyYXJjaGljYWwgY2x1c3RlcmluZyB1c2luZyB0aGUgY29tcGxldGUgbGlua2FnZVxuaGNfcGxheWVycyA8LSBoY2x1c3QoZGlzdF9wbGF5ZXJzLCBtZXRob2QgPSBcImNvbXBsZXRlXCIpXG5cbiMgQ2FsY3VsYXRlIHRoZSBhc3NpZ25tZW50IHZlY3RvciB3aXRoIGEgayBvZiAyXG5jbHVzdGVyc19rMiA8LSBjdXRyZWUoaGNfcGxheWVycywgayA9IDIpXG5cbiMgQ3JlYXRlIGEgbmV3IGRhdGFmcmFtZSBzdG9yaW5nIHRoZXNlIHJlc3VsdHNcbmxpbmV1cF9rMl9jb21wbGV0ZSA8LSBtdXRhdGUobGluZXVwLCBjbHVzdGVyID0gY2x1c3RlcnNfazIpIiwic2FtcGxlIjoibGlicmFyeShnZ3Bsb3QyKVxubGlicmFyeShkcGx5cilcblxuIyBDb3VudCB0aGUgY2x1c3RlciBhc3NpZ25tZW50c1xuY291bnQobGluZXVwX2syX2NvbXBsZXRlLCBfX18pXG5cbiMgUGxvdCB0aGUgcG9zaXRpb25zIG9mIHRoZSBwbGF5ZXJzIGFuZCBjb2xvciB0aGVtIHVzaW5nIHRoZWlyIGNsdXN0ZXJcbmdncGxvdChsaW5ldXBfazJfY29tcGxldGUsIGFlcyh4ID0gX19fLCB5ID0gX19fLCBjb2xvciA9IGZhY3RvcihfX18pKSkgK1xuICBnZW9tX3BvaW50KCkiLCJzb2x1dGlvbiI6ImxpYnJhcnkoZ2dwbG90MilcbmxpYnJhcnkoZHBseXIpXG4jIENvdW50IHRoZSBjbHVzdGVyIGFzc2lnbm1lbnRzXG5jb3VudChsaW5ldXBfazJfY29tcGxldGUkY2x1c3RlcilcblxuIyBQbG90IHRoZSBwb3NpdGlvbnMgb2YgdGhlIHBsYXllcnMgYW5kIGNvbG9yIHRoZW0gdXNpbmcgdGhlaXIgY2x1c3RlclxuZ2dwbG90KGxpbmV1cF9rMl9jb21wbGV0ZSwgYWVzKHggPSB4LCB5ID0geSwgY29sb3IgPSBmYWN0b3IoY2x1c3RlcikpKSArXG4gIGdlb21fcG9pbnQoKSJ9

Note that different linkage methods may produce different results. Try changing the method below:

eyJsYW5ndWFnZSI6InIiLCJzYW1wbGUiOiJsaWJyYXJ5KGdncGxvdDIpXG5saWJyYXJ5KGRwbHlyKVxuXG5saW5ldXAgPC0gcmVhZC5kZWxpbShcImh0dHBzOi8vd3d3LmRyb3Bib3guY29tL3MvNW9sc3hoYTl1YW0xM3V6L2xpbmV1cC50eHQ/ZGw9MVwiKVxuIyBDYWxjdWxhdGUgdGhlIERpc3RhbmNlXG5kaXN0X3BsYXllcnMgPC0gZGlzdChsaW5ldXApXG5cbiMgUGVyZm9ybSB0aGUgaGllcmFyY2hpY2FsIGNsdXN0ZXJpbmcgdXNpbmcgdGhlIGNvbXBsZXRlIGxpbmthZ2VcbmhjX3BsYXllcnMgPC0gaGNsdXN0KGRpc3RfcGxheWVycywgbWV0aG9kID0gXCJjb21wbGV0ZVwiKVxucGxvdChoY19wbGF5ZXJzKSJ9

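Cutting the tree!

What if we do not know in advance how many groups are appropriate?

We can cut the dendrogram at a chosen height, which determines how many groups we get.

First install this package: install.packages("dendextend")

Here we use the dendrogram: pick a height (h) and cut there. The branches below the cut divide the observations into groups.

We will use the color_branches() function from the dendextend package to color the groups.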

eyJsYW5ndWFnZSI6InIiLCJzYW1wbGUiOiJsaW5ldXAgPC0gcmVhZC5kZWxpbShcImh0dHBzOi8vd3d3LmRyb3Bib3guY29tL3MvNW9sc3hoYTl1YW0xM3V6L2xpbmV1cC50eHQ/ZGw9MVwiKVxubGlicmFyeShnZ3Bsb3QyKVxubGlicmFyeShkcGx5cilcbmxpYnJhcnkoZGVuZGV4dGVuZClcblxuZGlzdF9wbGF5ZXJzIDwtIGRpc3QobGluZXVwLCBtZXRob2QgPSAnZXVjbGlkZWFuJylcbmhjX3BsYXllcnMgPC0gaGNsdXN0KGRpc3RfcGxheWVycywgbWV0aG9kID0gXCJjb21wbGV0ZVwiKVxuXG4jIENyZWF0ZSBhIGRlbmRyb2dyYW0gb2JqZWN0IGZyb20gdGhlIGhjbHVzdCB2YXJpYWJsZVxuZGVuZF9wbGF5ZXJzIDwtIGFzLmRlbmRyb2dyYW0oX19fKVxuXG4jIFBsb3QgdGhlIGRlbmRyb2dyYW1cblxuXG4jIENvbG9yIGJyYW5jaGVzIGJ5IGNsdXN0ZXIgZm9ybWVkIGZyb20gdGhlIGN1dCBhdCBhIGhlaWdodCBvZiAyMCAmIHBsb3RcbmRlbmRfMjAgPC0gY29sb3JfYnJhbmNoZXMoX19fLCBoID0gX19fKVxuXG4jIFBsb3QgdGhlIGRlbmRyb2dyYW0gd2l0aCBjbHVzdGVycyBjb2xvcmVkIGJlbG93IGhlaWdodCAyMFxuXG5cbiMgQ29sb3IgYnJhbmNoZXMgYnkgY2x1c3RlciBmb3JtZWQgZnJvbSB0aGUgY3V0IGF0IGEgaGVpZ2h0IG9mIDQwICYgcGxvdFxuZGVuZF80MCA8LSBfX19cblxuIyBQbG90IHRoZSBkZW5kcm9ncmFtIHdpdGggY2x1c3RlcnMgY29sb3JlZCBiZWxvdyBoZWlnaHQgNDAiLCJzb2x1dGlvbiI6ImxpbmV1cCA8LSByZWFkLmRlbGltKFwiaHR0cHM6Ly93d3cuZHJvcGJveC5jb20vcy81b2xzeGhhOXVhbTEzdXovbGluZXVwLnR4dD9kbD0xXCIpXG5saWJyYXJ5KGdncGxvdDIpXG5saWJyYXJ5KGRwbHlyKVxubGlicmFyeShkZW5kZXh0ZW5kKVxuXG5kaXN0X3BsYXllcnMgPC0gZGlzdChsaW5ldXAsIG1ldGhvZCA9ICdldWNsaWRlYW4nKVxuaGNfcGxheWVycyA8LSBoY2x1c3QoZGlzdF9wbGF5ZXJzLCBtZXRob2QgPSBcImNvbXBsZXRlXCIpXG5cbiMgQ3JlYXRlIGEgZGVuZHJvZ3JhbSBvYmplY3QgZnJvbSB0aGUgaGNsdXN0IHZhcmlhYmxlXG5kZW5kX3BsYXllcnMgPC0gYXMuZGVuZHJvZ3JhbShoY19wbGF5ZXJzKVxuXG4jIFBsb3QgdGhlIGRlbmRyb2dyYW1cbnBsb3QoZGVuZF9wbGF5ZXJzKVxuXG4jIENvbG9yIGJyYW5jaGVzIGJ5IGNsdXN0ZXIgZm9ybWVkIGZyb20gdGhlIGN1dCBhdCBhIGhlaWdodCBvZiAyMCAmIHBsb3RcbmRlbmRfMjAgPC0gY29sb3JfYnJhbmNoZXMoZGVuZF9wbGF5ZXJzLCBoID0gMjApXG5cbiMgUGxvdCB0aGUgZGVuZHJvZ3JhbSB3aXRoIGNsdXN0ZXJzIGNvbG9yZWQgYmVsb3cgaGVpZ2h0IDIwXG5wbG90KGRlbmRfMjApXG5cbiMgQ29sb3IgYnJhbmNoZXMgYnkgY2x1c3RlciBmb3JtZWQgZnJvbSB0aGUgY3V0IGF0IGEgaGVpZ2h0IG9mIDQwICYgcGxvdFxuZGVuZF80MCA8LSBjb2xvcl9icmFuY2hlcyhkZW5kX3BsYXllcnMsIGggPSA0MClcblxuIyBQbG90IHRoZSBkZW5kcm9ncmFtIHdp
dGggY2x1c3RlcnMgY29sb3JlZCBiZWxvdyBoZWlnaHQgNDBcbnBsb3QoZGVuZF80MCkifQ==
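Cutting by height can also be tried without any extra package, since base R's cutree() accepts a height h instead of k. The sketch below uses five invented points, not the lineup data, and shows how the number of groups depends on where the cut is made.

```r
# Hypothetical points on a line: two tight pairs and one far-away point
pts <- data.frame(x = c(0, 1, 10, 11, 30), y = 0)

hc <- hclust(dist(pts), method = "complete")
print(hc$height)  # merge heights: 1, 1, 11, 30

# A low cut keeps more groups; a higher cut merges them
groups_h5  <- cutree(hc, h = 5)    # 3 groups
groups_h20 <- cutree(hc, h = 20)   # 2 groups

table(groups_h5)
table(groups_h20)
```

Every merge that happens below the cut height stays together, so raising h can only reduce the number of groups.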

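How k-means clustering works:

  1. Choose k center points.

  2. Compute the distance from each sample to each center.

  3. Assign each sample to the group of its nearest center.

  4. This divides all the samples into k groups.

  5. Then find the center of each group.

  6. Then compute the distance from each sample to the new centers.

  7. Regroup the samples according to their distance to the new centers.

  8. Repeat until no sample changes its group.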
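The loop described above can be written out in a few lines of R. This is only a teaching sketch under simple assumptions (random data points as starting centers, Euclidean distance); `simple_kmeans()` is a hypothetical helper, and in practice the built-in kmeans() should be used.

```r
simple_kmeans <- function(data, k, max_iter = 100) {
  data <- as.matrix(data)
  n <- nrow(data)
  # Choose k distinct data points as the starting centers
  centers <- data[sample(n, k), , drop = FALSE]
  assignment <- rep(0L, n)

  for (iter in seq_len(max_iter)) {
    # Squared distance from every sample to every center (n x k matrix)
    d2 <- sapply(seq_len(k), function(j) {
      rowSums((data - matrix(centers[j, ], n, ncol(data), byrow = TRUE))^2)
    })
    # Assign each sample to its nearest center
    new_assignment <- max.col(-d2, ties.method = "first")

    # Stop once no sample changes its group
    if (all(new_assignment == assignment)) break
    assignment <- new_assignment

    # Move each center to the mean of the samples assigned to it
    for (j in seq_len(k)) {
      members <- data[assignment == j, , drop = FALSE]
      if (nrow(members) > 0) centers[j, ] <- colMeans(members)
    }
  }
  list(cluster = assignment, centers = centers)
}

# Hypothetical data: two well-separated blobs of 10 points each
set.seed(1)
pts <- rbind(matrix(rnorm(20, mean = 0, sd = 0.3), ncol = 2),
             matrix(rnorm(20, mean = 5, sd = 0.3), ncol = 2))
fit <- simple_kmeans(pts, k = 2)
table(fit$cluster)
```

Because the starting centers are random, the cluster labels (1 or 2) can swap between runs, and on less clearly separated data the final grouping itself can depend on the start.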

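Steps for k-means clustering:

  1. Decide the number of groups and fit the model with kmeans().

  2. Extract the clustering result from the model.

  3. Merge that result with the original data.

  4. Analyze the characteristics of each group.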
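A sketch of these steps with the built-in kmeans() function follows. The data frame `df` and its columns `spend` and `visits` are invented for illustration; `nstart = 10` is an optional setting that reruns the algorithm from several random starts and keeps the best fit.

```r
set.seed(42)  # kmeans() starts from random centers, so fix the seed

# Hypothetical data: two groups of 15 customers described by two features
df <- data.frame(
  spend  = c(rnorm(15, mean = 10, sd = 1), rnorm(15, mean = 50, sd = 1)),
  visits = c(rnorm(15, mean = 2, sd = 0.5), rnorm(15, mean = 8, sd = 0.5))
)

# Fit the model with the chosen number of groups
model <- kmeans(df, centers = 2, nstart = 10)

# Extract the assignments and attach them to the original data
df$cluster <- model$cluster

# Describe each group by its average feature values
group_means <- aggregate(cbind(spend, visits) ~ cluster, data = df, FUN = mean)
print(group_means)
```

In practice the features would usually be standardized first, so that no single variable dominates the distance.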

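k-means practice exercise 1:

As before, we use the lineup data, which holds the positions on the field of the players of two teams before kickoff. Since we know this is a match between two teams, k = 2 is the obvious choice.

Our goal is to send every player back to his own team.

In the kmeans() function, we set the centers argument (the k) to 2.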

eyJsYW5ndWFnZSI6InIiLCJwcmVfZXhlcmNpc2VfY29kZSI6ImxpbmV1cCA8LSByZWFkLmRlbGltKFwiaHR0cHM6Ly93d3cuZHJvcGJveC5jb20vcy81b2xzeGhhOXVhbTEzdXovbGluZXVwLnR4dD9kbD0xXCIpIiwic2FtcGxlIjoibGlicmFyeShkcGx5cilcbmxpYnJhcnkoZ2dwbG90MilcblxuIyBCdWlsZCBhIGttZWFucyBtb2RlbFxubW9kZWxfa20yIDwtIGttZWFucyhfX18sIGNlbnRlcnMgPSBfX18pXG5cbiMgRXh0cmFjdCB0aGUgY2x1c3RlciBhc3NpZ25tZW50IHZlY3RvciBmcm9tIHRoZSBrbWVhbnMgbW9kZWxcbmNsdXN0X2ttMiA8LSBfX19cblxuIyBDcmVhdGUgYSBuZXcgZGF0YWZyYW1lIGFwcGVuZGluZyB0aGUgY2x1c3RlciBhc3NpZ25tZW50XG5saW5ldXBfa20yIDwtIG11dGF0ZShfX18sIGNsdXN0ZXIgPSBfX18pXG5cbiMgUGxvdCB0aGUgcG9zaXRpb25zIG9mIHRoZSBwbGF5ZXJzIGFuZCBjb2xvciB0aGVtIHVzaW5nIHRoZWlyIGNsdXN0ZXJcbmdncGxvdChfX18sIGFlcyh4ID0gX19fLCB5ID0gX19fLCBjb2xvciA9IGZhY3RvcihfX18pKSkgK1xuICBnZW9tX3BvaW50KCkiLCJzb2x1dGlvbiI6ImxpYnJhcnkoZHBseXIpXG5saWJyYXJ5KGdncGxvdDIpXG5cbiMgQnVpbGQgYSBrbWVhbnMgbW9kZWxcbm1vZGVsX2ttMiA8LSBrbWVhbnMobGluZXVwLCBjZW50ZXJzID0gMilcblxuIyBFeHRyYWN0IHRoZSBjbHVzdGVyIGFzc2lnbm1lbnQgdmVjdG9yIGZyb20gdGhlIGttZWFucyBtb2RlbFxuY2x1c3Rfa20yIDwtIG1vZGVsX2ttMiRjbHVzdGVyXG5cbiMgQ3JlYXRlIGEgbmV3IGRhdGFmcmFtZSBhcHBlbmRpbmcgdGhlIGNsdXN0ZXIgYXNzaWdubWVudFxubGluZXVwX2ttMiA8LSBtdXRhdGUobGluZXVwLCBjbHVzdGVyID0gY2x1c3Rfa20yKVxuXG4jIFBsb3QgdGhlIHBvc2l0aW9ucyBvZiB0aGUgcGxheWVycyBhbmQgY29sb3IgdGhlbSB1c2luZyB0aGVpciBjbHVzdGVyXG5nZ3Bsb3QobGluZXVwX2ttMiwgYWVzKHggPSB4LCB5ID0geSwgY29sb3IgPSBmYWN0b3IoY2x1c3RlcikpKSArXG4gIGdlb21fcG9pbnQoKSJ9

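k-means practice exercise 2:

Data preparation:

Data download: https://archive.ics.uci.edu/ml/onlineRetailsets/Online+Retail

  1. Look at the data as a whole: how many customers are there, and how many missing values?

  2. We study the data at the customer level, so drop the rows with a missing CustomerID.

  3. Convert the date variable to a standard date format.

  4. Keep exactly one full year of data (for tidiness).

  5. Consumer behavior varies by region, so pick one country to study.

  6. Most of the data comes from the United Kingdom; I will pick a country with fewer records: Germany.

  7. Some of these invoices are purchases and some are returns, so we need to tell them apart. A C in the invoice number marks a cancellation (return).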

eyJsYW5ndWFnZSI6InIiLCJzYW1wbGUiOiJvbmxpbmVSZXRhaWwgPC0gcmVhZC5jc3YoXCJodHRwczovL3d3dy5kcm9wYm94LmNvbS9zL3plbXdrMjl2Z3ZnYjg3dC9PbmxpbmUlMjBSZXRhaWwuY3N2P2RsPTFcIilcblxuc3RyKG9ubGluZVJldGFpbCkiLCJzb2x1dGlvbiI6ImxpYnJhcnkoZHBseXIpXG5cbm9ubGluZVJldGFpbCA8LSByZWFkLmNzdihcImh0dHBzOi8vd3d3LmRyb3Bib3guY29tL3MvemVtd2syOXZndmdiODd0L09ubGluZSUyMFJldGFpbC5jc3Y/ZGw9MVwiKVxuXG5zdHIob25saW5lUmV0YWlsKVxuXG5cbm9ubGluZVJldGFpbCAlPiVcbiAgc2VsZWN0KEN1c3RvbWVySUQpICU+JVxuICBpcy5uYSgpICU+JVxuICBzdW0oKVxuXG5vbmxpbmVSZXRhaWxfbmV3IDwtIG9ubGluZVJldGFpbCAlPiVcbiAgZmlsdGVyKCFpcy5uYShDdXN0b21lcklEKSlcblxub25saW5lUmV0YWlsX25ldzwtb25saW5lUmV0YWlsX25ldyAlPiVcbiAgbXV0YXRlKEludm9pY2VEYXRlPWFzLkRhdGUoSW52b2ljZURhdGUsIGZvcm1hdCA9IFwiJW0vJWQvJXlcIikpXG5cbnJhbmdlKG9ubGluZVJldGFpbF9uZXckSW52b2ljZURhdGUpXG5vbmxpbmVSZXRhaWxfbmV3IDwtIHN1YnNldChvbmxpbmVSZXRhaWxfbmV3LCBJbnZvaWNlRGF0ZSA+PSBcIjIwMTAtMTItMDlcIilcbnJhbmdlKG9ubGluZVJldGFpbF9uZXckSW52b2ljZURhdGUpXG5cbnRhYmxlKG9ubGluZVJldGFpbF9uZXckQ291bnRyeSlcblxub25saW5lUmV0YWlsX25ldyA8LSBzdWJzZXQob25saW5lUmV0YWlsX25ldywgQ291bnRyeSA9PSBcIkdlcm1hbnlcIilcblxubGVuZ3RoKHVuaXF1ZShvbmxpbmVSZXRhaWxfbmV3JEludm9pY2VObykpXG5sZW5ndGgodW5pcXVlKG9ubGluZVJldGFpbF9uZXckQ3VzdG9tZXJJRCkpXG5cbiMgSWRlbnRpZnkgcmV0dXJuc1xub25saW5lUmV0YWlsX25ldyRpdGVtLnJldHVybiA8LSBncmVwbChcIkNcIiwgb25saW5lUmV0YWlsX25ldyRJbnZvaWNlTm8sIGZpeGVkPVRSVUUpIFxub25saW5lUmV0YWlsX25ldyRwdXJjaGFzZS5pbnZvaWNlIDwtIGlmZWxzZShvbmxpbmVSZXRhaWxfbmV3JGl0ZW0ucmV0dXJuPT1cIlRSVUVcIiwgMCwgMSkifQ==
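The preparation steps can be tried on a tiny made-up table before running them on the full file. The `invoices` data frame below only mimics a few column names of the Online Retail data; its values, and the two-digit-year date format, are assumptions for illustration.

```r
# Hypothetical invoice lines with a few of the Online Retail column names
invoices <- data.frame(
  InvoiceNo   = c("536365", "C536379", "536527", "536528", "536529"),
  CustomerID  = c(17850, 14527, 12662, NA, 12662),
  InvoiceDate = c("12/1/10", "12/9/10", "12/10/10", "12/11/10", "1/5/11"),
  Country     = c("United Kingdom", "Germany", "Germany", "Germany", "Germany"),
  stringsAsFactors = FALSE
)

# Drop rows without a CustomerID
invoices <- invoices[!is.na(invoices$CustomerID), ]

# Parse the date strings into Date objects
invoices$InvoiceDate <- as.Date(invoices$InvoiceDate, format = "%m/%d/%y")

# Keep one year of data and one country
invoices <- subset(invoices,
                   InvoiceDate >= as.Date("2010-12-09") & Country == "Germany")

# Flag returns: invoice numbers containing "C" are cancellations
invoices$item.return <- grepl("C", invoices$InvoiceNo, fixed = TRUE)
invoices$purchase.invoice <- ifelse(invoices$item.return, 0, 1)

print(invoices)
```

Three rows survive the filters here: one return and two purchases.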

For customer segmentation, what we care about most is each customer's recency of last purchase, frequency of purchase, and monetary value. These three variables, collectively known as RFM, are often used in customer segmentation for marketing purposes. For details, see Wikipedia: https://en.wikipedia.org/wiki/RFM_(customer_value)

Create customer-level data

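Our data is currently at the invoice level: customers are nested within invoices, meaning that a customer who bought several times has several invoices and therefore appears in several rows. Since our analysis is at the customer level, we need to reshape the data to the customer level.

Specifically, we want one row per customer, with no duplicates.

  1. Recency: how long has it been since each customer's last purchase?

  2. Frequency: how many purchases has the customer made?

  3. Monetary value: how much has the customer spent in total?

  4. Some customers have a negative monetary value, probably because they returned this year something bought last year. We set all negative values to 0.

  5. Recency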

eyJsYW5ndWFnZSI6InIiLCJwcmVfZXhlcmNpc2VfY29kZSI6ImxpYnJhcnkoZHBseXIpXG5cbm9ubGluZVJldGFpbCA8LSByZWFkLmNzdihcImh0dHBzOi8vd3d3LmRyb3Bib3guY29tL3MvemVtd2syOXZndmdiODd0L09ubGluZSUyMFJldGFpbC5jc3Y/ZGw9MVwiKVxuXG5vbmxpbmVSZXRhaWxfbmV3IDwtIG9ubGluZVJldGFpbCAlPiVcbiAgZmlsdGVyKCFpcy5uYShDdXN0b21lcklEKSkgJT4lXG4gIG11dGF0ZShJbnZvaWNlRGF0ZT1hcy5EYXRlKEludm9pY2VEYXRlLCBmb3JtYXQgPSBcIiVtLyVkLyV5XCIpKVxuXG5vbmxpbmVSZXRhaWxfbmV3IDwtIHN1YnNldChvbmxpbmVSZXRhaWxfbmV3LCBJbnZvaWNlRGF0ZSA+PSBcIjIwMTAtMTItMDlcIilcblxub25saW5lUmV0YWlsX25ldyA8LSBzdWJzZXQob25saW5lUmV0YWlsX25ldywgQ291bnRyeSA9PSBcIkdlcm1hbnlcIilcblxuIyBJZGVudGlmeSByZXR1cm5zXG5vbmxpbmVSZXRhaWxfbmV3JGl0ZW0ucmV0dXJuIDwtIGdyZXBsKFwiQ1wiLCBvbmxpbmVSZXRhaWxfbmV3JEludm9pY2VObywgZml4ZWQ9VFJVRSkgXG5vbmxpbmVSZXRhaWxfbmV3JHB1cmNoYXNlLmludm9pY2UgPC0gaWZlbHNlKG9ubGluZVJldGFpbF9uZXckaXRlbS5yZXR1cm49PVwiVFJVRVwiLCAwLCAxKSIsInNhbXBsZSI6ImxpYnJhcnkoZHBseXIpIiwic29sdXRpb24iOiJsaWJyYXJ5KGRwbHlyKVxucmVjZW5jeSA8LSBvbmxpbmVSZXRhaWxfbmV3ICU+JVxuICBmaWx0ZXIocHVyY2hhc2UuaW52b2ljZSA9PSAxKSAlPiVcbiAgbXV0YXRlKHJlY2VuY3kgPSBhcy5udW1lcmljKGRpZmZ0aW1lKFwiMjAxMS0xMi0xMFwiLCBhcy5EYXRlKEludm9pY2VEYXRlKSksIHVuaXRzPVwiZGF5c1wiKVxuICAgICAgICAgKVxucmVjZW5jeSA8LSBhZ2dyZWdhdGUocmVjZW5jeSB+IEN1c3RvbWVySUQsIGRhdGE9cmVjZW5jeSwgRlVOPW1pbiwgbmEucm09VFJVRSkifQ==
  6. Frequency
eyJsYW5ndWFnZSI6InIiLCJwcmVfZXhlcmNpc2VfY29kZSI6ImxpYnJhcnkoZHBseXIpXG5cbm9ubGluZVJldGFpbCA8LSByZWFkLmNzdihcImh0dHBzOi8vd3d3LmRyb3Bib3guY29tL3MvemVtd2syOXZndmdiODd0L09ubGluZSUyMFJldGFpbC5jc3Y/ZGw9MVwiKVxuXG5vbmxpbmVSZXRhaWxfbmV3IDwtIG9ubGluZVJldGFpbCAlPiVcbiAgZmlsdGVyKCFpcy5uYShDdXN0b21lcklEKSkgJT4lXG4gIG11dGF0ZShJbnZvaWNlRGF0ZT1hcy5EYXRlKEludm9pY2VEYXRlLCBmb3JtYXQgPSBcIiVtLyVkLyV5XCIpKVxuXG5vbmxpbmVSZXRhaWxfbmV3IDwtIHN1YnNldChvbmxpbmVSZXRhaWxfbmV3LCBJbnZvaWNlRGF0ZSA+PSBcIjIwMTAtMTItMDlcIilcblxub25saW5lUmV0YWlsX25ldyA8LSBzdWJzZXQob25saW5lUmV0YWlsX25ldywgQ291bnRyeSA9PSBcIkdlcm1hbnlcIilcblxuIyBJZGVudGlmeSByZXR1cm5zXG5vbmxpbmVSZXRhaWxfbmV3JGl0ZW0ucmV0dXJuIDwtIGdyZXBsKFwiQ1wiLCBvbmxpbmVSZXRhaWxfbmV3JEludm9pY2VObywgZml4ZWQ9VFJVRSkgXG5vbmxpbmVSZXRhaWxfbmV3JHB1cmNoYXNlLmludm9pY2UgPC0gaWZlbHNlKG9ubGluZVJldGFpbF9uZXckaXRlbS5yZXR1cm49PVwiVFJVRVwiLCAwLCAxKSIsInNhbXBsZSI6ImxpYnJhcnkoZHBseXIpIiwic29sdXRpb24iOiJsaWJyYXJ5KGRwbHlyKVxuXG5mcmVxdWVuY3k8LSBvbmxpbmVSZXRhaWxfbmV3ICU+JVxuICBmaWx0ZXIocHVyY2hhc2UuaW52b2ljZSA9PSAxKSAlPiVcbiAgc2VsZWN0KFwiQ3VzdG9tZXJJRFwiLFwiSW52b2ljZU5vXCIsIFwicHVyY2hhc2UuaW52b2ljZVwiKSAlPiVcbiAgYXJyYW5nZShDdXN0b21lcklEKVxuXG5mcmVxdWVuY3kgPC0gYWdncmVnYXRlKHB1cmNoYXNlLmludm9pY2UgfiBDdXN0b21lcklELCBkYXRhPWZyZXF1ZW5jeSwgRlVOPXN1bSwgbmEucm09VFJVRSlcblxuY29sbmFtZXMoZnJlcXVlbmN5KVtjb2xuYW1lcyhmcmVxdWVuY3kpPT1cInB1cmNoYXNlLmludm9pY2VcIl0gPC0gXCJmcmVxdWVuY3lcIiJ9
  7. Monetary Value of Customers
eyJsYW5ndWFnZSI6InIiLCJwcmVfZXhlcmNpc2VfY29kZSI6ImxpYnJhcnkoZHBseXIpXG5cbm9ubGluZVJldGFpbCA8LSByZWFkLmNzdihcImh0dHBzOi8vd3d3LmRyb3Bib3guY29tL3MvemVtd2syOXZndmdiODd0L09ubGluZSUyMFJldGFpbC5jc3Y/ZGw9MVwiKVxuXG5vbmxpbmVSZXRhaWxfbmV3IDwtIG9ubGluZVJldGFpbCAlPiVcbiAgZmlsdGVyKCFpcy5uYShDdXN0b21lcklEKSkgJT4lXG4gIG11dGF0ZShJbnZvaWNlRGF0ZT1hcy5EYXRlKEludm9pY2VEYXRlLCBmb3JtYXQgPSBcIiVtLyVkLyV5XCIpKVxuXG5vbmxpbmVSZXRhaWxfbmV3IDwtIHN1YnNldChvbmxpbmVSZXRhaWxfbmV3LCBJbnZvaWNlRGF0ZSA+PSBcIjIwMTAtMTItMDlcIilcblxub25saW5lUmV0YWlsX25ldyA8LSBzdWJzZXQob25saW5lUmV0YWlsX25ldywgQ291bnRyeSA9PSBcIkdlcm1hbnlcIilcblxuIyBJZGVudGlmeSByZXR1cm5zXG5vbmxpbmVSZXRhaWxfbmV3JGl0ZW0ucmV0dXJuIDwtIGdyZXBsKFwiQ1wiLCBvbmxpbmVSZXRhaWxfbmV3JEludm9pY2VObywgZml4ZWQ9VFJVRSkgXG5vbmxpbmVSZXRhaWxfbmV3JHB1cmNoYXNlLmludm9pY2UgPC0gaWZlbHNlKG9ubGluZVJldGFpbF9uZXckaXRlbS5yZXR1cm49PVwiVFJVRVwiLCAwLCAxKSIsInNhbXBsZSI6ImxpYnJhcnkoZHBseXIpIiwic29sdXRpb24iOiJsaWJyYXJ5KGRwbHlyKVxuXG4jIFRvdGFsIHNwZW50IG9uIGVhY2ggaXRlbSBvbiBhbiBpbnZvaWNlXG5saWJyYXJ5KGRwbHlyKVxuXG5hbm51YWwuc2FsZXMgPC0gb25saW5lUmV0YWlsX25ldyAlPiVcbiAgZmlsdGVyKHB1cmNoYXNlLmludm9pY2UgPT0gMSkgJT4lXG4gIG11dGF0ZShBbW91bnQgPSBRdWFudGl0eSpVbml0UHJpY2UpXG5cbiMgQWdncmVnYXRlZCB0b3RhbCBzYWxlcyB0byBjdXN0b21lclxuYW5udWFsLnNhbGVzIDwtIGFnZ3JlZ2F0ZShBbW91bnQgfiBDdXN0b21lcklELCBkYXRhPWFubnVhbC5zYWxlcywgRlVOPXN1bSwgbmEucm09VFJVRSlcbm5hbWVzKGFubnVhbC5zYWxlcylbbmFtZXMoYW5udWFsLnNhbGVzKT09XCJBbW91bnRcIl0gPC0gXCJtb25ldGFyeVwiXG5cbiMgbWVyZ2UgYWxsIHRocmVlIHZhcmlhYmxlc1xuXG5saWJyYXJ5KGRwbHlyKVxuXG5jdXN0b21lcnMgPC0gbGVmdF9qb2luKHJlY2VuY3ksIGZyZXF1ZW5jeSwgYnk9XCJDdXN0b21lcklEXCIpICU+JVxuICBsZWZ0X2pvaW4oLixhbm51YWwuc2FsZXMsIGJ5PVwiQ3VzdG9tZXJJRFwiKSBcblxuaGlzdChjdXN0b21lcnMkbW9uZXRhcnkpXG5jdXN0b21lcnMkbW9uZXRhcnkgPC0gaWZlbHNlKGN1c3RvbWVycyRtb25ldGFyeSA8IDAsIDAsIGN1c3RvbWVycyRtb25ldGFyeSlcbmhpc3QoY3VzdG9tZXJzJG1vbmV0YXJ5KSJ9
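To see the three RFM variables come together, here is a base R sketch on a made-up `purchases` table with three customers. The reference date and all values are assumptions. Frequency is counted here as the number of distinct invoices per customer, which is one possible definition.

```r
# Hypothetical purchase lines (returns already removed)
purchases <- data.frame(
  CustomerID  = c(1, 1, 2, 2, 2, 3),
  InvoiceNo   = c("A1", "A2", "B1", "B1", "B2", "C1"),
  InvoiceDate = as.Date(c("2011-11-01", "2011-12-01", "2011-06-15",
                          "2011-06-15", "2011-09-30", "2011-01-10")),
  Quantity    = c(2, 1, 5, 1, 3, 10),
  UnitPrice   = c(10, 20, 2, 8, 5, 1),
  stringsAsFactors = FALSE
)
reference_date <- as.Date("2011-12-10")  # assumed "today" for recency

# Recency: days since each customer's most recent purchase
purchases$days_ago <- as.numeric(reference_date - purchases$InvoiceDate)
recency <- aggregate(days_ago ~ CustomerID, data = purchases, FUN = min)
names(recency)[2] <- "recency"

# Frequency: number of distinct invoices per customer
frequency <- aggregate(purchases["InvoiceNo"], by = purchases["CustomerID"],
                       FUN = function(v) length(unique(v)))
names(frequency)[2] <- "frequency"

# Monetary value: total amount spent per customer
purchases$Amount <- purchases$Quantity * purchases$UnitPrice
monetary <- aggregate(Amount ~ CustomerID, data = purchases, FUN = sum)
names(monetary)[2] <- "monetary"

# One row per customer, with negative totals floored at 0
customers <- Reduce(function(a, b) merge(a, b, by = "CustomerID"),
                    list(recency, frequency, monetary))
customers$monetary <- pmax(customers$monetary, 0)
print(customers)
```

The result has exactly one row per customer, which is the shape that kmeans() expects.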
eyJsYW5ndWFnZSI6InIiLCJwcmVfZXhlcmNpc2VfY29kZSI6ImN1c3RvbWVyczwtcmVhZC5jc3YoXCJodHRwczovL3d3dy5kcm9wYm94LmNvbS9zLzNiczc4c3Jybmp1d2NtdS9jdXN0b21lcnMuY3N2P2RsPTFcIikiLCJzYW1wbGUiOiJsaWJyYXJ5KHB1cnJyKVxubGlicmFyeShnZ3Bsb3QyKVxubGlicmFyeShkcGx5cilcblxuIyBVc2UgbWFwX2RibCB0byBydW4gbWFueSBtb2RlbHMgd2l0aCB2YXJ5aW5nIHZhbHVlIG9mIGsgKGNlbnRlcnMpXG50b3Rfd2l0aGluc3MgPC0gbWFwX2RibCgxOjEwLCAgZnVuY3Rpb24oayl7XG4gIG1vZGVsIDwtIGttZWFucyh4ID0gY3VzdG9tZXJzWywyOjRdLCBjZW50ZXJzID0gaylcbiAgbW9kZWwkXG59KVxuXG4jIEdlbmVyYXRlIGEgZGF0YSBmcmFtZSBjb250YWluaW5nIGJvdGggayBhbmQgdG90X3dpdGhpbnNzXG5lbGJvd19kZiA8LSBkYXRhLmZyYW1lKFxuICBrID0gMToxMCxcbiAgdG90X3dpdGhpbnNzID0gdG90X3dpdGhpbnNzXG4pXG5cbiMgUGxvdCB0aGUgZWxib3cgcGxvdFxuZ2dwbG90KGVsYm93X2RmLCBhZXMoeCA9IGssIHkgPSB0b3Rfd2l0aGluc3MpKSArXG4gIGdlb21fbGluZSgpICtcbiAgc2NhbGVfeF9jb250aW51b3VzKGJyZWFrcyA9IDE6MTApXG5cbiMgQnVpbGQgYSBrbWVhbnMgbW9kZWxcbm1vZGVsX2ttIDwtIGttZWFucyhjdXN0b21lcnNbLDI6NF0sIClcblxuIyBFeHRyYWN0IHRoZSBjbHVzdGVyIGFzc2lnbm1lbnQgdmVjdG9yIGZyb20gdGhlIGttZWFucyBtb2RlbFxuY2x1c3Rfa20gPC0gbW9kZWxfa20kXG5cbiMgQ3JlYXRlIGEgbmV3IGRhdGFmcmFtZSBhcHBlbmRpbmcgdGhlIGNsdXN0ZXIgYXNzaWdubWVudFxuY3VzdG9tZXJzX2NsdXN0ZXIgPC0gbXV0YXRlKGN1c3RvbWVycywgY2x1c3RlciA9IClcblxuIyBQbG90IHRoZSBwb3NpdGlvbnMgb2YgdGhlIHBsYXllcnMgYW5kIGNvbG9yIHRoZW0gdXNpbmcgdGhlaXIgY2x1c3RlclxuZ2dwbG90KGN1c3RvbWVyc19jbHVzdGVyLCBhZXMoeCA9IGxvZyhyZWNlbmN5KSwgeSA9IGxvZyhtb25ldGFyeSksIGNvbG9yID0gZmFjdG9yKGNsdXN0ZXIpKSkiLCJzb2x1dGlvbiI6ImxpYnJhcnkocHVycnIpXG5saWJyYXJ5KGdncGxvdDIpXG5saWJyYXJ5KGRwbHlyKVxuICBcbiMgVXNlIG1hcF9kYmwgdG8gcnVuIG1hbnkgbW9kZWxzIHdpdGggdmFyeWluZyB2YWx1ZSBvZiBrIChjZW50ZXJzKVxudG90X3dpdGhpbnNzIDwtIG1hcF9kYmwoMToxMCwgIGZ1bmN0aW9uKGspe1xuICBtb2RlbCA8LSBrbWVhbnMoeCA9IGN1c3RvbWVyc1ssMjo0XSwgY2VudGVycyA9IGspXG4gIG1vZGVsJHRvdC53aXRoaW5zc1xufSlcblxuIyBHZW5lcmF0ZSBhIGRhdGEgZnJhbWUgY29udGFpbmluZyBib3RoIGsgYW5kIHRvdF93aXRoaW5zc1xuZWxib3dfZGYgPC0gZGF0YS5mcmFtZShcbiAgayA9IDE6MTAsXG4gIHRvdF93aXRoaW5zcyA9IHRvdF93aXRoaW5zc1xuKVxuXG4jIFBsb3QgdGhlIGVsYm93IHBsb3RcbmdncGxvdChlbGJvd19kZiwgYWVz
KHggPSBrLCB5ID0gdG90X3dpdGhpbnNzKSkgK1xuICBnZW9tX2xpbmUoKSArXG4gIHNjYWxlX3hfY29udGludW91cyhicmVha3MgPSAxOjEwKSJ9
eyJsYW5ndWFnZSI6InIiLCJwcmVfZXhlcmNpc2VfY29kZSI6ImN1c3RvbWVyczwtcmVhZC5jc3YoXCJodHRwczovL3d3dy5kcm9wYm94LmNvbS9zLzNiczc4c3Jybmp1d2NtdS9jdXN0b21lcnMuY3N2P2RsPTFcIikiLCJzYW1wbGUiOiJsaWJyYXJ5KHB1cnJyKVxubGlicmFyeShnZ3Bsb3QyKVxubGlicmFyeShkcGx5cikiLCJzb2x1dGlvbiI6IiMgQnVpbGQgYSBrbWVhbnMgbW9kZWxcbm1vZGVsX2ttIDwtIGttZWFucyhjdXN0b21lcnNbLDI6NF0sIGNlbnRlcnMgPSAyKVxuXG4jIEV4dHJhY3QgdGhlIGNsdXN0ZXIgYXNzaWdubWVudCB2ZWN0b3IgZnJvbSB0aGUga21lYW5zIG1vZGVsXG5jbHVzdF9rbSA8LSBtb2RlbF9rbSRjbHVzdGVyXG5cbiMgQ3JlYXRlIGEgbmV3IGRhdGFmcmFtZSBhcHBlbmRpbmcgdGhlIGNsdXN0ZXIgYXNzaWdubWVudFxuY3VzdG9tZXJzX2NsdXN0ZXIgPC0gbXV0YXRlKGN1c3RvbWVycywgY2x1c3RlciA9IGNsdXN0X2ttKVxuXG4jIFBsb3QgdGhlIHBvc2l0aW9ucyBvZiB0aGUgcGxheWVycyBhbmQgY29sb3IgdGhlbSB1c2luZyB0aGVpciBjbHVzdGVyXG5nZ3Bsb3QoY3VzdG9tZXJzX2NsdXN0ZXIsIGFlcyh4ID0gbG9nKHJlY2VuY3kpLCB5ID0gbG9nKG1vbmV0YXJ5KSwgY29sb3IgPSBmYWN0b3IoY2x1c3RlcikpKSArXG4gIGdlb21fcG9pbnQoKSJ9
A cool 3D plot:
eyJsYW5ndWFnZSI6InIiLCJzYW1wbGUiOiJjdXN0b21lcnM8LXJlYWQuY3N2KFwiaHR0cHM6Ly93d3cuZHJvcGJveC5jb20vcy8zYnM3OHNycm5qdXdjbXUvY3VzdG9tZXJzLmNzdj9kbD0xXCIpXG5saWJyYXJ5KHB1cnJyKVxubGlicmFyeShjYXIpXG5saWJyYXJ5KHJnbClcblxuIyBQbG90IGNsdXN0ZXJzIGluIDNEXG5cbmNvbG9ycyA8LSBjKCdyZWQnLCdvcmFuZ2UnLCdncmVlbjMnLCdkZWVwc2t5Ymx1ZScsJ2JsdWUnLCdkYXJrb3JjaGlkNCcsJ3Zpb2xldCcsJ3BpbmsxJywndGFuMycsJ2JsYWNrJylcblxuIyBCdWlsZCBhIGttZWFucyBtb2RlbFxubW9kZWxfa20gPC0ga21lYW5zKGN1c3RvbWVyc1ssMjo0XSwgY2VudGVycyA9IDIpXG5cbiMgRXh0cmFjdCB0aGUgY2x1c3RlciBhc3NpZ25tZW50IHZlY3RvciBmcm9tIHRoZSBrbWVhbnMgbW9kZWxcbmNsdXN0X2ttIDwtIG1vZGVsX2ttJGNsdXN0ZXJcblxuIyBDcmVhdGUgYSBuZXcgZGF0YWZyYW1lIGFwcGVuZGluZyB0aGUgY2x1c3RlciBhc3NpZ25tZW50XG5jdXN0b21lcnNfY2x1c3RlciA8LSBtdXRhdGUoY3VzdG9tZXJzLCBjbHVzdGVyID0gY2x1c3Rfa20pXG5cbnNjYXR0ZXIzZCh4ID0gbG9nKGN1c3RvbWVyc19jbHVzdGVyJGZyZXF1ZW5jeSksIFxuICAgICAgICAgIHkgPSBsb2coY3VzdG9tZXJzX2NsdXN0ZXIkbW9uZXRhcnkpLFxuICAgICAgICAgIHogPSBsb2coY3VzdG9tZXJzX2NsdXN0ZXIkcmVjZW5jeSksIFxuICAgICAgICAgIGdyb3VwcyA9IGZhY3RvcihjdXN0b21lcnNfY2x1c3RlciRjbHVzdGVyKSxcbiAgICAgICAgICB4bGFiID0gXCJGcmVxdWVuY3kgKExvZy10cmFuc2Zvcm1lZClcIiwgXG4gICAgICAgICAgeWxhYiA9IFwiTW9uZXRhcnkgVmFsdWUgKGxvZy10cmFuc2Zvcm1lZClcIixcbiAgICAgICAgICB6bGFiID0gXCJSZWNlbmN5IChMb2ctdHJhbnNmb3JtZWQpXCIsXG4gICAgICAgICAgc3VyZmFjZS5jb2wgPSBjb2xvcnMsXG4gICAgICAgICAgYXhpcy5zY2FsZXMgPSBGQUxTRSxcbiAgICAgICAgICBzdXJmYWNlID0gVFJVRSwgIyBwcm9kdWNlcyB0aGUgaG9yaXpvbmFsIHBsYW5lcyB0aHJvdWdoIHRoZSBncmFwaCBhdCBlYWNoIGxldmVsIG9mIG1vbmV0YXJ5IHZhbHVlXG4gICAgICAgICAgZml0ID0gXCJzbW9vdGhcIixcbiAgICAgICAgICAjICAgICBlbGxpcHNvaWQgPSBUUlVFLCAjIHRvIGdyYXBoIGVsbGlwc2VzIHVzZXMgdGhpcyBjb21tYW5kIGFuZCBzZXQgXCJzdXJmYWNlID0gXCIgdG8gRkFMU0VcbiAgICAgICAgICBncmlkID0gVFJVRSxcbiAgICAgICAgICBheGlzLmNvbCA9IGMoXCJibGFja1wiLCBcImJsYWNrXCIsIFwiYmxhY2tcIikpXG5cbnJlbW92ZShjb2xvcnMpIn0=

80/20 Rule

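The 80/20 rule (Pareto principle): 20 percent of customers account for 80 percent of spending. These 20 percent are the most valuable customers; retailers want to keep them and encourage them to buy more.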

eyJsYW5ndWFnZSI6InIiLCJzYW1wbGUiOiJjdXN0b21lcnMgPC0gY3VzdG9tZXJzICU+JVxuICBhcnJhbmdlKC1tb25ldGFyeSkifQ==
eyJsYW5ndWFnZSI6InIiLCJzYW1wbGUiOiIjIEFwcGx5IFBhcmV0byBQcmluY2lwbGUgKDgwLzIwIFJ1bGUpXG5wYXJldG8uY3V0b2ZmIDwtIDAuOCAqIHN1bShjdXN0b21lcnMkbW9uZXRhcnkpXG5jdXN0b21lcnMkcGFyZXRvIDwtIGlmZWxzZShjdW1zdW0oY3VzdG9tZXJzJG1vbmV0YXJ5KSA8PSBwYXJldG8uY3V0b2ZmLCBcIkhpZ2hcIiwgXCJMb3dcIilcbmN1c3RvbWVycyRwYXJldG8gPC0gZmFjdG9yKGN1c3RvbWVycyRwYXJldG8sIGxldmVscz1jKFwiSGlnaFwiLCBcIkxvd1wiKSwgb3JkZXJlZD1UUlVFKVxubGV2ZWxzKGN1c3RvbWVycyRwYXJldG8pIn0=
eyJsYW5ndWFnZSI6InIiLCJzYW1wbGUiOiJyb3VuZChwcm9wLnRhYmxlKHRhYmxlKGN1c3RvbWVycyRwYXJldG8pKSwgMilcblxucmVtb3ZlKHBhcmV0by5jdXRvZmYpXG5cbmN1c3RvbWVycyA8LSBjdXN0b21lcnNbb3JkZXIoY3VzdG9tZXJzJEN1c3RvbWVySUQpLF0ifQ==

In our data specifically, the top 35% of German customers account for 80% of total spending.
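The 80/20 calculation can be reproduced on a small invented table to see how the share of "High" customers is obtained. The monetary values below are made up and are not the German customer data.

```r
# Hypothetical total spending for 10 customers
customers <- data.frame(
  CustomerID = 1:10,
  monetary   = c(500, 250, 100, 40, 20, 15, 10, 8, 5, 2)
)

# Sort from highest to lowest spender
customers <- customers[order(-customers$monetary), ]

# Customers whose cumulative spending stays within 80% of the total are "High"
pareto_cutoff <- 0.8 * sum(customers$monetary)
customers$pareto <- ifelse(cumsum(customers$monetary) <= pareto_cutoff,
                           "High", "Low")

# Share of customers in each class
pareto_share <- round(prop.table(table(customers$pareto)), 2)
print(pareto_share)
```

In this toy example the two biggest spenders already account for 750 of the 950 total, so only 20% of customers are labeled "High".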