Cutting the tree!

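What if we don't know in advance how many groups are appropriate?

We can cut the dendrogram at a chosen height, and the cut determines how many groups we get.

First, install the package: install.packages("dendextend")

Here we use the dendrogram: pick a maximum height (h) and cut there. Below the cut, the observations fall into separate groups.

We will use the color_branches() function from the dendextend package to color the branches by group.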

lineup <- read.delim("https://www.dropbox.com/s/5olsxha9uam13uz/lineup.txt?dl=1")
library(ggplot2)
library(dplyr)
library(dendextend)
dist_players <- dist(lineup, method = 'euclidean')
hc_players <- hclust(dist_players, method = "complete")
plot(hc_players)

# Create a dendrogram object from the hclust variable
dend_players <- as.dendrogram(hc_players)

# Plot the dendrogram
plot(dend_players)

# Color branches by cluster formed from the cut at a height of 20 & plot
dend_20 <- color_branches(dend_players, h = 20)

# Plot the dendrogram with clusters colored below height 20
plot(dend_20)

What if we cut at h = 40 instead?

# Color branches by cluster formed from the cut at a height of 40 & plot
dend_40 <- color_branches(dend_players, h = 40)

# Plot the dendrogram with clusters colored below height 40
plot(dend_40)
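
color_branches() only colors the plot. To see how the cut height changes the number of groups, and to get the group labels themselves, one option is cutree() from base R's stats package with the same kind of h argument. The sketch below uses a small made-up data set (toy, hc_toy and the two heights are assumptions for illustration, not the lineup data), and calls stats::cutree explicitly because dendextend masks cutree when it is loaded.

```r
# Hypothetical toy data: two well-separated squares of 2-D points
# (a stand-in for the lineup data, which is not reproduced here).
toy <- data.frame(
  x = c(0, 1, 0, 1, 50, 51, 50, 51),
  y = c(0, 0, 1, 1, 50, 50, 51, 51)
)

hc_toy <- hclust(dist(toy, method = "euclidean"), method = "complete")

# Cut the same tree at a low and a higher height.
clusters_low  <- stats::cutree(hc_toy, h = 1.2)
clusters_high <- stats::cutree(hc_toy, h = 5)

# A lower cut gives more, smaller groups; a higher cut gives fewer, larger ones.
table(clusters_low)
table(clusters_high)
```

The same idea should carry over to hc_players: a cut at h = 40 can never produce more groups than a cut at h = 20.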

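How k-means clustering works:

  1. Choose k center points.

  2. Compute the distance from each observation to every center.

  3. Assign each observation to the group of its nearest center.

  4. This splits all the observations into k groups.

  5. Then find the center of each group.

  6. Next, recompute the distance from each observation to the new centers.

  7. Regroup the observations according to their distances to the new centers.

  8. Repeat until no observation changes its group.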
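
The loop above can be sketched in a few lines of base R. This is a bare-bones illustration only, not how kmeans() is implemented: the function name simple_kmeans, its arguments, and the toy data are all hypothetical, and it starts from k randomly chosen observations as the initial centers.

```r
# Illustrative k-means loop; simple_kmeans is a made-up name, not a package function.
simple_kmeans <- function(points, k, max_iter = 100) {
  points <- as.matrix(points)
  # Step 1: pick k distinct observations as the initial centers
  centers <- points[sample(nrow(points), k), , drop = FALSE]
  cluster <- rep(0L, nrow(points))

  for (iter in seq_len(max_iter)) {
    # Steps 2-4 (and 6-7 on later passes): squared Euclidean distance from
    # every point to every center, then assign each point to the nearest one
    dists <- sapply(seq_len(k), function(j) {
      rowSums(sweep(points, 2, centers[j, ])^2)
    })
    new_cluster <- max.col(-dists, ties.method = "first")

    # Step 8: stop once no point changes its group
    if (all(new_cluster == cluster)) break
    cluster <- new_cluster

    # Step 5: move each center to the mean of the points assigned to it
    for (j in seq_len(k)) {
      members <- points[cluster == j, , drop = FALSE]
      if (nrow(members) > 0) centers[j, ] <- colMeans(members)
    }
  }
  list(cluster = cluster, centers = centers)
}

# Hypothetical toy data: two well-separated groups of four points each
toy <- data.frame(
  x = c(0, 1, 0, 1, 50, 51, 50, 51),
  y = c(0, 0, 1, 1, 50, 50, 51, 51)
)
fit <- simple_kmeans(toy, k = 2)
fit$cluster
```

Which group ends up labelled 1 and which 2 depends on the random starting centers, so only the grouping itself is meaningful, not the label numbers.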

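Steps for a k-means cluster analysis:

  1. Decide on the number of groups.

  2. Extract the cluster assignments from the fitted model.

  3. Merge the assignments with the original data.

  4. Analyze the characteristics of each group.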

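k-means hands-on exercise 1:

As before, we use the lineup data, which holds the on-field positions of the players of two teams before kickoff. Since we know this is a match between two teams, k = 2 is the natural choice.

Our goal is to sort every player into the right team.

In the kmeans() function, the number of groups is passed through the centers argument, so we set centers = 2.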

lineup <- read.delim("https://www.dropbox.com/s/5olsxha9uam13uz/lineup.txt?dl=1")
library(dplyr)
library(ggplot2)

# Build a kmeans model
model_km2 <- kmeans(lineup, centers = 2)

# Extract the cluster assignment vector from the kmeans model
clust_km2 <- model_km2$cluster

# Create a new dataframe appending the cluster assignment
lineup_km2 <- mutate(lineup, cluster = clust_km2)
# Plot the positions of the players and color them using their cluster
ggplot(lineup_km2, aes(x = x, y = y, color = factor(cluster))) +
  geom_point()
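
The plot covers the merge step. The last step, looking at the characteristics of each group, could be sketched as below. The toy data frame, the seed, and nstart = 10 are assumptions for illustration and are not part of the original exercise; base R's cbind() and aggregate() are used in place of mutate() only to keep the sketch free of package dependencies.

```r
# Hypothetical toy positions standing in for the lineup data
toy <- data.frame(
  x = c(-10, -12, -8, -11, 10, 12, 8, 11),
  y = c(  0,   5, -5,   2,  0,  5, -5,  2)
)

# kmeans() starts from random centers, so fix the seed for reproducibility
# and try several random starts, keeping the best one.
set.seed(42)
model_toy <- kmeans(toy, centers = 2, nstart = 10)

# Append the cluster assignment to the data
toy_km <- cbind(toy, cluster = model_toy$cluster)

# Size of each cluster and the mean position of its members
cluster_sizes <- table(toy_km$cluster)
cluster_means <- aggregate(cbind(x, y) ~ cluster, data = toy_km, FUN = mean)

cluster_sizes
cluster_means
```

For the lineup data, the same two summaries on lineup_km2 would show how many players landed in each team and roughly where each team stands on the field.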