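Main topic: clustering articles by their term-frequency table
Learning objectives:
- Clustering articles by their term-frequency table
- Hierarchical Cluster Analysis
- Choosing the number of clusters from the dendrogram
- Choosing the number of clusters from the application
- K-Means Cluster Analysis
- Inferring the topic of a document collection from its most common words
Group discussion: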
rm(list=ls(all.names=TRUE))
Sys.setlocale("LC_ALL","C")
options(digits=4, scipen=12)
library(dplyr)
1. Hierarchical Clustering
1.1 Document-term matrix, distance matrix, and hierarchical clustering
Let’s start by building a hierarchical clustering model. First, read the data set into R. Then, compute the distances (using method="euclidean"), and use hclust to build the model (using method="ward.D"). You should cluster on all of the variables.
MIT approach
# dailykos = read.csv("data/dailykos.csv")
# kosDist = dist(dailykos, method="euclidean")
# kosHierClust = hclust(kosDist, method="ward.D")
D = read.csv('data/dailykos.csv')
Warning message:
In scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
EOF within quoted string
dim(D)
[1] 3430 1545
# Document-term matrix (term-frequency table)
D[1:20, 1:10]
# Distance matrix
t0 = Sys.time()
d = dist(D, method="euclidean")
Sys.time() - t0
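1.1 Group discussion
Compute the matrix of pairwise distances between the data points with the Euclidean method (method="euclidean") and store it in the object d.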
Running the dist function will probably take you a while. Why? Select all that apply.
We have a lot of observations, so it takes a long time to compute the distance between each pair of observations.
We have a lot of variables, so the distance computation is long.
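Both factors contribute to the running time. The rough sketch below counts the work involved, using the dimensions reported by dim(D) above. The small matrix X is a hypothetical stand-in made of random counts, used only to check the pair count; it is not the dailykos data.

```r
# Dimensions reported by dim(D) above
n_docs  = 3430
n_words = 1545

# dist() computes one distance per unordered pair of rows
n_pairs = choose(n_docs, 2)
# each Euclidean distance visits every column once
n_coord_diffs = n_pairs * n_words

# Check the pair count on a small hypothetical matrix (random counts, not dailykos)
set.seed(1)
X = matrix(rpois(50 * 8, lambda = 1), nrow = 50)
d_small = dist(X, method = "euclidean")
length(d_small) == choose(50, 2)   # a "dist" object stores the lower triangle only
```

The number of pairs grows with the square of the number of observations, while the cost of each pair grows linearly with the number of variables.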
# Hierarchical clustering analysis
t0 = Sys.time()
hc = hclust(d, method='ward.D')
Sys.time() - t0
Plot the dendrogram of your hierarchical clustering model.
plot(hc)
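1.2 Choosing the number of clusters from the dendrogram
Just looking at the dendrogram, which of the following seem like good choices for the number of clusters? Select all that apply.
MIT approach
plot(kosHierClust)
1.2 Group discussion
Based on the dendrogram, 2 or 3 clusters are good choices. The vertical distances at those levels are long, which leaves plenty of room to draw the cut line and means the clusters differ greatly from one another; each of these splits also separates off a large share of the data.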
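The "long vertical distance" in the dendrogram can also be read from the merge heights stored in the hclust object. The sketch below runs on hypothetical synthetic points rather than dailykos, and the gap table is only a reading aid, not a rule from the course.

```r
# Hypothetical data: three well-separated groups of 2-D points (not dailykos)
set.seed(1)
X = rbind(matrix(rnorm(40, mean = 0),   ncol = 2),
          matrix(rnorm(40, mean = 100), ncol = 2),
          matrix(rnorm(40, mean = 200), ncol = 2))
hc_toy = hclust(dist(X), method = "ward.D")

# hc_toy$height holds the n-1 merge heights in merge order
# (non-decreasing for Ward linkage); the last entries are the
# long vertical segments at the top of the dendrogram
top  = rev(tail(hc_toy$height, 6))   # six largest merge heights, largest first
gaps = -diff(top)                    # vertical room between consecutive merges

# gaps[j] is the gap a horizontal cut passes through to obtain j+1 clusters
data.frame(k = 2:6, gap = gaps)
```

A large gap for a given k corresponds to a long vertical stretch in plot(hc) where the cut for k clusters can be drawn.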
1.3 Choosing the number of clusters from the application
In this problem, we are trying to cluster news articles or blog posts into groups. This can be used to show readers categories to choose from when trying to decide what to read. Just thinking about this application, what are good choices for the number of clusters? Select all that apply.
Splitting into 2 or 3 clusters would be too broad to be helpful; 7 or 8 clusters is more appropriate and reasonable.
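1.4 Splitting the data by cluster
MIT approach
# L is the list of per-cluster data frames created by split(D, kg) below
L[[1]] %>% colMeans %>% sort %>% tail
     state republican       poll   democrat      kerry       bush
    0.7575     0.7591     0.9036     0.9194     1.0624     1.7054
1.4 Group discussion
Use the cutree function to split the data into seven groups.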
Let’s pick 7 clusters. This number is reasonable according to the dendrogram, and also seems reasonable for the application. Use the cutree function to split your data into 7 clusters.
kg = cutree(hc, k=7)
L = split(D, kg)
Now, we don’t really want to run tapply on every single variable when we have over 1,000 different variables. Let’s instead use the subset function to subset our data by cluster. Create 7 new datasets, each containing the observations from one of the clusters.
How many observations are in cluster 3?
nrow(L[[3]])
table(kg) %>% sort
Which cluster has the most observations?
Which cluster has the fewest observations?
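The exercise text suggests subset, while the notebook code uses split. The sketch below compares the two on a hypothetical toy table; the names toy, grp and L_toy are illustrative and do not appear in the original code.

```r
# Hypothetical toy data: 6 "documents", 2 word columns, one cluster label per row
toy = data.frame(bush = c(2, 0, 1, 3, 0, 1),
                 iraq = c(0, 4, 0, 0, 5, 1))
grp = c(1, 2, 1, 1, 2, 3)

# MIT-style: one subset() call per cluster
HierCluster1 = subset(toy, grp == 1)
HierCluster2 = subset(toy, grp == 2)
HierCluster3 = subset(toy, grp == 3)

# Notebook-style: a single split() call returns a named list of data frames
L_toy = split(toy, grp)

sapply(L_toy, nrow)                        # cluster sizes, same counts as table(grp)
nrow(L_toy[["2"]]) == nrow(HierCluster2)   # both routes select the same rows
```

With split the clusters stay together in one list, so a later sapply can process all of them in a single call.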
1.5 Finding the most frequent words in cluster 1
Instead of looking at the average value in each variable individually, we’ll just look at the top 6 words in each cluster. To do this for cluster 1, type the following in your R console (where “HierCluster1” should be replaced with the name of your first cluster subset):
tail(sort(colMeans(HierCluster1)))
This computes the mean frequency values of each of the words in cluster 1, and then outputs the 6 words that occur the most frequently. The colMeans function computes the column (word) means, the sort function orders the words in increasing order of the mean values, and the tail function outputs the last 6 words listed, which are the ones with the largest column means.
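The three steps can be traced on a tiny example. The table below is hypothetical (four made-up word columns, three documents), and it keeps the top 2 words instead of 6 so the result is easy to check by hand.

```r
# Hypothetical toy word-count table (rows = documents, columns = words)
toy = data.frame(bush  = c(3, 5, 1),
                 kerry = c(1, 2, 0),
                 iraq  = c(0, 0, 6),
                 poll  = c(0, 1, 0))

word_means   = colMeans(toy)          # mean count per word (column)
sorted_means = sort(word_means)       # increasing order of the means
top2         = tail(sorted_means, 2)  # last 2 entries = the 2 largest means
names(top2)                           # "iraq" "bush"
```

The most frequent word is the last name printed, because sort puts the largest mean at the end.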
MIT approach
tail(sort(colMeans(HierCluster1)))
print("bush")
What is the most frequent word in this cluster, in terms of average value? Enter the word exactly how you see it in the output:
L[[1]] %>% colMeans %>% sort %>% tail
1.6 Finding the most frequent words in each cluster
Now repeat the command given in the previous problem for each of the other clusters, and answer the following questions.
MIT
tail(sort(colMeans(HierCluster2)))
tail(sort(colMeans(HierCluster3)))
tail(sort(colMeans(HierCluster4)))
tail(sort(colMeans(HierCluster5)))
tail(sort(colMeans(HierCluster6)))
tail(sort(colMeans(HierCluster7)))
sapply(L, function(x) x %>% colMeans %>% sort %>% tail %>% names) %>% t
Which words best describe cluster 2?
Which cluster could best be described as the cluster related to the Iraq war?
In 2004, one of the candidates for the Democratic nomination for the President of the United States was Howard Dean, John Kerry was the candidate who won the Democratic nomination, and John Edwards was the running mate of John Kerry (the Vice President nominee). Given this information,
which cluster best corresponds to the Democratic Party?
2 K-Means Clustering
2.1 K-means cluster analysis
Now, run k-means clustering, setting the seed to 1000 right before you run the kmeans function. Again, pick the number of clusters equal to 7. You don’t need to add the iter.max argument.
MIT
set.seed(1000)
KmeansCluster = kmeans(dailykos, centers=7)
KmeansCluster1 = subset(dailykos, KmeansCluster$cluster == 1)
KmeansCluster2 = subset(dailykos, KmeansCluster$cluster == 2)
KmeansCluster3 = subset(dailykos, KmeansCluster$cluster == 3)
KmeansCluster4 = subset(dailykos, KmeansCluster$cluster == 4)
KmeansCluster5 = subset(dailykos, KmeansCluster$cluster == 5)
KmeansCluster6 = subset(dailykos, KmeansCluster$cluster == 6)
KmeansCluster7 = subset(dailykos, KmeansCluster$cluster == 7)
set.seed(1000)
km = kmeans(D, 7)
kg2 = km$cluster
table(km$cluster) %>% sort
Subset your data into the 7 clusters (7 new datasets) by using the “cluster” variable of your kmeans output.
How many observations are in Cluster 3?
Which cluster has the most observations?
Which cluster has the fewest number of observations?
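The seed matters because kmeans picks its starting centers at random. The sketch below shows this on hypothetical synthetic points (not dailykos): resetting the same seed immediately before each call reproduces the same assignment. The iter.max value of 100 is an arbitrary illustration, not a course recommendation.

```r
# Hypothetical data: 150 random 2-D points around three centers (not dailykos)
set.seed(42)
X = rbind(matrix(rnorm(100, mean = 0),  ncol = 2),
          matrix(rnorm(100, mean = 5),  ncol = 2),
          matrix(rnorm(100, mean = 10), ncol = 2))

# Same seed immediately before each call -> same random starting centers
set.seed(1000); km_a = kmeans(X, centers = 3)
set.seed(1000); km_b = kmeans(X, centers = 3)
identical(km_a$cluster, km_b$cluster)

# iter.max (default 10) caps the number of iterations; it can be raised
# if kmeans() warns that it did not converge
set.seed(1000); km_c = kmeans(X, centers = 3, iter.max = 100)

table(km_a$cluster)   # cluster sizes, as with table(km$cluster) above
```

Cluster numbers from kmeans are arbitrary labels, so a different seed can give the same groups under different numbers.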
2.2 Finding the most frequent words in each cluster
Now, output the six most frequent words in each cluster, like we did in the previous problem, for each of the k-means clusters.
MIT
tail(sort(colMeans(KmeansCluster1)))
tail(sort(colMeans(KmeansCluster2)))
tail(sort(colMeans(KmeansCluster3)))
tail(sort(colMeans(KmeansCluster4)))
tail(sort(colMeans(KmeansCluster5)))
tail(sort(colMeans(KmeansCluster6)))
tail(sort(colMeans(KmeansCluster7)))
split(D, kg2) %>% sapply(function(x)
x %>% colMeans %>% sort %>% tail %>% names) %>% t
Which k-means cluster best corresponds to the Iraq War?
Which k-means cluster best corresponds to the democratic party? (Remember that we are looking for the names of the key democratic party leaders.)
2.3 ~ 2.6 Correspondence between the two clustering results
For the rest of this problem, we’ll ask you to compare how observations were assigned to clusters in the two different methods. Use the table function to compare the cluster assignment of hierarchical clustering to the cluster assignment of k-means clustering.
MIT
table(hierGroups, KmeansCluster$cluster)
table(Hierarchical=kg, KMeans=kg2)
Which Hierarchical Cluster best corresponds to K-Means Cluster 2?
Which Hierarchical Cluster best corresponds to K-Means Cluster 3?
Which Hierarchical Cluster best corresponds to K-Means Cluster 7?
- No Hierarchical Cluster contains at least half of the points in K-Means Cluster 7.
Which Hierarchical Cluster best corresponds to K-Means Cluster 6?
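Reading the best match off the cross-table can also be done programmatically. The sketch below uses hypothetical label vectors h and k, not the dailykos assignments, and applies the "at least half of the points" criterion mentioned in the answer above.

```r
# Hypothetical cluster labels for 10 documents (not the dailykos assignments)
h = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 3)   # hierarchical clusters
k = c(2, 2, 2, 1, 1, 3, 3, 3, 1, 2)   # k-means clusters
tab = table(Hierarchical = h, KMeans = k)

# For each k-means cluster (column): the hierarchical cluster with the largest overlap
best  = apply(tab, 2, which.max)
# and the share of that k-means cluster's points it contains
share = apply(tab, 2, max) / colSums(tab)

# A match only counts if it covers at least half of the k-means cluster
data.frame(KMeans        = colnames(tab),
           best_hier     = rownames(tab)[best],
           share         = round(as.numeric(share), 2),
           at_least_half = as.numeric(share) >= 0.5)
```

With the notebook's objects, the same steps would start from table(Hierarchical=kg, KMeans=kg2).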
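[Discussion questions]
What is a term-frequency table? What is its data format?
- The term-frequency table is represented as a document-term matrix: every word that appears in any document is spread out across the columns, and each row is one document. A 0 in the abandon column of row 1 means that document 1 does not contain the word abandon; a 1 in the abstain column of row 3 means that abstain appears once in document 3. Every other value is read the same way.
- It is a document-term matrix; calling the class function shows that R stores it as a data.frame.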
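The format described in the answers above can be reproduced on a tiny scale. The sketch below builds a document-term matrix from three hypothetical mini-documents in base R; the names docs, vocab, count_words and toy_D are illustrative, and this is not how dailykos.csv was produced.

```r
# Three hypothetical mini-documents (not from dailykos)
docs   = c("bush kerry bush poll", "iraq war iraq", "kerry poll")
tokens = strsplit(docs, " ", fixed = TRUE)
vocab  = sort(unique(unlist(tokens)))    # one column per distinct word

# Count how often each vocabulary word occurs in one document
count_words = function(tk) tabulate(match(tk, vocab), nbins = length(vocab))

dtm = t(vapply(tokens, count_words, integer(length(vocab))))
colnames(dtm) = vocab
toy_D = as.data.frame(dtm)               # rows = documents, columns = words

class(toy_D)        # "data.frame"
toy_D[1, "bush"]    # 2: "bush" appears twice in document 1
toy_D[2, "bush"]    # 0: document 2 does not contain "bush"
```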
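When clustering on a term-frequency table, what are the segmentation variables?
- The segmentation variables are the columns (colnames) of the data, that is, the count of each word; how high or low those counts are determines how the documents are separated.
- The values are word frequencies: the larger the number, the more often that word appears in the document.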
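To make the role of the word columns concrete, the sketch below computes the Euclidean distance between two documents by hand and compares it with dist. The three-word table is hypothetical, not taken from dailykos.

```r
# Hypothetical word counts for three documents (not dailykos)
toy = data.frame(bush = c(2, 0, 1),
                 iraq = c(0, 3, 0),
                 poll = c(1, 0, 1))

# Distance between documents 1 and 2: every word column contributes one term
x1 = unlist(toy[1, ])
x2 = unlist(toy[2, ])
by_hand = sqrt(sum((x1 - x2)^2))

# dist() returns the same value for that pair
d_toy = as.matrix(dist(toy, method = "euclidean"))
all.equal(by_hand, d_toy[1, 2])
```

Documents with similar counts across all words end up close together, which is what drives the cluster assignments.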
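What is the difference between choosing the number of clusters from the dendrogram and choosing it from the application's needs?
- The number of clusters suggested by the dendrogram does not necessarily match the number you finally decide on; sometimes practical considerations matter. For example, the dendrogram may favor three clusters (the height gap is largest at three clusters), but if the data set is very large, three clusters may not classify it usefully, and more clusters may be needed to explain the data.
- Sometimes the dendrogram may tell us that 10 clusters is ideal, but our decision only calls for a few different plans tailored to the characteristics of different groups. So we weigh for ourselves whether to increase or decrease the number of clusters to match the number of decisions we want to make.
Group reflections
Hierarchical clustering helps us divide a large data set into several clusters, and the dendrogram shows the number of clusters that best fits the data. In practice, though, we still have to judge the appropriate number of clusters from our own experience rather than blindly trusting what the dendrogram shows. In addition, after hierarchical clustering we can find which words occur most often in each cluster and so understand each cluster's characteristics.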
