Main topic: clustering articles by their word-frequency table (document-term matrix)
Learning points:
rm(list=ls(all=TRUE))
Sys.setlocale("LC_ALL","C")
## [1] "C"
options(digits=4, scipen=12)
library(dplyr)
Let’s start by building a hierarchical clustering model. First, read the data set into R. Then, compute the distances (using method="euclidean"), and use hclust to build the model (using method="ward.D"). You should cluster on all of the variables.
D = read.csv("data/dailykos.csv")
Ddis = dist(D, method="euclidean") # compute the pairwise distance matrix
# Word-frequency table: Document Term Matrix (one row per article, one column per word)
D[1:20, 1:10]
## abandon abc ability abortion absolute abstain abu abuse accept access
## 1 0 0 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0 0 0
## 3 0 0 0 0 0 1 0 0 0 0
## 4 0 0 0 0 0 0 0 0 0 0
## 5 0 0 0 0 0 0 0 0 0 0
## 6 0 0 0 0 0 0 0 0 0 0
## 7 0 0 0 0 0 0 0 0 0 0
## 8 0 0 0 0 0 0 0 0 0 0
## 9 0 0 0 0 0 0 0 0 0 0
## 10 0 0 0 0 0 0 0 0 0 0
## 11 0 0 0 0 0 0 0 0 0 0
## 12 0 0 0 0 0 1 0 0 0 0
## 13 0 0 0 0 0 0 0 0 0 0
## 14 0 0 0 0 0 0 0 0 0 0
## 15 0 0 0 0 0 0 0 0 0 0
## 16 0 0 0 0 0 0 0 0 0 0
## 17 0 0 0 0 0 0 0 0 0 0
## 18 0 0 0 0 0 0 0 0 0 0
## 19 0 0 0 0 0 0 0 0 0 0
## 20 0 0 0 0 0 0 0 0 0 0
Running the dist function will probably take you a while. Why? Select all that apply.
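One reason dist is slow here is that it computes one distance for every pair of articles, and each distance sums over every word column. The sketch below counts the pairs, taking n = 3430 from the cluster-size tables later in this document, and checks the formula on a small made-up matrix (`small` is a hypothetical stand-in, not the dailykos data):

```r
# Number of articles, taken from the cluster-size tables in this document
n <- 3430
n_pairs <- n * (n - 1) / 2   # one distance per unordered pair of rows
n_pairs

# Sanity check of the formula on a tiny made-up matrix (hypothetical data)
set.seed(1)
small <- matrix(rpois(6 * 4, lambda = 1), nrow = 6)
length(dist(small, method = "euclidean")) == 6 * 5 / 2
```

With several million pairs and more than 1,000 columns per pair, both the number of observations and the number of variables contribute to the running time.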
Plot the dendrogram of your hierarchical clustering model.
hcl = hclust(Ddis, method="ward.D") # build the hierarchical clustering tree
plot(hcl) # draw the dendrogram
Just looking at the dendrogram, which of the following seem like good choices for the number of clusters? Select all that apply.
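One informal way to read a dendrogram is to look for tall vertical gaps between successive merges: a horizontal cut placed inside a tall gap gives a number of clusters that is not sensitive to the exact cut height. The sketch below illustrates this on R's built-in USArrests data, which is an assumption made so the example runs without dailykos.csv; the names `hc`, `h`, `gaps`, and `gap_by_k` are illustrative only.

```r
# Illustration on built-in data (USArrests), not the dailykos data
hc <- hclust(dist(scale(USArrests), method = "euclidean"), method = "ward.D")

# Merge heights, largest first: h[1] is the final merge (2 clusters -> 1)
h <- rev(hc$height)

# gaps[i] is the vertical room available for a cut that yields i + 1 clusters
gaps <- -diff(h)

# Gap sizes for k = 2..7 clusters; larger gaps suggest more natural cuts
gap_by_k <- data.frame(k = 2:7, gap = gaps[1:6])
gap_by_k
```

This is only a heuristic for reading the plot; it does not replace judging whether the resulting number of clusters suits the application.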
In this problem, we are trying to cluster news articles or blog posts into groups. This can be used to show readers categories to choose from when trying to decide what to read. Just thinking about this application, what are good choices for the number of clusters? Select all that apply.
Let’s pick 7 clusters. This number is reasonable according to the dendrogram, and also seems reasonable for the application. Use the cutree function to split your data into 7 clusters.
chcl = cutree(hcl, k=7)
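Now, we don’t really want to run tapply on every single variable when we have over 1,000 different variables. Let’s instead use the subset function to subset our data by cluster. Create 7 new datasets, each containing the observations from one of the clusters. (The code here uses split instead, which builds all 7 subsets in one call and returns them as a list.)
L = split(D, chcl) # L[[k]] holds the rows of D assigned to cluster k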
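To show that split gives the same subsets as repeated subset calls, here is a minimal sketch on a made-up data frame (`toy` and `grp` are hypothetical stand-ins for D and the cutree output):

```r
# Hypothetical toy data: 4 "articles", 2 word columns
toy <- data.frame(bush = c(1, 0, 2, 0), poll = c(0, 3, 0, 1))
grp <- c(1, 2, 1, 2)            # stand-in for the cluster labels from cutree

by_subset <- subset(toy, grp == 1)   # one cluster at a time
by_split  <- split(toy, grp)         # all clusters at once, as a named list

# Same rows and values for cluster 1 either way
all.equal(by_subset, by_split[["1"]], check.attributes = FALSE)

# Number of observations in each cluster
sapply(by_split, nrow)
```

The list form is convenient because later steps can loop over clusters with sapply rather than repeating the same command 7 times.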
How many observations are in cluster 3?
nrow(L[[3]]) # double brackets [[3]] select the data frame for cluster 3
## [1] 374
Which cluster has the most observations?
table(chcl) # cluster 1 has the most observations
## chcl
## 1 2 3 4 5 6 7
## 1266 321 374 139 407 714 209
Which cluster has the fewest observations?
# Cluster 4 (139 observations)
Instead of looking at the average value in each variable individually, we’ll just look at the top 6 words in each cluster. To do this for cluster 1, type the following in your R console (where “HierCluster1” should be replaced with the name of your first cluster subset):
tail(sort(colMeans(HierCluster1)))
This computes the mean frequency values of each of the words in cluster 1, and then outputs the 6 words that occur the most frequently. The colMeans function computes the column (word) means, the sort function orders the words in increasing order of the mean values, and the tail function outputs the last 6 words listed, which are the ones with the largest column means.
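The three steps described above can be traced one at a time on a small made-up frequency table (the data frame `toy` and its counts are hypothetical, chosen only to make the ordering easy to follow):

```r
# Hypothetical word counts for 3 articles
toy <- data.frame(bush  = c(2, 1, 3),
                  kerry = c(1, 1, 1),
                  poll  = c(0, 1, 0),
                  iraq  = c(0, 0, 2))

means  <- colMeans(toy)   # mean frequency of each word (one value per column)
sorted <- sort(means)     # increasing order of mean frequency
top2   <- tail(sorted, 2) # last entries are the most frequent words
top2
```

tail returns 6 entries by default; the second argument is set to 2 here only because the toy table has 4 words.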
What is the most frequent word in this cluster, in terms of average value? Enter the word exactly how you see it in the output:
L[[1]] %>% colMeans %>% sort %>% tail # colMeans computes the mean of each column
## state republican poll democrat kerry bush
## 0.7575 0.7591 0.9036 0.9194 1.0624 1.7054
# Remember: L[[n]] is the data for cluster n
Now repeat the command given in the previous problem for each of the other clusters, and answer the following questions.
sapply(L, function(n) n %>% colMeans %>% sort %>% tail %>% names) %>% t
## [,1] [,2] [,3] [,4] [,5]
## 1 "state" "republican" "poll" "democrat" "kerry"
## 2 "bush" "democrat" "challenge" "vote" "poll"
## 3 "elect" "parties" "state" "republican" "democrat"
## 4 "campaign" "voter" "presided" "poll" "bush"
## 5 "american" "presided" "administration" "war" "iraq"
## 6 "race" "bush" "kerry" "elect" "democrat"
## 7 "democrat" "clark" "edward" "poll" "kerry"
## [,6]
## 1 "bush"
## 2 "november"
## 3 "bush"
## 4 "kerry"
## 5 "bush"
## 6 "poll"
## 7 "dean"
# sapply(x, FUN) applies FUN to each element of the list x
# t() transposes the result so that each row is one cluster
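To see what sapply and t are doing in the command above, here is a small sketch on made-up data (`toy`, `grp`, and `top2` are hypothetical names; only 2 groups and the top 2 words are used to keep the output small):

```r
# Hypothetical word counts for 4 articles in 2 groups
toy <- data.frame(bush  = c(2, 3, 0, 0),
                  kerry = c(1, 2, 0, 1),
                  iraq  = c(0, 0, 3, 2),
                  war   = c(0, 1, 2, 2))
grp <- c(1, 1, 2, 2)

# One character vector per group; sapply binds them into a matrix by column
top2 <- sapply(split(toy, grp),
               function(d) names(tail(sort(colMeans(d)), 2)))
top2      # one column per group
t(top2)   # transposed: one row per group, as in the output above
```

Because every group returns a vector of the same length, sapply simplifies the result to a matrix; without t, each cluster would be read down a column instead of across a row.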
Which words best describe cluster 2?
Which cluster could best be described as the cluster related to the Iraq war?
Which cluster best corresponds to the Democratic Party?
Now, run k-means clustering, setting the seed to 1000 right before you run the kmeans function. Again, pick the number of clusters equal to 7. You don’t need to add the iter.max argument.
set.seed(1000)
km = kmeans(D, 7) # fit a k-means model; the number of clusters k must be chosen up front
ckm = km$cluster
table(km$cluster) %>% sort
##
## 2 1 5 3 7 6 4
## 144 146 163 277 308 329 2063
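The seed matters because kmeans picks its starting centers at random, so two runs can label or even group the points differently. A minimal sketch of the effect of set.seed, using R's built-in USArrests data as an assumed stand-in for the dailykos data (`km_a` and `km_b` are illustrative names):

```r
x <- scale(USArrests)   # built-in data, used only for illustration

set.seed(1000)
km_a <- kmeans(x, centers = 3)

set.seed(1000)          # same seed right before the call
km_b <- kmeans(x, centers = 3)

# Same seed, same starting centers, same cluster assignment
identical(km_a$cluster, km_b$cluster)

# Cluster sizes, sorted as in the output above
sort(table(km_a$cluster))
```

iter.max is left at its default of 10 here, as in the exercise.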
Subset your data into the 7 clusters (7 new datasets) by using the “cluster” variable of your kmeans output.
How many observations are in Cluster 3?
+ 277
Which cluster has the most observations?
+ Cluster 4
Which cluster has the fewest number of observations?
+ Cluster 2
Now, output the six most frequent words in each cluster, like we did in the previous problem, for each of the k-means clusters.
library(dplyr)
L2 = split(D, ckm)
sapply(L2, function(x) x %>% colMeans %>% sort %>% tail %>% names)
## 1 2 3 4
## [1,] "state" "primaries" "administration" "elect"
## [2,] "iraq" "democrat" "iraqi" "republican"
## [3,] "kerry" "edward" "american" "kerry"
## [4,] "administration" "clark" "bush" "poll"
## [5,] "presided" "kerry" "war" "democrat"
## [6,] "bush" "dean" "iraq" "bush"
## 5 6 7
## [1,] "race" "democrat" "presided"
## [2,] "senate" "bush" "voter"
## [3,] "state" "challenge" "campaign"
## [4,] "parties" "vote" "poll"
## [5,] "republican" "poll" "bush"
## [6,] "democrat" "november" "kerry"
Which k-means cluster best corresponds to the Iraq War?
Which k-means cluster best corresponds to the Democratic Party? (Remember that we are looking for the names of the key Democratic Party leaders.)
For the rest of this problem, we’ll ask you to compare how observations were assigned to clusters in the two different methods. Use the table function to compare the cluster assignment of hierarchical clustering to the cluster assignment of k-means clustering.
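table(chcl,ckm) # what do the numbers mean? Each cell counts the articles in that hierarchical cluster (row) that are also in that k-means cluster (column)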
## ckm
## chcl 1 2 3 4 5 6 7
## 1 3 11 64 1045 32 0 111
## 2 0 0 0 0 0 320 1
## 3 85 10 42 79 126 8 24
## 4 10 5 0 0 1 0 123
## 5 48 0 171 145 3 1 39
## 6 0 2 0 712 0 0 0
## 7 0 116 0 82 1 0 10
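# table(Hierarchical=kg, KMeans=kg2): why can the instructor name the table headers? Because table() uses the argument names as the labels of its dimensions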
Which Hierarchical Cluster best corresponds to K-Means Cluster 2?
+ Hierarchical Cluster 7
Which Hierarchical Cluster best corresponds to K-Means Cluster 3?
+ Hierarchical Cluster 5
Which Hierarchical Cluster best corresponds to K-Means Cluster 7?
+ No Hierarchical Cluster contains at least half of the points in K-Means Cluster 7.
Which Hierarchical Cluster best corresponds to K-Means Cluster 6?
+ Hierarchical Cluster 2
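The answers above can be checked mechanically. The sketch below re-enters the counts from the cross-table printed earlier (so it runs without the data file) and, for each k-means cluster, finds the hierarchical cluster holding the most of its points and what share that is. The names `tab`, `best`, and `share` are illustrative.

```r
# Counts copied from the table(chcl, ckm) output above
tab <- matrix(c(
   3,  11,  64, 1045,  32,   0, 111,
   0,   0,   0,    0,   0, 320,   1,
  85,  10,  42,   79, 126,   8,  24,
  10,   5,   0,    0,   1,   0, 123,
  48,   0, 171,  145,   3,   1,  39,
   0,   2,   0,  712,   0,   0,   0,
   0, 116,   0,   82,   1,   0,  10),
  nrow = 7, byrow = TRUE,
  dimnames = list(Hierarchical = as.character(1:7),
                  KMeans       = as.character(1:7)))

best  <- apply(tab, 2, which.max)          # row with the largest count per column
share <- apply(tab, 2, max) / colSums(tab) # fraction of the k-means cluster it holds

data.frame(KMeans = colnames(tab),
           BestHierarchical = best,
           Share = round(share, 2))
```

A k-means cluster has a clear counterpart only when its share is at least 0.5, which is the criterion the answer for K-Means Cluster 7 relies on.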
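What is a word-frequency table (document-term matrix)? What is its data format?
When clustering on a word-frequency table, what are the segmentation variables?
What is the difference between choosing the number of clusters from the dendrogram and choosing it from the needs of the application?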