Main topic: clustering articles by their document-term matrix

Learning objectives:

rm(list=ls(all=T))
Sys.setlocale("LC_ALL","C")
[1] "C"
options(digits=4, scipen=12)
library(dplyr)



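0. Terminology (lecture notes)

0.1 Data mining

When you first receive a dataset and do not yet know what it looks like, explore it without a specific goal; trying many different things tends to surface interesting patterns. E.g. AS6-0 Wholesales, Movies

0.2 Cluster-based management (market segmentation)

Here the clustering has a purpose (managing each group differently), so the segmentation variables are chosen from those related to that purpose. E.g. AS6-2 Airlines

0.3 Pattern detection (predictive diagnosis)

This kind of clustering splits the data very finely, but the clustering itself is not the point. The goal is to find the few small clusters that look distinctive and have clear characteristics (e.g. 5 or 6 clusters out of 60 whose accuracy is above average), and then use those markedly different patterns for prediction.

0.4 Cluster-then-predict (cluster analysis model, cluster-based prediction model)
  • Cluster first, then predict (e.g. many minor illnesses can lead to heart disease, but each progression is different): form the clusters using domain knowledge, then predict within them. E.g. AS6-3 Stocks
  • For numerical models (models whose relationships can be read directly from the quantities), the clustering can be left to the machine: it groups purely on numbers and statistical significance, and the machine does that better than a person.
  • When domain knowledge is needed (e.g. stocks, medicine), let human experts do the clustering first and hand the prediction step to the machine. The two complement each other, and the predictions are usually more accurate.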
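
A minimal sketch of the cluster-then-predict idea in the bullets above, on synthetic data rather than the course datasets. The data frame `df`, its two-group structure, and the choice of `kmeans` followed by one `lm` per cluster are all assumptions made for illustration; where domain knowledge is needed, the grouping column would come from an expert instead of `kmeans`.

```r
set.seed(1)
n <- 200
# Two hypothetical groups whose x-y relationships differ
x <- c(rnorm(n / 2, mean = 0), rnorm(n / 2, mean = 6))
true_group <- rep(1:2, each = n / 2)
y <- ifelse(true_group == 1, 1 + 2 * x, 20 - 3 * x) + rnorm(n, sd = 0.5)
df <- data.frame(x = x, y = y)

# Step 1: cluster on the predictor only
km <- kmeans(df["x"], centers = 2, nstart = 10)
df$cluster <- km$cluster

# Step 2: fit one model per cluster
models <- lapply(split(df, df$cluster), function(d) lm(y ~ x, data = d))
slopes <- sapply(models, function(m) coef(m)[["x"]])

# For comparison: a single model fitted to all rows
pooled <- lm(y ~ x, data = df)
round(slopes, 2)
round(coef(pooled)[["x"]], 2)
```

When the groups really do follow different relationships, the per-cluster slopes differ from each other and from the pooled slope, which is the motivation for predicting within clusters.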

1. Hierarchical Clustering

1.1 Document-term matrix, distance matrix, hierarchical clustering

Let’s start by building a hierarchical clustering model. First, read the data set into R. Then, compute the distances (using method="euclidean"), and use hclust to build the model (using method="ward.D"). You should cluster on all of the variables.

D = read.csv("dailykos.csv")
dim(D)
[1] 3430 1545
# Document-term matrix (word-frequency table)
D[1:20, 1:10]
   abandon abc ability abortion absolute abstain abu abuse accept
1        0   0       0        0        0       0   0     0      0
2        0   0       0        0        0       0   0     0      0
3        0   0       0        0        0       1   0     0      0
4        0   0       0        0        0       0   0     0      0
5        0   0       0        0        0       0   0     0      0
6        0   0       0        0        0       0   0     0      0
7        0   0       0        0        0       0   0     0      0
8        0   0       0        0        0       0   0     0      0
9        0   0       0        0        0       0   0     0      0
10       0   0       0        0        0       0   0     0      0
11       0   0       0        0        0       0   0     0      0
12       0   0       0        0        0       1   0     0      0
13       0   0       0        0        0       0   0     0      0
14       0   0       0        0        0       0   0     0      0
15       0   0       0        0        0       0   0     0      0
16       0   0       0        0        0       0   0     0      0
17       0   0       0        0        0       0   0     0      0
18       0   0       0        0        0       0   0     0      0
19       0   0       0        0        0       0   0     0      0
20       0   0       0        0        0       0   0     0      0
   access
1       0
2       0
3       0
4       0
5       0
6       0
7       0
8       0
9       0
10      0
11      0
12      0
13      0
14      0
15      0
16      0
17      0
18      0
19      0
20      0
# Distance matrix
t0 = Sys.time()
dist = dist(D, method = "euclidean")
Sys.time() - t0
Time difference of 3.501 mins

Running the dist function will probably take you a while. Why? Select all that apply.
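
One likely reason, worked out from the dimensions printed above: `dist` has to compare every pair of rows, and each comparison runs over every column. The arithmetic below is only a back-of-the-envelope count of the work involved, not a timing model.

```r
n <- 3430   # documents (rows of D, from dim(D))
p <- 1545   # words (columns of D, from dim(D))

n_pairs <- choose(n, 2)   # distinct pairs of documents: n * (n - 1) / 2
n_terms <- n_pairs * p    # squared differences summed over all pairs

n_pairs                          # 5880735
format(n_terms, big.mark = ",")  # total per-column comparisons
```

Both a large number of observations and a large number of variables therefore contribute to the running time.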

# Hierarchical clustering analysis
t0 = Sys.time()
clustD = hclust(dist, method = "ward.D")
Sys.time() - t0
Time difference of 0.625 secs

Plot the dendrogram of your hierarchical clustering model.
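
A small sketch of how a dendrogram is drawn from an `hclust` object. It uses synthetic data (the matrix `toy` is an assumption, not the Daily Kos data) so that it runs on its own; with the model built above, the equivalent call would presumably be `plot(clustD)`.

```r
set.seed(42)
# Three hypothetical groups of 20 points each in two dimensions
toy <- rbind(matrix(rnorm(40, mean = 0),  ncol = 2),
             matrix(rnorm(40, mean = 5),  ncol = 2),
             matrix(rnorm(40, mean = 10), ncol = 2))

hc <- hclust(dist(toy, method = "euclidean"), method = "ward.D")

# Draw the tree, hiding the 60 leaf labels to keep it readable
plot(hc, labels = FALSE, main = "Toy dendrogram")

# Outline the clusters obtained by cutting the tree into 3 groups
rect.hclust(hc, k = 3, border = "red")
```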


1.2 Choosing the number of clusters from the dendrogram

Just looking at the dendrogram, which of the following seem like good choices for the number of clusters? Select all that apply.

  • 2
  • 3
  • Both 2 and 3 are good choices according to the dendrogram, because there is a lot of vertical space between the horizontal merge lines at those cut-off points.
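
The space between the horizontal lines can also be read numerically from the merge heights stored in the `hclust` object. The sketch below uses synthetic data (the matrix `toy` is an assumption, not the Daily Kos data) and lists the vertical gap available for each candidate number of clusters.

```r
set.seed(42)
toy <- rbind(matrix(rnorm(40, mean = 0),  ncol = 2),
             matrix(rnorm(40, mean = 5),  ncol = 2),
             matrix(rnorm(40, mean = 10), ncol = 2))
hc <- hclust(dist(toy), method = "ward.D")

top  <- rev(hc$height)[1:6]   # heights of the last six merges, tallest first
gaps <- -diff(top)            # vertical space between consecutive merges
names(gaps) <- 2:6            # a cut placed inside gap k yields k clusters
round(gaps, 2)
```

A large value under the name k means a horizontal cut giving k clusters has plenty of room, which is the visual criterion described above.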

1.3 Choosing the number of clusters from the application

In this problem, we are trying to cluster news articles or blog posts into groups. This can be used to show readers categories to choose from when trying to decide what to read. Just thinking about this application, what are good choices for the number of clusters? Select all that apply.

  • 7
  • 8
  • It is probably better to show the reader more than 2 or 3 categories, which would be too broad to be useful. Seven or eight categories seem more reasonable.

1.4 Splitting the data by cluster

Let’s pick 7 clusters. This number is reasonable according to the dendrogram, and also seems reasonable for the application. Use the cutree function to split your data into 7 clusters.

clustGroup = cutree(clustD, k=7)
clust1 = subset(D, clustGroup==1) # the most obs.
clust2 = subset(D, clustGroup==2)
clust3 = subset(D, clustGroup==3) # 374 obs.
clust4 = subset(D, clustGroup==4) # the fewest obs.
clust5 = subset(D, clustGroup==5)
clust6 = subset(D, clustGroup==6)
clust7 = subset(D, clustGroup==7)
clustHi = split(D, clustGroup) # alternative: this single line replaces the seven subset() calls above

Now, we don’t really want to run tapply on every single variable when we have over 1,000 different variables. Let’s instead use the subset function to subset our data by cluster. Create 7 new datasets, each containing the observations from one of the clusters.

How many observations are in cluster 3?

nrow(clust3) # 374 obs.
[1] 374
View(clustHi) # or inspect the list in the environment pane to see how many observations each cluster has
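
The cluster sizes can also be read off without opening a viewer. A minimal sketch on synthetic data (the data frame `toy` and its two groups are assumptions for illustration): `table` counts the labels returned by `cutree`, and `sapply(..., nrow)` counts the rows of each piece returned by `split`.

```r
set.seed(7)
# Two hypothetical groups: 20 rows near 0 and 10 rows near 8
toy <- as.data.frame(rbind(matrix(rnorm(60, mean = 0), ncol = 3),
                           matrix(rnorm(30, mean = 8), ncol = 3)))

hc  <- hclust(dist(toy), method = "ward.D")
grp <- cutree(hc, k = 2)      # one cluster label per row

table(grp)                    # observations per cluster
parts <- split(toy, grp)      # list with one data frame per cluster
sapply(parts, nrow)           # same counts, taken from the split data
```

With the objects defined above, `table(clustGroup)` or `sapply(clustHi, nrow)` would presumably answer the questions about the largest and smallest clusters in one call.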

Which cluster has the most observations?

  • 1

Which cluster has the fewest observations?

  • 4

1.5 Most frequent words in cluster 1

Instead of looking at the average value in each variable individually, we’ll just look at the top 6 words in each cluster. To do this for cluster 1, type the following in your R console (where “HierCluster1” should be replaced with the name of your first cluster subset):

tail(sort(colMeans(HierCluster1)))

This computes the mean frequency values of each of the words in cluster 1, and then outputs the 6 words that occur the most frequently. The colMeans function computes the column (word) means, the sort function orders the words in increasing order of the mean values, and the tail function outputs the last 6 words listed, which are the ones with the largest column means.

What is the most frequent word in this cluster, in terms of average value? Enter the word exactly how you see it in the output:

tail(sort(colMeans(clust1))) # bush
     state republican       poll   democrat      kerry       bush 
    0.7575     0.7591     0.9036     0.9194     1.0624     1.7054 
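# colMeans averages each column (every column here is one word), so in this example a column mean is the average number of times that word appears per document in the cluster.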

1.6 Most frequent words in each cluster

Now repeat the command given in the previous problem for each of the other clusters, and answer the following questions.

tail(sort(colMeans(clust2))) 
     bush  democrat challenge      vote      poll  november 
    2.847     2.850     4.097     4.399     4.847    10.340 
tail(sort(colMeans(clust3))) 
     elect    parties      state republican   democrat       bush 
     1.647      1.666      2.321      2.524      3.824      4.406 
tail(sort(colMeans(clust4))) 
campaign    voter presided     poll     bush    kerry 
   1.432    1.540    1.626    3.590    7.835    8.439 
tail(sort(colMeans(clust5))) 
      american       presided administration            war 
         1.091          1.120          1.231          1.776 
          iraq           bush 
         2.428          3.941 
tail(sort(colMeans(clust6)))
    race     bush    kerry    elect democrat     poll 
  0.4580   0.4888   0.5168   0.5350   0.5644   0.5812 
tail(sort(colMeans(clust7)))
democrat    clark   edward     poll    kerry     dean 
   2.148    2.498    2.608    2.766    3.952    5.804 
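
The repeated calls above can be collapsed into a single `lapply` over the split data. The sketch below uses a small synthetic word-count table (the names `dtm`, `grp`, and the boosted word are assumptions for illustration) so that it runs without the course data.

```r
set.seed(3)
words <- c("bush", "kerry", "iraq", "poll", "vote", "war", "dean", "elect")

# Hypothetical counts: 40 documents, 8 words, roughly one occurrence each
dtm <- as.data.frame(matrix(rpois(40 * 8, lambda = 1), ncol = 8,
                            dimnames = list(NULL, words)))
grp <- rep(1:2, each = 20)    # pretend cluster labels

# Make "iraq" clearly more frequent in the second group
dtm$iraq[grp == 2] <- dtm$iraq[grp == 2] + 5

# Top 3 words per cluster, most frequent last (as with tail(sort(...)))
top_words <- lapply(split(dtm, grp), function(d) tail(sort(colMeans(d)), 3))
top_words
```

With the list created earlier, the equivalent call would presumably be `lapply(clustHi, function(d) tail(sort(colMeans(d))))`.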

Which words best describe cluster 2?

  • november, poll, vote
  • This suggests that the articles in cluster 2 mostly discuss the November election.

Which cluster could best be described as the cluster related to the Iraq war?

  • 5
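  • The keyword iraq appears among the top six words only in cluster 5, with a mean as high as 2.428. The other top words, war, bush (the U.S. President), and american, also point to the Iraq war.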

In 2004, one of the candidates for the Democratic nomination for the President of the United States was Howard Dean, John Kerry was the candidate who won the Democratic nomination, and John Edwards was the running mate of John Kerry (the Vice President nominee). Given this information, which cluster best corresponds to the Democratic party?

  • 7
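  • Besides the keyword democrat, the top words of cluster 7 include edward, kerry, and dean, all Democratic candidates named in the question. So even though democrat itself has a lower mean here (2.148) than in clusters 2 and 3, the cluster as a whole is made up of Democratic-related terms and most likely corresponds to the Democratic party.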



2. K-Means Clustering

2.1 K-means clustering

Now, run k-means clustering, setting the seed to 1000 right before you run the kmeans function. Again, pick the number of clusters equal to 7. You don’t need to add the iter.max argument.

library(caTools)
set.seed(1000)
clustKM = kmeans(D, centers=7)
clustKMgroup = split(D, clustKM$cluster)
nrow(clustKMgroup[[3]])
[1] 277
View(clustKM)
# cluster4 has the most obs, while cluster2 has the fewest obs.

Subset your data into the 7 clusters (7 new datasets) by using the “cluster” variable of your kmeans output.

How many observations are in Cluster 3?

  • 277
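  • Because k-means forms clusters differently from hierarchical clustering, the number of observations in each cluster naturally differs as well. The cluster numbering (1 to 7) carries no meaning; the numbers are only labels.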
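
A minimal sketch of where the cluster sizes live in a `kmeans` result, on synthetic data (the matrix `toy` and its two groups are assumptions for illustration). The fitted object is a list, so the assignments are in `$cluster` and the counts in `$size`; indexing the object by position, as in `clustKM[[3]]`, does not return a cluster.

```r
set.seed(1000)
# Two hypothetical groups: 20 points near 0 and 30 points near 10
toy <- rbind(matrix(rnorm(40, mean = 0),  ncol = 2),
             matrix(rnorm(60, mean = 10), ncol = 2))

km <- kmeans(toy, centers = 2, nstart = 10)

km$size                 # observations per cluster
table(km$cluster)       # same counts, from the per-row labels

# Split the rows by cluster label, then count the rows in each piece
parts <- split(as.data.frame(toy), km$cluster)
sapply(parts, nrow)
```

Which of the two groups ends up labelled 1 depends on the random start, which is why only the sizes, not the labels, are comparable across runs or methods.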

Which cluster has the most observations?

  • 4

Which cluster has the fewest number of observations?

  • 2

2.2 Most frequent words in each cluster

Now, output the six most frequent words in each cluster, like we did in the previous problem, for each of the k-means clusters.

tail(sort(colMeans(clustKMgroup[[1]])))
         state           iraq          kerry administration 
         1.610          1.616          1.637          2.664 
      presided           bush 
         2.767         11.432 
tail(sort(colMeans(clustKMgroup[[2]])))
primaries  democrat    edward     clark     kerry      dean 
    2.319     2.694     2.799     3.090     4.979     8.278 
tail(sort(colMeans(clustKMgroup[[3]])))
administration          iraqi       american           bush 
         1.390          1.610          1.686          2.610 
           war           iraq 
         3.025          4.094 
tail(sort(colMeans(clustKMgroup[[4]])))
     elect republican      kerry       poll   democrat       bush 
    0.6011     0.6175     0.6495     0.7475     0.7891     1.1474 
tail(sort(colMeans(clustKMgroup[[5]])))
      race     senate      state    parties republican   democrat 
     2.485      2.650      3.521      3.620      4.638      6.994 
tail(sort(colMeans(clustKMgroup[[6]])))
 democrat      bush challenge      vote      poll  november 
    2.900     2.960     4.122     4.447     4.872    10.371 
tail(sort(colMeans(clustKMgroup[[7]])))
presided    voter campaign     poll     bush    kerry 
   1.325    1.334    1.383    2.789    5.971    6.481 

Which k-means cluster best corresponds to the Iraq War?

  • 3
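  • Although iraq appears among the top words of both clusters 1 and 3, only cluster 3 also contains other Iraq-war keywords such as iraqi, american, and war, and its mean for iraq (4.094) is far higher than that of cluster 1 (1.616).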

Which k-means cluster best corresponds to the democratic party? (Remember that we are looking for the names of the key democratic party leaders.)

  • 2
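  • The keywords here match those found in the hierarchical analysis: democrat, edward, kerry, and dean. Cluster 2 therefore most likely corresponds to the Democratic party.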

2.3 ~ 2.6 Correspondence between the two clusterings

For the rest of this problem, we’ll ask you to compare how observations were assigned to clusters in the two different methods. Use the table function to compare the cluster assignment of hierarchical clustering to the cluster assignment of k-means clustering.

table(clustGroup,clustKM$cluster)
          
clustGroup    1    2    3    4    5    6    7
         1    3   11   64 1045   32    0  111
         2    0    0    0    0    0  320    1
         3   85   10   42   79  126    8   24
         4   10    5    0    0    1    0  123
         5   48    0  171  145    3    1   39
         6    0    2    0  712    0    0    0
         7    0  116    0   82    1    0   10
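
The questions below ask whether one hierarchical cluster holds at least half of a given k-means cluster. A sketch of that check, using the counts copied from the table above: `prop.table` with `margin = 2` turns each column into shares of that k-means cluster. The names `tab`, `share`, `best_row`, and `best_share` are introduced here for illustration.

```r
# Counts copied from the cross-tabulation (rows: hierarchical, columns: k-means)
tab <- matrix(c(
   3,  11,  64, 1045,  32,   0, 111,
   0,   0,   0,    0,   0, 320,   1,
  85,  10,  42,   79, 126,   8,  24,
  10,   5,   0,    0,   1,   0, 123,
  48,   0, 171,  145,   3,   1,  39,
   0,   2,   0,  712,   0,   0,   0,
   0, 116,   0,   82,   1,   0,  10),
  nrow = 7, byrow = TRUE,
  dimnames = list(hier = as.character(1:7), kmeans = as.character(1:7)))

share <- prop.table(tab, margin = 2)         # each column sums to 1

best_row   <- apply(share, 2, which.max)     # hierarchical cluster with the largest share
best_share <- apply(share, 2, max)           # that largest share

data.frame(kmeans    = colnames(tab),
           best_hier = best_row,
           share     = round(best_share, 3),
           majority  = best_share >= 0.5)
```

With the original objects, `prop.table(table(clustGroup, clustKM$cluster), margin = 2)` would presumably give the same shares directly.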

Which Hierarchical Cluster best corresponds to K-Means Cluster 2?

  • 7
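  • As noted above, the cluster numbering carries no meaning. The table shows that hierarchical cluster 7 (row) and k-means cluster 2 (column) share 116 observations, the largest count in that column and more than half of it (116 of 144).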

Which Hierarchical Cluster best corresponds to K-Means Cluster 3?

  • 5
  • Hierarchical cluster 5 and k-means cluster 3 share 171 observations, the largest count in that column and more than half of it (171 of 277).

Which Hierarchical Cluster best corresponds to K-Means Cluster 7?

  • No Hierarchical Cluster contains at least half of the points in K-Means Cluster 7.
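  • Looking at k-means cluster 7 alone, hierarchical clusters 1 and 4 both have high counts (111 and 123), but neither reaches half of the column total of 308, so no hierarchical cluster corresponds to it.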

Which Hierarchical Cluster best corresponds to K-Means Cluster 6?

  • 2
  • Hierarchical cluster 2 and k-means cluster 6 share 320 observations, almost the entire column (320 of 329), so the two clusters are very similar.

[Discussion Questions]

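What is a document-term matrix? What is its data format?

  • A document-term matrix is the output of a text-analysis process: articles, news, magazines, or reviews are collected from various sources (such as the web), the messy text is cleaned and analyzed, and the number of times each selected word appears is counted. The result is stored as a data frame in which each column represents a word and each row represents the document the counts came from.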
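
A toy illustration of that format, built in base R from three made-up documents. The texts and the names `docs`, `vocab`, and `dtm` are assumptions for illustration; real pipelines usually add cleaning steps such as stemming and stop-word removal, which are omitted here.

```r
# Three hypothetical, already-cleaned documents
docs <- c(d1 = "bush kerry poll poll",
          d2 = "iraq war bush",
          d3 = "poll vote vote vote")

tokens <- strsplit(docs, " ", fixed = TRUE)   # one character vector per document
vocab  <- sort(unique(unlist(tokens)))        # the columns of the matrix

# Count how often each vocabulary word occurs in each document
counts <- vapply(tokens,
                 function(w) tabulate(match(w, vocab), nbins = length(vocab)),
                 integer(length(vocab)))

# Rows = documents, columns = words
dtm <- as.data.frame(t(counts))
names(dtm) <- vocab
dtm
```

Each row is one document and each cell is a count, which is the same layout as `D[1:20, 1:10]` shown earlier.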

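When clustering on a document-term matrix, what are the segmentation variables?

  • The clustering groups documents by the words they use, so the words (the columns) are the segmentation variables.
  • The value of each segmentation variable is the number of times that word appears. Averaging each column within a cluster therefore gives the frequency of each word in that cluster, which we use to characterize the cluster.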

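How does choosing the number of clusters from the dendrogram differ from choosing it from the application?

  • Hierarchical clustering computes the distance between every pair of points (Euclidean distance: the square root of the sum of squared differences), looks for groupings with small within-cluster distances and large between-cluster distances, and presents the result as a dendrogram.
  • The longer the vertical lines between two merges in the dendrogram, the farther apart the clusters being joined are, which is what we prefer. Draw a horizontal cut across the vertical lines; the number of vertical lines it crosses is the number of clusters.
  • So, in contrast to choosing the number of clusters from the application, with the dendrogram we do not need to decide the number of clusters up front. We let the machine do the computation first and then read the plot to choose how to cluster.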
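
A short check of the Euclidean distance mentioned in the first bullet, on two made-up count vectors (`a` and `b` are assumptions for illustration). The manual formula squares each difference, sums, and takes the square root, and should agree with `dist`.

```r
a <- c(0, 1, 0, 2)   # word counts in a hypothetical document
b <- c(1, 1, 3, 0)   # word counts in another hypothetical document

manual  <- sqrt(sum((a - b)^2))                       # sqrt(1 + 0 + 9 + 4)
builtin <- as.numeric(dist(rbind(a, b), method = "euclidean"))

all.equal(manual, builtin)
```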








