Document clustering, or text clustering, is a popular application of clustering algorithms. A web search engine like Google often returns a huge number of results for a simple query. For example, typing the search term “jaguar” into Google returns around 200 million results. This makes it difficult to browse or find relevant information, especially when the search term has multiple meanings: a search for “jaguar” might be looking for information about the animal, the car, or the Jacksonville Jaguars football team.
Clustering methods can be used to automatically group search results into categories, making it easier to find relevant results. This approach is used in the search engines PolyMeta and Helioid, as well as on FirstGov.gov, the official web portal for the U.S. government. The two most common algorithms used for document clustering are hierarchical clustering and k-means.
In this problem, we’ll be clustering articles published on Daily Kos, an American political blog that publishes news and opinion articles written from a progressive point of view. Daily Kos was founded by Markos Moulitsas in 2002, and as of September 2014, the site had an average weekday traffic of hundreds of thousands of visits.
The file dailykos.csv contains data on 3,430 news articles or blog posts that were posted on Daily Kos in 2004, in the run-up to the United States presidential election. The leading candidates were incumbent President George W. Bush (Republican) and John Kerry (Democrat). Foreign policy was a dominant topic of the election, specifically the 2003 invasion of Iraq.
Each of the variables in the dataset is a word that appears in at least 50 different articles (1,545 words in total). The set of words has been trimmed using some of the techniques covered in the previous week on text analytics: punctuation and stop words have been removed. For each document, the variable values are the number of times that word appeared in the document.
setwd("C:/Users/jzchen/Documents/Courses/Analytics Edge/Unit_6_Clustering")
kos <- read.csv("dailykos.csv")
str(kos)
## 'data.frame': 3430 obs. of 1545 variables:
## $ abandon : int 0 0 0 0 0 0 0 0 0 0 ...
## $ abc : int 0 0 0 0 0 0 0 0 0 0 ...
## $ ability : int 0 0 0 0 0 0 0 0 0 0 ...
## $ abortion : int 0 0 0 0 0 0 0 0 0 0 ...
## $ absolute : int 0 0 0 0 0 0 0 0 0 0 ...
## $ abstain : int 0 0 1 0 0 0 0 0 0 0 ...
## $ abu : int 0 0 0 0 0 0 0 0 0 0 ...
## $ abuse : int 0 0 0 0 0 0 0 0 0 0 ...
## $ accept : int 0 0 0 0 0 0 0 0 0 0 ...
## $ access : int 0 0 0 0 0 0 0 0 0 0 ...
## $ accomplish : int 0 0 0 0 0 0 0 0 0 0 ...
## $ account : int 0 0 2 0 0 0 0 0 0 0 ...
## $ accurate : int 0 0 0 0 0 0 0 0 0 0 ...
## $ accusations : int 0 0 0 2 0 0 0 0 0 0 ...
## $ achieve : int 0 0 0 0 0 0 0 0 0 0 ...
## $ acknowledge : int 0 0 0 0 0 0 0 0 0 0 ...
## $ act : int 0 0 0 0 0 0 0 0 0 0 ...
## $ action : int 2 0 0 0 0 0 0 0 0 0 ...
## $ active : int 0 0 0 0 0 0 0 0 0 0 ...
## $ activist : int 0 0 0 0 0 0 0 0 0 0 ...
## $ actual : int 0 0 0 0 0 0 0 0 0 0 ...
## $ add : int 0 0 0 0 0 0 0 0 0 0 ...
## $ added : int 1 0 0 0 1 0 0 0 1 0 ...
## $ addition : int 0 0 0 0 0 0 0 0 0 0 ...
## $ address : int 0 0 0 0 0 0 0 0 0 0 ...
## $ admin : int 0 0 1 0 0 0 0 0 0 0 ...
## $ administration : int 1 0 0 0 0 0 0 0 0 0 ...
## $ admit : int 0 0 0 0 1 0 0 0 0 0 ...
## $ advance : int 0 0 0 0 0 0 0 0 0 0 ...
## $ advantage : int 0 0 0 1 0 0 0 0 0 0 ...
## $ advertise : int 0 0 1 0 0 0 0 0 0 0 ...
## $ advised : int 0 0 0 0 0 0 0 0 0 0 ...
## $ affair : int 0 0 0 0 0 0 0 0 0 0 ...
## $ affect : int 0 0 0 0 0 0 0 0 0 0 ...
## $ affiliate : int 0 0 0 0 0 0 0 0 0 0 ...
## $ afghanistan : int 0 0 0 0 0 0 0 0 0 0 ...
## $ afraid : int 0 0 0 0 0 0 0 0 0 0 ...
## $ afternoon : int 0 0 0 0 0 0 0 0 0 0 ...
## $ age : int 0 0 0 0 0 0 0 0 0 0 ...
## $ agencies : int 0 0 0 0 0 0 0 0 0 0 ...
## $ agenda : int 0 0 0 0 0 0 0 0 0 0 ...
## $ agree : int 0 0 0 0 0 0 0 0 0 0 ...
## $ ahead : int 0 0 0 0 0 0 0 0 0 0 ...
## $ aid : int 0 0 0 1 1 0 0 0 0 0 ...
## $ aim : int 0 0 0 0 0 0 0 0 0 0 ...
## $ air : int 0 0 0 0 0 0 0 0 0 0 ...
## $ alaska : int 0 0 0 0 0 0 0 0 0 0 ...
## $ allegation : int 0 0 0 0 0 0 0 0 0 0 ...
## $ allegory : int 0 0 0 0 0 0 0 0 0 0 ...
## $ allied : int 0 0 0 0 0 0 0 0 0 0 ...
## $ allowed : int 0 0 0 0 0 0 0 0 0 0 ...
## $ alternative : int 0 0 0 0 0 0 0 0 0 0 ...
## $ altsite : int 0 0 1 0 0 0 0 0 0 0 ...
## $ amazing : int 0 0 0 0 0 0 0 0 0 0 ...
## $ amendment : int 0 0 0 0 0 0 0 0 0 0 ...
## $ america : int 0 0 0 0 0 0 0 0 0 0 ...
## $ american : int 0 0 0 0 1 0 0 0 0 0 ...
## $ amount : int 0 0 0 0 0 0 0 0 0 0 ...
## $ amp : int 0 0 0 0 0 0 0 0 0 0 ...
## $ analysis : int 0 0 0 0 1 0 0 0 0 0 ...
## $ analyst : int 0 0 0 0 0 0 0 0 0 0 ...
## $ anecdotal : int 0 0 1 0 0 0 0 0 0 0 ...
## $ anger : int 0 0 0 0 0 0 0 0 0 0 ...
## $ angry : int 0 0 0 0 0 0 0 0 0 0 ...
## $ announce : int 0 0 0 0 0 0 0 0 0 0 ...
## $ annual : int 0 0 0 0 0 0 0 0 0 0 ...
## $ answer : int 0 0 0 1 0 0 1 0 0 0 ...
## $ apologies : int 0 0 0 0 0 0 0 0 0 0 ...
## $ apparent : int 0 0 0 0 0 0 0 0 0 0 ...
## $ appeal : int 0 0 0 0 0 0 0 0 0 0 ...
## $ appearance : int 0 0 0 0 0 0 0 0 0 0 ...
## $ applied : int 0 0 0 0 0 0 0 0 0 0 ...
## $ appointed : int 0 0 0 0 0 0 0 0 0 0 ...
## $ approach : int 0 0 0 0 0 0 1 0 0 0 ...
## $ approval : int 1 0 0 0 1 0 0 0 1 0 ...
## $ apr : int 0 0 0 0 0 0 0 0 0 0 ...
## $ april : int 0 0 0 0 0 0 0 0 0 0 ...
## $ arab : int 0 0 0 0 0 0 0 0 0 0 ...
## $ area : int 0 0 0 0 0 0 0 0 0 0 ...
## $ arent : int 0 0 0 0 0 0 0 0 0 0 ...
## $ arg : int 0 0 0 0 0 0 0 0 0 0 ...
## $ argue : int 0 0 0 0 0 0 0 0 0 0 ...
## $ argument : int 0 0 0 0 0 0 0 0 0 0 ...
## $ arizona : int 0 0 0 0 0 0 0 0 0 0 ...
## $ arm : int 0 0 0 0 0 0 0 0 0 0 ...
## $ armstrong : int 0 0 0 0 0 0 0 0 0 0 ...
## $ army : int 0 0 0 0 0 0 0 0 0 0 ...
## $ arrest : int 0 0 0 0 0 0 0 0 0 0 ...
## $ arrive : int 0 0 0 0 0 0 0 0 0 0 ...
## $ article : int 0 0 0 0 0 0 0 0 0 0 ...
## $ asap : int 0 0 1 0 0 0 0 0 0 0 ...
## $ asked : int 0 0 0 0 0 0 0 0 0 0 ...
## $ ass : int 0 0 0 0 0 0 0 0 0 0 ...
## $ assess : int 0 0 0 0 0 0 0 0 0 0 ...
## $ assist : int 0 0 0 0 0 0 0 0 0 0 ...
## $ associate : int 0 0 0 0 0 0 0 0 0 0 ...
## $ assume : int 0 0 0 0 0 0 0 0 0 0 ...
## $ atlanta : int 0 0 1 0 0 0 0 0 0 0 ...
## $ atrios : int 0 0 0 0 0 0 1 0 0 0 ...
## [list output truncated]
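As a quick sanity check, we can confirm the structure described above: each column is a word, each cell is a raw count, and every word should appear in at least 50 different articles. This is a minimal sketch; the exact minimum depends on how the trimming was done.

min(colSums(kos > 0))   # number of articles containing the rarest word; expected to be at least 50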
# Compute Euclidean distances between articles, then run hierarchical clustering
# with Ward's method and plot the dendrogram
kosDist <- dist(kos, method = "euclidean")
kosHierClust <- hclust(kosDist, method = "ward.D")
plot(kosHierClust)
Based on the dendrogram, two or three clusters look like a natural choice. But in this problem we are trying to cluster news articles or blog posts into groups that could be shown to readers as categories to choose from when deciding what to read, so we need more than two or three. Let’s pick 7 clusters.
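To see where a cut into 7 clusters would fall, we can outline the clusters directly on the dendrogram. This is a sketch using base R’s rect.hclust:

plot(kosHierClust)                                # redraw the dendrogram
rect.hclust(kosHierClust, k = 7, border = "red")  # outline the 7 clusters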
kosClusters <- cutree(kosHierClust, k = 7)
Now, we don’t really want to run tapply on every single variable when we have over 1,000 of them. Instead, let’s use the subset function to split the data by cluster, creating 7 new data frames, each containing the observations from one of the clusters (an equivalent one-liner using split is sketched after the subsets).
cluster1 <- subset(kos, kosClusters == 1)
cluster2 <- subset(kos, kosClusters == 2)
cluster3 <- subset(kos, kosClusters == 3)
cluster4 <- subset(kos, kosClusters == 4)
cluster5 <- subset(kos, kosClusters == 5)
cluster6 <- subset(kos, kosClusters == 6)
cluster7 <- subset(kos, kosClusters == 7)
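An equivalent one-liner uses split, which returns all seven subsets as a list. This is a sketch, and hierClusters is a name introduced here:

hierClusters <- split(kos, kosClusters)   # list of 7 data frames, one per hierarchical cluster
nrow(hierClusters[[1]])                   # same number of rows as cluster1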
To look at the average frequency of each word in each cluster, we use the colMeans command; wrapping it in sort and tail gives the six most frequent words per cluster. (A compact version that loops over all seven clusters is sketched after the per-cluster output.)
tail(sort(colMeans(cluster1)))
## state republican poll democrat kerry bush
## 0.7575039 0.7590837 0.9036335 0.9194313 1.0624013 1.7053712
tail(sort(colMeans(cluster2)))
## bush democrat challenge vote poll november
## 2.847352 2.850467 4.096573 4.398754 4.847352 10.339564
tail(sort(colMeans(cluster3)))
## elect parties state republican democrat bush
## 1.647059 1.665775 2.320856 2.524064 3.823529 4.406417
tail(sort(colMeans(cluster4)))
## campaign voter presided poll bush kerry
## 1.431655 1.539568 1.625899 3.589928 7.834532 8.438849
tail(sort(colMeans(cluster5)))
## american presided administration war iraq
## 1.090909 1.120393 1.230958 1.776413 2.427518
## bush
## 3.941032
tail(sort(colMeans(cluster6)))
## race bush kerry elect democrat poll
## 0.4579832 0.4887955 0.5168067 0.5350140 0.5644258 0.5812325
tail(sort(colMeans(cluster7)))
## democrat clark edward poll kerry dean
## 2.148325 2.497608 2.607656 2.765550 3.952153 5.803828
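Rather than repeating tail(sort(colMeans(...))) seven times, the same top-word lists can be produced in one pass over the split list. A minimal sketch:

# Six most frequent words in each hierarchical cluster, computed in one call
lapply(split(kos, kosClusters), function(cl) tail(sort(colMeans(cl))))

Now let’s try k-means clustering with the same number of clusters, setting the random seed first so that the results are reproducible.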
set.seed(1000)
KMC <- kmeans(kos, centers = 7)
str(KMC)
## List of 9
## $ cluster : int [1:3430] 4 4 6 4 1 4 7 4 4 4 ...
## $ centers : num [1:7, 1:1545] 0.0342 0.0556 0.0253 0.0136 0.0491 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:7] "1" "2" "3" "4" ...
## .. ..$ : chr [1:1545] "abandon" "abc" "ability" "abortion" ...
## $ totss : num 896461
## $ withinss : num [1:7] 76583 52693 99504 258927 88632 ...
## $ tot.withinss: num 730632
## $ betweenss : num 165829
## $ size : int [1:7] 146 144 277 2063 163 329 308
## $ iter : int 7
## $ ifault : int 0
## - attr(*, "class")= chr "kmeans"
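As a quick check on the output above, the total sum of squares should decompose into the within-cluster and between-cluster parts:

KMC$tot.withinss + KMC$betweenss   # 730632 + 165829 = 896461, which matches KMC$totss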
Subset the data into 7 clusters, this time using the k-means cluster assignments.
Kcluster1 <- subset(kos, KMC$cluster == 1)
Kcluster2 <- subset(kos, KMC$cluster == 2)
Kcluster3 <- subset(kos, KMC$cluster == 3)
Kcluster4 <- subset(kos, KMC$cluster == 4)
Kcluster5 <- subset(kos, KMC$cluster == 5)
Kcluster6 <- subset(kos, KMC$cluster == 6)
Kcluster7 <- subset(kos, KMC$cluster == 7)
Output the six most frequent words in each of the k-means clusters. (A shortcut that reads them off the cluster centroids is sketched after the output.)
tail(sort(colMeans(Kcluster1)))
## state iraq kerry administration presided
## 1.609589 1.616438 1.636986 2.664384 2.767123
## bush
## 11.431507
tail(sort(colMeans(Kcluster2)))
## primaries democrat edward clark kerry dean
## 2.319444 2.694444 2.798611 3.090278 4.979167 8.277778
tail(sort(colMeans(Kcluster3)))
## administration iraqi american bush war
## 1.389892 1.610108 1.685921 2.610108 3.025271
## iraq
## 4.093863
tail(sort(colMeans(Kcluster4)))
## elect republican kerry poll democrat bush
## 0.6010664 0.6175473 0.6495395 0.7474552 0.7891420 1.1473582
tail(sort(colMeans(Kcluster5)))
## race senate state parties republican democrat
## 2.484663 2.650307 3.521472 3.619632 4.638037 6.993865
tail(sort(colMeans(Kcluster6)))
## democrat bush challenge vote poll november
## 2.899696 2.960486 4.121581 4.446809 4.872340 10.370821
tail(sort(colMeans(Kcluster7)))
## presided voter campaign poll bush kerry
## 1.324675 1.334416 1.383117 2.788961 5.970779 6.480519
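Since each k-means centroid is simply the vector of column means for its cluster, the same six words per cluster can also be read directly from KMC$centers without building the subsets. A sketch:

# Six highest-weight words in each k-means centroid
lapply(1:7, function(i) tail(sort(KMC$centers[i, ])))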
Use the table function to compare the cluster assignments from hierarchical clustering to those from k-means clustering.
Which Hierarchical Cluster best corresponds to K-Means Cluster 2?
table(KMC$cluster)
##
## 1 2 3 4 5 6 7
## 146 144 277 2063 163 329 308
table(kosClusters)
## kosClusters
## 1 2 3 4 5 6 7
## 1266 321 374 139 407 714 209
table(kosClusters, KMC$cluster)
##
## kosClusters 1 2 3 4 5 6 7
## 1 3 11 64 1045 32 0 111
## 2 0 0 0 0 0 320 1
## 3 85 10 42 79 126 8 24
## 4 10 5 0 0 1 0 123
## 5 48 0 171 145 3 1 39
## 6 0 2 0 712 0 0 0
## 7 0 116 0 82 1 0 10
Reading down column 2, we see that 116 of the 144 observations (80.6%) in K-Means Cluster 2 fall in Hierarchical Cluster 7, so Hierarchical Cluster 7 best corresponds to K-Means Cluster 2.
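The 80.6% figure can be verified by converting the columns of the table to proportions. A sketch:

# Each column shows how one k-means cluster distributes across the hierarchical clusters;
# column 2 corresponds to K-Means Cluster 2, where row 7 should be about 0.806 (116/144)
prop.table(table(kosClusters, KMC$cluster), margin = 2)[, 2]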