Document clustering, or text clustering, is a popular application of clustering algorithms. A web search engine like Google often returns thousands of results for even a simple query. For example, typing the search term “jaguar” into Google returns around 200 million results. This makes it difficult to browse or find relevant information, especially when the search term has multiple meanings: a search for “jaguar” might be looking for information about the animal, the car, or the Jacksonville Jaguars football team.
Clustering methods can be used to automatically group search results into categories, making it easier to find relevant results. This approach is used by the search engines PolyMeta and Helioid, as well as on FirstGov.gov, the official web portal for the U.S. government. The two most common algorithms used for document clustering are hierarchical clustering and k-means.
In this problem, we’ll be clustering articles published on Daily Kos, an American political blog that publishes news and opinion articles written from a progressive point of view. Daily Kos was founded by Markos Moulitsas in 2002, and as of September 2014, the site had an average weekday traffic of hundreds of thousands of visits.
The file dailykos.csv contains data on 3,430 news articles or blog posts published on Daily Kos in 2004, leading up to the United States presidential election. The leading candidates were incumbent President George W. Bush (Republican) and John Kerry (Democrat). Foreign policy was a dominant topic of the election, specifically the 2003 invasion of Iraq.
Each of the variables in the dataset is a word that has appeared in at least 50 different articles (1,545 words in total). The set of words has been trimmed according to some of the techniques covered in the previous week on text analytics (punctuation has been removed, and stop words have been removed). For each document, the variable values are the number of times that word appeared in the document.
Let’s start by building a hierarchical clustering model. First, read the dataset into R. Then compute the distances between observations (using method="euclidean"), and use hclust to build the model (using method="ward.D"). You should cluster on all of the variables.
# Load in the dataset
dailykos = read.csv("dailykos.csv")
# Hierarchical cluster algorithm
kosDist = dist(dailykos, method="euclidean")
kosHierClust = hclust(kosDist, method="ward.D")
We have many observations (3,430 documents) and many variables (1,545 words), so computing the distance between every pair of observations takes a long time.
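To get a sense of the scale, here is a quick back-of-the-envelope count of the work dist has to do (just a sanity check, assuming dailykos has been loaded as above):
# Number of pairwise distances: n*(n-1)/2 pairs, each computed over all 1,545 word-count variables
n = nrow(dailykos)   # 3,430 documents
n * (n - 1) / 2      # roughly 5.9 million pairwise distances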
# Plots the dendrogram
plot(kosHierClust)
According to the dendrogram, 2 or 3 clusters look like good choices, because there is a lot of vertical space between the horizontal merge lines at those cut points (imagine drawing a horizontal line across the dendrogram where it crosses only 2 or 3 vertical branches).
Thinking about the application, however, 2 or 3 categories would probably be too broad to be useful to a reader. Seven or eight categories seems more reasonable.
Let’s pick 7 clusters. This number is reasonable according to the dendrogram, and also seems reasonable for the application. Use the cutree function to split your data into 7 clusters.
Now, we don’t really want to run tapply on every single variable when we have more than 1,500 of them. Let’s instead use the subset function to subset our data by cluster. Create 7 new datasets, each containing the observations from one of the clusters.
# Plots the dendrogram
plot(kosHierClust)
# Divides it into 7 clusters
rect.hclust(kosHierClust, k = 7, border = "red")
hierGroups = cutree(kosHierClust, k = 7)
# Divides dataset into 7 different subsets
HierCluster1 = subset(dailykos, hierGroups == 1)
HierCluster2 = subset(dailykos, hierGroups == 2)
HierCluster3 = subset(dailykos, hierGroups == 3)
HierCluster4 = subset(dailykos, hierGroups == 4)
HierCluster5 = subset(dailykos, hierGroups == 5)
HierCluster6 = subset(dailykos, hierGroups == 6)
HierCluster7 = subset(dailykos, hierGroups == 7)
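As an aside, the seven subset calls above can be collapsed into a single split call; this is just an equivalent alternative, not required for anything that follows:
# split returns a list of 7 data frames; HierClusters[[1]] is identical to HierCluster1
HierClusters = split(dailykos, hierGroups)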
table(hierGroups)
## hierGroups
## 1 2 3 4 5 6 7
## 1266 321 374 139 407 714 209
# Load knitr for kable, then show the six words with the highest average frequency in cluster 1
library(knitr)
z = tail(sort(colMeans(HierCluster1)))
kable(z)
word | average frequency
---|---
state | 0.7575039
republican | 0.7590837
poll | 0.9036335
democrat | 0.9194313
kerry | 1.0624013
bush | 1.7053712
After running the R command given above, we can see that the most frequent word on average is “bush”. This corresponds to President George W. Bush.
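If only the single most frequent word is needed, a shorter alternative (the same information as the last entry of the sorted vector above) is:
# Name of the word with the largest average frequency in cluster 1 (here, "bush")
names(which.max(colMeans(HierCluster1)))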
# Six words with the highest average frequency in cluster 2
z = tail(sort(colMeans(HierCluster2)))
kable(z)
word | average frequency
---|---
bush | 2.847352
democrat | 2.850467
challenge | 4.096573
vote | 4.398754
poll | 4.847352
november | 10.339564
You can see that the words that best describe Cluster 2 are november, poll, vote, and challenge.
# Six words with the highest average frequency in clusters 3 through 7
z = tail(sort(colMeans(HierCluster3)))
kable(z)
z = tail(sort(colMeans(HierCluster4)))
kable(z)
z = tail(sort(colMeans(HierCluster5)))
kable(z)
z = tail(sort(colMeans(HierCluster6)))
kable(z)
z = tail(sort(colMeans(HierCluster7)))
kable(z)
The most common words in Cluster 7 are dean, kerry, poll, and edward.
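Rather than repeating the same two lines of code for each cluster, one optional shortcut, sketched here with base R's split and lapply, produces all seven summaries at once:
# Top six average word frequencies for every hierarchical cluster in a single call
lapply(split(dailykos, hierGroups), function(d) tail(sort(colMeans(d))))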
Now, run k-means clustering, setting the seed to 1000 right before you run the kmeans function. Again, pick the number of clusters equal to 7. You don’t need to add the iter.max argument.
Subset your data into the 7 clusters (7 new datasets) by using the “cluster” variable of your kmeans output.
# K-means algorithm
set.seed(1000)
kmc = kmeans(dailykos, centers=7)
# Divides the dataset into 7 different subsets for each cluster
KmeansCluster1 = subset(dailykos, kmc$cluster == 1)
KmeansCluster2 = subset(dailykos, kmc$cluster == 2)
KmeansCluster3 = subset(dailykos, kmc$cluster == 3)
KmeansCluster4 = subset(dailykos, kmc$cluster == 4)
KmeansCluster5 = subset(dailykos, kmc$cluster == 5)
KmeansCluster6 = subset(dailykos, kmc$cluster == 6)
KmeansCluster7 = subset(dailykos, kmc$cluster == 7)
# Outputs the number of observations (rows) in each cluster
nrow(KmeansCluster1)
## [1] 146
nrow(KmeansCluster2)
## [1] 144
nrow(KmeansCluster3)
## [1] 277
nrow(KmeansCluster4)
## [1] 2063
nrow(KmeansCluster5)
## [1] 163
nrow(KmeansCluster6)
## [1] 329
nrow(KmeansCluster7)
## [1] 308
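As a sanity check, the same counts are available directly from the kmeans output, since the returned object stores the size of each cluster:
# Cluster sizes straight from the kmeans object; these should match the nrow() values above
kmc$size
table(kmc$cluster)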
# Six words with the highest average frequency in each k-means cluster, starting with cluster 1
z = tail(sort(colMeans(KmeansCluster1)))
kable(z)
word | average frequency
---|---
state | 1.609589
iraq | 1.616438
kerry | 1.636986
administration | 2.664384
presided | 2.767123
bush | 11.431507
z = tail(sort(colMeans(KmeansCluster2)))
kable(z)
word | average frequency
---|---
primaries | 2.319444
democrat | 2.694444
edward | 2.798611
clark | 3.090278
kerry | 4.979167
dean | 8.277778
z = tail(sort(colMeans(KmeansCluster3)))
kable(z)
word | average frequency
---|---
administration | 1.389892
iraqi | 1.610108
american | 1.685921
bush | 2.610108
war | 3.025271
iraq | 4.093863
z = tail(sort(colMeans(KmeansCluster4)))
kable(z)
word | average frequency
---|---
elect | 0.6010664
republican | 0.6175473
kerry | 0.6495395
poll | 0.7474552
democrat | 0.7891420
bush | 1.1473582
z = tail(sort(colMeans(KmeansCluster5)))
kable(z)
word | average frequency
---|---
race | 2.484663
senate | 2.650307
state | 3.521472
parties | 3.619632
republican | 4.638037
democrat | 6.993865
z = tail(sort(colMeans(KmeansCluster6)))
kable(z)
word | average frequency
---|---
democrat | 2.899696
bush | 2.960486
challenge | 4.121581
vote | 4.446809
poll | 4.872340
november | 10.370821
z = tail(sort(colMeans(KmeansCluster7)))
kable(z)
word | average frequency
---|---
presided | 1.324675
voter | 1.334416
campaign | 1.383117
poll | 2.788961
bush | 5.970779
kerry | 6.480519
# Tabulates the hierarchical cluster group vs the k-means cluster group
z = table(hierGroups, kmc$cluster)
kable(z)
hierGroups | k-means 1 | k-means 2 | k-means 3 | k-means 4 | k-means 5 | k-means 6 | k-means 7
---|---|---|---|---|---|---|---
1 | 3 | 11 | 64 | 1045 | 32 | 0 | 111
2 | 0 | 0 | 0 | 0 | 0 | 320 | 1
3 | 85 | 10 | 42 | 79 | 126 | 8 | 24
4 | 10 | 5 | 0 | 0 | 1 | 0 | 123
5 | 48 | 0 | 171 | 145 | 3 | 1 | 39
6 | 0 | 2 | 0 | 712 | 0 | 0 | 0
7 | 0 | 116 | 0 | 82 | 1 | 0 | 10
# z[1:7] is the first column of the table: the observations in k-means cluster 1, split across the 7 hierarchical clusters
cl1 = z[1:7]/sum(z[1:7])
x = cl1
kable(x)
hierarchical cluster | fraction of k-means cluster 1
---|---
1 | 0.0205479
2 | 0.0000000
3 | 0.5821918
4 | 0.0684932
5 | 0.3287671
6 | 0.0000000
7 | 0.0000000
which.max(cl1)
## [1] 3
cl2 = z[8:14]/sum(z[8:14])
x = cl2
kable(x)
hierarchical cluster | fraction of k-means cluster 2
---|---
1 | 0.0763889
2 | 0.0000000
3 | 0.0694444
4 | 0.0347222
5 | 0.0000000
6 | 0.0138889
7 | 0.8055556
which.max(cl2)
## [1] 7
cl3 = z[15:21]/sum(z[15:21])
x = cl3
kable(x)
hierarchical cluster | fraction of k-means cluster 3
---|---
1 | 0.2310469
2 | 0.0000000
3 | 0.1516245
4 | 0.0000000
5 | 0.6173285
6 | 0.0000000
7 | 0.0000000
which.max(cl3)
## [1] 5
cl4 = z[22:28]/sum(z[22:28])
x = cl4
kable(x)
hierarchical cluster | fraction of k-means cluster 4
---|---
1 | 0.5065439
2 | 0.0000000
3 | 0.0382937
4 | 0.0000000
5 | 0.0702860
6 | 0.3451285
7 | 0.0397479
which.max(cl4)
## [1] 1
cl5 = z[29:35]/sum(z[29:35])
x = cl5
kable(x)
hierarchical cluster | fraction of k-means cluster 5
---|---
1 | 0.1963190
2 | 0.0000000
3 | 0.7730061
4 | 0.0061350
5 | 0.0184049
6 | 0.0000000
7 | 0.0061350
which.max(cl5)
## [1] 3
cl6 = z[36:42]/sum(z[36:42])
x = cl6
kable(x)
hierarchical cluster | fraction of k-means cluster 6
---|---
1 | 0.0000000
2 | 0.9726444
3 | 0.0243161
4 | 0.0000000
5 | 0.0030395
6 | 0.0000000
7 | 0.0000000
which.max(cl6)
## [1] 2
cl7 = z[43:49]/sum(z[43:49])
x = cl7
kable(x)
hierarchical cluster | fraction of k-means cluster 7
---|---
1 | 0.3603896
2 | 0.0032468
3 | 0.0779221
4 | 0.3993506
5 | 0.1266234
6 | 0.0000000
7 | 0.0324675
which.max(cl7)
## [1] 4
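The seven computations above work because R stores the table column by column, so z[1:7] is the first column of the 7-by-7 table (k-means cluster 1), z[8:14] is the second column, and so on. If preferred, the same proportions can be computed in one step:
# Column-wise proportions: the share of each k-means cluster that falls in each hierarchical cluster
round(prop.table(z, margin = 2), 4)
# Hierarchical cluster that best matches each k-means cluster
apply(z, 2, which.max)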
From “table(hierGroups, kmc$cluster)”, we read that 116 (80.6%) of the observations in K-Means Cluster 2 also fall in Hierarchical Cluster 7.
From “table(hierGroups, kmc$cluster)”, we read that 171 (61.7%) of the observations in K-Means Cluster 3 also fall in Hierarchical Cluster 5.
From “table(hierGroups, kmc$cluster)”, we read that no single hierarchical cluster contains more than 123 (39.9%) of the observations in K-Means Cluster 7, so this cluster does not correspond cleanly to any hierarchical cluster.
From “table(hierGroups, kmc$cluster)”, we read that 320 (97.3%) of the observations in K-Means Cluster 6 fall in Hierarchical Cluster 2.