Background Information on the Dataset

Document clustering, or text clustering, is a very popular application of clustering algorithms. A web search engine, like Google, often returns thousands of results for a simple query. For example, if you type the search term “jaguar” into Google, around 200 million results are returned. This makes it very difficult to browse or find relevant information, especially if the search term has multiple meanings. If we search for “jaguar”, we might be looking for information about the animal, the car, or the Jacksonville Jaguars football team.

Clustering methods can be used to automatically group search results into categories, making it easier to find relevant results. This method is used in the search engines PolyMeta and Helioid, as well as on FirstGov.gov, the official Web portal for the U.S. government. The two most common algorithms used for document clustering are hierarchical clustering and k-means.

In this problem, we’ll be clustering articles published on Daily Kos, an American political blog that publishes news and opinion articles written from a progressive point of view. Daily Kos was founded by Markos Moulitsas in 2002, and as of September 2014, the site had an average weekday traffic of hundreds of thousands of visits.

The file dailykos.csv contains data on 3,430 news articles and blog posts that were published on Daily Kos in 2004, leading up to the United States presidential election. The leading candidates were incumbent President George W. Bush (Republican) and John Kerry (Democrat). Foreign policy was a dominant topic of the election, specifically the 2003 invasion of Iraq.

Each of the variables in the dataset is a word that appears in at least 50 different articles (1,545 words in total). The set of words has been trimmed according to some of the techniques covered in the previous week on text analytics (punctuation and stop words have been removed). For each document, the variable values are the number of times that word appeared in the document.

R Exercises

Hierarchical Clustering

Let’s start by building a hierarchical clustering model. First, read the dataset into R. Then compute the distances (using method="euclidean"), and use hclust to build the model (using method="ward.D"). You should cluster on all of the variables.

# Load in the dataset
dailykos = read.csv("dailykos.csv")
# Hierarchical cluster algorithm
kosDist = dist(dailykos, method="euclidean")
kosHierClust = hclust(kosDist, method="ward.D")
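
Before clustering, it's worth a quick sanity check that the data matches the description above; the expected dimensions come from the dataset description (3,430 documents and 1,545 words):

# Sanity check: one row per document, one column per word
dim(dailykos)
## [1] 3430 1545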

Running the dist function will probably take you a while. Why?

We have a large number of observations, so there are many pairs of documents to compare, and we have a large number of variables, so computing each pairwise distance takes longer.
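
To make this concrete: dist computes a distance for every pair of documents, and each distance sums over all 1,545 word counts. With n = 3,430 documents, that is n(n-1)/2 pairwise distances:

# Number of pairwise distances between 3,430 documents
choose(3430, 2)
## [1] 5880735

That is nearly 5.9 million distances, each computed in 1,545 dimensions.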

Plot the dendrogram of your hierarchical clustering model. Just looking at the dendrogram, which of the following seem like good choices for the number of clusters?

# Plots the dendrogram
plot(kosHierClust)

According to the dendrogram, 2 and 3 look like good choices for the number of clusters, because there is a lot of vertical space between the merge points at those heights: you can draw a horizontal line across the dendrogram that crosses exactly 2 or 3 vertical branches.
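
One way to see this on the plot is to outline the clusters that a given cut would produce (a sketch; rect.hclust draws boxes around the clusters obtained by cutting the tree at the chosen k):

# Outline the 2-cluster and 3-cluster cuts on the dendrogram
plot(kosHierClust)
rect.hclust(kosHierClust, k = 2, border = "blue")
rect.hclust(kosHierClust, k = 3, border = "green")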

In this problem, we are trying to cluster news articles or blog posts into groups. This can be used to show readers categories to choose from when trying to decide what to read. Just thinking about this application, what are good choices for the number of clusters?

Thinking about the application, it is probably better to show the reader more than 2 or 3 categories, since those would likely be too broad to be useful. Seven or eight categories seems more reasonable.

Choosing 7 clusters

Let’s pick 7 clusters. This number is reasonable according to the dendrogram, and also seems reasonable for the application. Use the cutree function to split your data into 7 clusters.

Now, we don’t really want to run tapply on every single variable when we have over 1,000 different variables. Let’s instead use the subset function to subset our data by cluster. Create 7 new datasets, each containing the observations from one of the clusters.

How many observations are in cluster 3?

# Plots the dendrogram
plot(kosHierClust)
# Outlines the 7 clusters on the dendrogram
rect.hclust(kosHierClust, k = 7, border = "red")

hierGroups = cutree(kosHierClust, k = 7)
# Divides dataset into 7 different subsets
HierCluster1 = subset(dailykos, hierGroups == 1)

HierCluster2 = subset(dailykos, hierGroups == 2)

HierCluster3 = subset(dailykos, hierGroups == 3)

HierCluster4 = subset(dailykos, hierGroups == 4)

HierCluster5 = subset(dailykos, hierGroups == 5)

HierCluster6 = subset(dailykos, hierGroups == 6)

HierCluster7 = subset(dailykos, hierGroups == 7)
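
As an aside, the seven subset calls above can be collapsed into a single split call, which returns a list with one data frame per cluster (an equivalent alternative, not required by the exercise):

# One-line alternative: a list of per-cluster data frames
hierClusters = split(dailykos, hierGroups)
# hierClusters[[3]] is then identical to HierCluster3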

table(hierGroups)
## hierGroups
##    1    2    3    4    5    6    7 
## 1266  321  374  139  407  714  209
From the table, Cluster 3 contains 374 observations.

What is the most frequent word in Cluster 1, in terms of average value? Enter the word exactly as it appears in the output:
library(knitr)   # provides kable for formatted tables
# Sorts the words used in the cluster by average frequency
z = tail(sort(colMeans(HierCluster1)))
kable(z)

word          avg. count
state          0.7575039
republican     0.7590837
poll           0.9036335
democrat       0.9194313
kerry          1.0624013
bush           1.7053712

After running the R command given above, we can see that the most frequent word on average is “bush”. This corresponds to President George W. Bush.
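
Since the same tail(sort(colMeans(...))) pattern is repeated for every cluster below, a small helper function can save typing (topWords is a hypothetical name, not part of the exercise):

# Helper: the six most frequent words in a cluster, by average count
topWords = function(cluster) tail(sort(colMeans(cluster)))
topWords(HierCluster1)   # reproduces the table above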

Which words best describe cluster 2?
# Sorts the words used in the cluster by average frequency
z = tail(sort(colMeans(HierCluster2)))
kable(z)

word         avg. count
bush           2.847352
democrat       2.850467
challenge      4.096573
vote           4.398754
poll           4.847352
november      10.339564

You can see that the words that best describe Cluster 2 are november, poll, vote, and challenge.

In 2004, Howard Dean was one of the candidates for the Democratic presidential nomination, John Kerry was the candidate who won the nomination, and John Edwards was Kerry's running mate (the vice-presidential nominee). Given this information, which cluster best corresponds to the Democratic Party?
# Sorts the words used in each of the remaining clusters
z = tail(sort(colMeans(HierCluster3)))
kable(z)
z = tail(sort(colMeans(HierCluster4)))
kable(z)
z = tail(sort(colMeans(HierCluster5)))
kable(z)
z = tail(sort(colMeans(HierCluster6)))
kable(z)
z = tail(sort(colMeans(HierCluster7)))
kable(z)

The most common words in Cluster 7 are dean, kerry, poll, and edward, so Hierarchical Cluster 7 best corresponds to the Democratic Party.

K-Means Clustering

Now, run k-means clustering, setting the seed to 1000 right before you run the kmeans function. Again, pick the number of clusters equal to 7. You don’t need to add the iter.max argument.

Subset your data into the 7 clusters (7 new datasets) by using the "cluster" variable of your kmeans output.

# K-means algorithm
set.seed(1000)
kmc = kmeans(dailykos, centers=7)
# Divides the dataset into 7 different subsets for each cluster
KmeansCluster1 = subset(dailykos, kmc$cluster == 1)

KmeansCluster2 = subset(dailykos, kmc$cluster == 2)

KmeansCluster3 = subset(dailykos, kmc$cluster == 3)

KmeansCluster4 = subset(dailykos, kmc$cluster == 4)

KmeansCluster5 = subset(dailykos, kmc$cluster == 5)

KmeansCluster6 = subset(dailykos, kmc$cluster == 6)

KmeansCluster7 = subset(dailykos, kmc$cluster == 7)

How many observations are in Cluster 3?

# Outputs the number of rows (observations) in Cluster 3
nrow(KmeansCluster3)
## [1] 277

Which cluster has the fewest number of observations?

# Outputs the number of rows in each cluster
nrow(KmeansCluster1)
## [1] 146
nrow(KmeansCluster2)
## [1] 144
nrow(KmeansCluster3)
## [1] 277
nrow(KmeansCluster4)
## [1] 2063
nrow(KmeansCluster5)
## [1] 163
nrow(KmeansCluster6)
## [1] 329
nrow(KmeansCluster7)
## [1] 308
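
A quicker alternative to the seven nrow calls is to tabulate the cluster assignments directly; the counts are the same as those shown above:

# All seven cluster sizes in a single call
table(kmc$cluster)
## 
##    1    2    3    4    5    6    7 
##  146  144  277 2063  163  329  308

Either way, Cluster 2, with 144 observations, has the fewest.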

Which k-means cluster best corresponds to the Iraq War?

# Sorts the words used in each cluster by average frequency
z = tail(sort(colMeans(KmeansCluster1)))
kable(z)

word             avg. count
state              1.609589
iraq               1.616438
kerry              1.636986
administration     2.664384
presided           2.767123
bush              11.431507

z = tail(sort(colMeans(KmeansCluster2)))
kable(z)

word             avg. count
primaries          2.319444
democrat           2.694444
edward             2.798611
clark              3.090278
kerry              4.979167
dean               8.277778

z = tail(sort(colMeans(KmeansCluster3)))
kable(z)

word             avg. count
administration     1.389892
iraqi              1.610108
american           1.685921
bush               2.610108
war                3.025271
iraq               4.093863

z = tail(sort(colMeans(KmeansCluster4)))
kable(z)

word             avg. count
elect              0.6010664
republican         0.6175473
kerry              0.6495395
poll               0.7474552
democrat           0.7891420
bush               1.1473582

z = tail(sort(colMeans(KmeansCluster5)))
kable(z)

word             avg. count
race               2.484663
senate             2.650307
state              3.521472
parties            3.619632
republican         4.638037
democrat           6.993865

z = tail(sort(colMeans(KmeansCluster6)))
kable(z)

word             avg. count
democrat           2.899696
bush               2.960486
challenge          4.121581
vote               4.446809
poll               4.872340
november          10.370821

z = tail(sort(colMeans(KmeansCluster7)))
kable(z)

word             avg. count
presided           1.324675
voter              1.334416
campaign           1.383117
poll               2.788961
bush               5.970779
kerry              6.480519

Cluster 3, whose top words include iraq, war, iraqi, and bush, best corresponds to the Iraq War.

Which k-means cluster best corresponds to the Democratic Party? (Remember that we are looking for the names of the key Democratic Party leaders.)

The tables shown above already list the top words in each k-means cluster, so there is no need to re-run the commands. Cluster 2 contains the words dean, kerry, clark, and edward, the names of the leading Democratic candidates, so K-Means Cluster 2 best corresponds to the Democratic Party.

For the rest of this problem, we’ll ask you to compare how observations were assigned to clusters in the two different methods. Use the table function to compare the cluster assignment of hierarchical clustering to the cluster assignment of k-means clustering.

# Tabulates the hierarchical cluster groups (rows) against the k-means cluster groups (columns)
z = table(hierGroups, kmc$cluster)
kable(z)

hierGroups     1     2     3     4     5     6     7
1              3    11    64  1045    32     0   111
2              0     0     0     0     0   320     1
3             85    10    42    79   126     8    24
4             10     5     0     0     1     0   123
5             48     0   171   145     3     1    39
6              0     2     0   712     0     0     0
7              0   116     0    82     1     0    10
# Column 1 of z: observations in k-means cluster 1, split across the hierarchical clusters
cl1 = z[1:7]/sum(z[1:7])
kable(cl1)

hierGroups   proportion
1             0.0205479
2             0.0000000
3             0.5821918
4             0.0684932
5             0.3287671
6             0.0000000
7             0.0000000

which.max(cl1)
## [1] 3

# Column 2 of z: k-means cluster 2
cl2 = z[8:14]/sum(z[8:14])
kable(cl2)

hierGroups   proportion
1             0.0763889
2             0.0000000
3             0.0694444
4             0.0347222
5             0.0000000
6             0.0138889
7             0.8055556

which.max(cl2)
## [1] 7

# Column 3 of z: k-means cluster 3
cl3 = z[15:21]/sum(z[15:21])
kable(cl3)

hierGroups   proportion
1             0.2310469
2             0.0000000
3             0.1516245
4             0.0000000
5             0.6173285
6             0.0000000
7             0.0000000

which.max(cl3)
## [1] 5

# Column 4 of z: k-means cluster 4
cl4 = z[22:28]/sum(z[22:28])
kable(cl4)

hierGroups   proportion
1             0.5065439
2             0.0000000
3             0.0382937
4             0.0000000
5             0.0702860
6             0.3451285
7             0.0397479

which.max(cl4)
## [1] 1

# Column 5 of z: k-means cluster 5
cl5 = z[29:35]/sum(z[29:35])
kable(cl5)

hierGroups   proportion
1             0.1963190
2             0.0000000
3             0.7730061
4             0.0061350
5             0.0184049
6             0.0000000
7             0.0061350

which.max(cl5)
## [1] 3

# Column 6 of z: k-means cluster 6
cl6 = z[36:42]/sum(z[36:42])
kable(cl6)

hierGroups   proportion
1             0.0000000
2             0.9726444
3             0.0243161
4             0.0000000
5             0.0030395
6             0.0000000
7             0.0000000

which.max(cl6)
## [1] 2

# Column 7 of z: k-means cluster 7
cl7 = z[43:49]/sum(z[43:49])
kable(cl7)

hierGroups   proportion
1             0.3603896
2             0.0032468
3             0.0779221
4             0.3993506
5             0.1266234
6             0.0000000
7             0.0324675

which.max(cl7)
## [1] 4
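
The seven blocks above can also be condensed with prop.table, which converts each column of z (a k-means cluster) into proportions; apply then extracts the dominant hierarchical cluster and its share. This is just a compact equivalent of the cl1 through cl7 computations, with matching outputs:

# For each k-means cluster: the hierarchical cluster with the largest overlap, and its share
props = prop.table(z, 2)
apply(props, 2, which.max)
## 1 2 3 4 5 6 7 
## 3 7 5 1 3 2 4
round(apply(props, 2, max), 3)
##     1     2     3     4     5     6     7 
## 0.582 0.806 0.617 0.507 0.773 0.973 0.399
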
Which Hierarchical Cluster best corresponds to K-Means Cluster 2?

From table(hierGroups, kmc$cluster), we read that 116 of the 144 observations (80.6%) in K-Means Cluster 2 also fall in Hierarchical Cluster 7.

Which Hierarchical Cluster best corresponds to K-Means Cluster 3?

From table(hierGroups, kmc$cluster), we read that 171 of the 277 observations (61.7%) in K-Means Cluster 3 also fall in Hierarchical Cluster 5.

Which Hierarchical Cluster best corresponds to K-Means Cluster 7?

From table(hierGroups, kmc$cluster), we read that no single hierarchical cluster contains a clear majority of K-Means Cluster 7: the largest overlap is 123 of 308 observations (39.9%), in Hierarchical Cluster 4.

Which Hierarchical Cluster best corresponds to K-Means Cluster 6?

From table(hierGroups, kmc$cluster), we read that 320 of the 329 observations (97.3%) in K-Means Cluster 6 fall in Hierarchical Cluster 2.