Document clustering, or text clustering, is a popular application of clustering algorithms. A web search engine like Google often returns thousands of results for even a simple query. For example, typing the search term “jaguar” into Google returns around 200 million results. This makes it difficult to browse or find relevant information, especially when the search term has multiple meanings: a search for “jaguar” might be looking for information about the animal, the car, or the Jacksonville Jaguars football team.
Clustering methods can be used to automatically group search results into categories, making it easier to find relevant results. This approach is used by the search engines PolyMeta and Helioid, as well as on FirstGov.gov, the official web portal for the U.S. government. The two most common algorithms used for document clustering are hierarchical clustering and k-means.
In this problem, we’ll be clustering articles published on Daily Kos, an American political blog that publishes news and opinion articles written from a progressive point of view. Daily Kos was founded by Markos Moulitsas in 2002, and as of September 2014, the site had an average weekday traffic of hundreds of thousands of visits.
The file dailykos.csv contains data on 3,430 news articles or blog posts published on Daily Kos in 2004, leading up to the United States presidential election. The leading candidates were incumbent President George W. Bush (Republican) and John Kerry (Democrat). Foreign policy was a dominant topic of the election, specifically the 2003 invasion of Iraq.
Each of the variables in the dataset is a word that has appeared in at least 50 different articles (1,545 words in total). The set of words has been trimmed according to some of the techniques covered in the previous week on text analytics (punctuation has been removed, and stop words have been removed). For each document, the variable values are the number of times that word appeared in the document.
Let’s start by building a hierarchical clustering model. First, read the dataset into R. Then compute the distances between observations (using method="euclidean"), and use hclust to build the model (using method="ward.D"). You should cluster on all of the variables.
# Load in the dataset
dailykos = read.csv("dailykos.csv")
# Hierarchical cluster algorithm
kosDist = dist(dailykos, method="euclidean")
kosHierClust = hclust(kosDist, method="ward.D")
We have many observations (3,430 documents) and many variables (1,545 words), so computing the distance between every pair of observations takes a long time.
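To get a sense of the scale, here is a quick back-of-the-envelope count of the work dist has to do (just a sanity check, assuming dailykos has been loaded as above):
# Number of pairwise distances: n*(n-1)/2 pairs, each computed over all 1,545 word-count variables
n = nrow(dailykos)   # 3,430 documents
n * (n - 1) / 2      # roughly 5.9 million pairwise distances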
# Plots the dendrogram
plot(kosHierClust)
According to the dendrogram, 2 or 3 clusters look like good choices, because there is a lot of vertical space between the horizontal merge lines at those cut points (imagine drawing a horizontal line across the dendrogram where it crosses only 2 or 3 vertical branches).
Thinking about the application, however, 2 or 3 categories would probably be too broad to be useful to a reader. Seven or eight categories seems more reasonable.
Let’s pick 7 clusters. This number is reasonable according to the dendrogram, and also seems reasonable for the application. Use the cutree function to split your data into 7 clusters.
Now, we don’t really want to run tapply on every single variable when we have more than 1,500 of them. Let’s instead use the subset function to subset our data by cluster. Create 7 new datasets, each containing the observations from one of the clusters.
# Plots the dendrogram
plot(kosHierClust)
# Divides it into 7 clusters
rect.hclust(kosHierClust, k = 7, border = "red")
hierGroups = cutree(kosHierClust, k = 7)
# Divides dataset into 7 different subsets
HierCluster1 = subset(dailykos, hierGroups == 1)
HierCluster2 = subset(dailykos, hierGroups == 2)
HierCluster3 = subset(dailykos, hierGroups == 3)
HierCluster4 = subset(dailykos, hierGroups == 4)
HierCluster5 = subset(dailykos, hierGroups == 5)
HierCluster6 = subset(dailykos, hierGroups == 6)
HierCluster7 = subset(dailykos, hierGroups == 7)
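As an aside, the seven subset calls above can be collapsed into a single split call; this is just an equivalent alternative, not required for anything that follows:
# split returns a list of 7 data frames; HierClusters[[1]] is identical to HierCluster1
HierClusters = split(dailykos, hierGroups)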
table(hierGroups)
## hierGroups
## 1 2 3 4 5 6 7
## 1266 321 374 139 407 714 209
# Load knitr for kable, then show the six words with the highest average frequency in cluster 1
library(knitr)
z = tail(sort(colMeans(HierCluster1)))
kable(z)
word | average frequency
---|---
state | 0.7575039
republican | 0.7590837
poll | 0.9036335
democrat | 0.9194313
kerry | 1.0624013
bush | 1.7053712
After running the R command given above, we can see that the most frequent word on average is “bush”. This corresponds to President George W. Bush.
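If only the single most frequent word is needed, a shorter alternative (the same information as the last entry of the sorted vector above) is:
# Name of the word with the largest average frequency in cluster 1 (here, "bush")
names(which.max(colMeans(HierCluster1)))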
# Six words with the highest average frequency in cluster 2
z = tail(sort(colMeans(HierCluster2)))
kable(z)
word | average frequency
---|---
bush | 2.847352
democrat | 2.850467
challenge | 4.096573
vote | 4.398754
poll | 4.847352
november | 10.339564
You can see that the words that best describe Cluster 2 are november, poll, vote, and challenge.
# Six words with the highest average frequency in clusters 3 through 7
z = tail(sort(colMeans(HierCluster3)))
kable(z)
z = tail(sort(colMeans(HierCluster4)))
kable(z)
z = tail(sort(colMeans(HierCluster5)))
kable(z)
z = tail(sort(colMeans(HierCluster6)))
kable(z)
z = tail(sort(colMeans(HierCluster7)))
kable(z)
The most common words in Cluster 7 are dean, kerry, poll, and edward.
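Rather than repeating the same two lines of code for each cluster, one optional shortcut, sketched here with base R's split and lapply, produces all seven summaries at once:
# Top six average word frequencies for every hierarchical cluster in a single call
lapply(split(dailykos, hierGroups), function(d) tail(sort(colMeans(d))))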
Now, run k-means clustering, setting the seed to 1000 right before you run the kmeans function. Again, pick the number of clusters equal to 7. You don’t need to add the iter.max argument.
Subset your data into the 7 clusters (7 new datasets) by using the “cluster” variable of your kmeans output.
# K-means algorithm
set.seed(1000)
kmc = kmeans(dailykos, centers=7)
# Divides the dataset into 7 different subsets for each cluster
KmeansCluster1 = subset(dailykos, kmc$cluster == 1)
KmeansCluster2 = subset(dailykos, kmc$cluster == 2)
KmeansCluster3 = subset(dailykos, kmc$cluster == 3)
KmeansCluster4 = subset(dailykos, kmc$cluster == 4)
KmeansCluster5 = subset(dailykos, kmc$cluster == 5)
KmeansCluster6 = subset(dailykos, kmc$cluster == 6)
KmeansCluster7 = subset(dailykos, kmc$cluster == 7)
# Outputs the number of observations (rows) in each cluster
nrow(KmeansCluster1)
## [1] 146
nrow(KmeansCluster2)
## [1] 144
nrow(KmeansCluster3)
## [1] 277
nrow(KmeansCluster4)
## [1] 2063
nrow(KmeansCluster5)
## [1] 163
nrow(KmeansCluster6)
## [1] 329
nrow(KmeansCluster7)
## [1] 308
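As a sanity check, the same counts are available directly from the kmeans output, since the returned object stores the size of each cluster:
# Cluster sizes straight from the kmeans object; these should match the nrow() values above
kmc$size
table(kmc$cluster)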
# Six words with the highest average frequency in each k-means cluster, starting with cluster 1
z = tail(sort(colMeans(KmeansCluster1)))
kable(z)
word | average frequency
---|---
state | 1.609589
iraq | 1.616438
kerry | 1.636986
administration | 2.664384
presided | 2.767123
bush | 11.431507
z = tail(sort(colMeans(KmeansCluster2)))
kable(z)
word | average frequency
---|---
primaries | 2.319444
democrat | 2.694444
edward | 2.798611
clark | 3.090278
kerry | 4.979167
dean | 8.277778
z = tail(sort(colMeans(KmeansCluster3)))
kable(z)
word | average frequency
---|---
administration | 1.389892
iraqi | 1.610108
american | 1.685921
bush | 2.610108
war | 3.025271
iraq | 4.093863
z = tail(sort(colMeans(KmeansCluster4)))
kable(z)
word | average frequency
---|---
elect | 0.6010664
republican | 0.6175473
kerry | 0.6495395
poll | 0.7474552
democrat | 0.7891420
bush | 1.1473582
z = tail(sort(colMeans(KmeansCluster5)))
kable(z)
word | average frequency
---|---
race | 2.484663
senate | 2.650307
state | 3.521472
parties | 3.619632
republican | 4.638037
democrat | 6.993865
z = tail(sort(colMeans(KmeansCluster6)))
kable(z)
word | average frequency
---|---
democrat | 2.899696
bush | 2.960486
challenge | 4.121581
vote | 4.446809
poll | 4.872340
november | 10.370821
z = tail(sort(colMeans(KmeansCluster7)))
kable(z)
word | average frequency
---|---
presided | 1.324675
voter | 1.334416
campaign | 1.383117
poll | 2.788961
bush | 5.970779
kerry | 6.480519
# Tabulates the hierarchical cluster group vs the k-means cluster group
z = table(hierGroups, kmc$cluster)
kable(z)
hierGroups | k-means 1 | k-means 2 | k-means 3 | k-means 4 | k-means 5 | k-means 6 | k-means 7
---|---|---|---|---|---|---|---
1 | 3 | 11 | 64 | 1045 | 32 | 0 | 111
2 | 0 | 0 | 0 | 0 | 0 | 320 | 1
3 | 85 | 10 | 42 | 79 | 126 | 8 | 24
4 | 10 | 5 | 0 | 0 | 1 | 0 | 123
5 | 48 | 0 | 171 | 145 | 3 | 1 | 39
6 | 0 | 2 | 0 | 712 | 0 | 0 | 0
7 | 0 | 116 | 0 | 82 | 1 | 0 | 10
# z[1:7] is the first column of the table: the observations in k-means cluster 1, split across the 7 hierarchical clusters
cl1 = z[1:7]/sum(z[1:7])
x = cl1
kable(x)
hierarchical cluster | fraction of k-means cluster 1
---|---
1 | 0.0205479
2 | 0.0000000
3 | 0.5821918
4 | 0.0684932
5 | 0.3287671
6 | 0.0000000
7 | 0.0000000
which.max(cl1)
## [1] 3
cl2 = z[8:14]/sum(z[8:14])
x = cl2
kable(x)
hierarchical cluster | fraction of k-means cluster 2
---|---
1 | 0.0763889
2 | 0.0000000
3 | 0.0694444
4 | 0.0347222
5 | 0.0000000
6 | 0.0138889
7 | 0.8055556
which.max(cl2)
## [1] 7
cl3 = z[15:21]/sum(z[15:21])
x = cl3
kable(x)
hierarchical cluster | fraction of k-means cluster 3
---|---
1 | 0.2310469
2 | 0.0000000
3 | 0.1516245
4 | 0.0000000
5 | 0.6173285
6 | 0.0000000
7 | 0.0000000
which.max(cl3)
## [1] 5
cl4 = z[22:28]/sum(z[22:28])
x = cl4
kable(x)
hierarchical cluster | fraction of k-means cluster 4
---|---
1 | 0.5065439
2 | 0.0000000
3 | 0.0382937
4 | 0.0000000
5 | 0.0702860
6 | 0.3451285
7 | 0.0397479
which.max(cl4)
## [1] 1
cl5 = z[29:35]/sum(z[29:35])
x = cl5
kable(x)
hierarchical cluster | fraction of k-means cluster 5
---|---
1 | 0.1963190
2 | 0.0000000
3 | 0.7730061
4 | 0.0061350
5 | 0.0184049
6 | 0.0000000
7 | 0.0061350
which.max(cl5)
## [1] 3
cl6 = z[36:42]/sum(z[36:42])
x = cl6
kable(x)
hierarchical cluster | fraction of k-means cluster 6
---|---
1 | 0.0000000
2 | 0.9726444
3 | 0.0243161
4 | 0.0000000
5 | 0.0030395
6 | 0.0000000
7 | 0.0000000
which.max(cl6)
## [1] 2
cl7 = z[43:49]/sum(z[43:49])
x = cl7
kable(x)
hierarchical cluster | fraction of k-means cluster 7
---|---
1 | 0.3603896
2 | 0.0032468
3 | 0.0779221
4 | 0.3993506
5 | 0.1266234
6 | 0.0000000
7 | 0.0324675
which.max(cl7)
## [1] 4
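The seven computations above work because R stores the table column by column, so z[1:7] is the first column of the 7-by-7 table (k-means cluster 1), z[8:14] is the second column, and so on. If preferred, the same proportions can be computed in one step:
# Column-wise proportions: the share of each k-means cluster that falls in each hierarchical cluster
round(prop.table(z, margin = 2), 4)
# Hierarchical cluster that best matches each k-means cluster
apply(z, 2, which.max)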
From “table(hierGroups, kmc$cluster)”, we read that 116 (80.6%) of the observations in K-Means Cluster 2 also fall in Hierarchical Cluster 7.
From “table(hierGroups, kmc$cluster)”, we read that 171 (61.7%) of the observations in K-Means Cluster 3 also fall in Hierarchical Cluster 5.
From “table(hierGroups, kmc$cluster)”, we read that no single hierarchical cluster contains more than 123 (39.9%) of the observations in K-Means Cluster 7, so this cluster does not correspond cleanly to any hierarchical cluster.
From “table(hierGroups, kmc$cluster)”, we read that 320 (97.3%) of the observations in K-Means Cluster 6 fall in Hierarchical Cluster 2.