§ 1.1 Running the dist function will probably take you a while. Why? Select all that apply.
distance = dist(dailykos, method = "euclidean")
kosclust = hclust(distance, method = "ward.D")
print("We have a lot of observations, so it takes a long time to compute the distance between each pair of observations.")
[1] "We have a lot of observations, so it takes a long time to compute the distance between each pair of observations."
print("We have a lot of variables, so the distance computation is long.")
[1] "We have a lot of variables, so the distance computation is long."
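To make the first reason concrete: dist stores one value per unordered pair of rows, so the work grows roughly with the square of the number of observations, and each pairwise distance also has to touch every variable. The following is a minimal sketch on hypothetical toy data (the matrix `toy` is invented for illustration and is not the dailykos data). The figure 3430 is the total of the cluster sizes reported in § 1.4.

```r
# Hypothetical toy data: 6 observations, 4 variables (not the dailykos data).
set.seed(1)
toy = matrix(rnorm(6 * 4), nrow = 6)

# dist() stores one distance per unordered pair of rows.
d = dist(toy, method = "euclidean")
n_pairs = length(d)
expected_pairs = choose(nrow(toy), 2)   # n * (n - 1) / 2

# The same formula for 3430 observations (the total of the cluster sizes in § 1.4).
pairs_full = choose(3430, 2)

print(c(n_pairs, expected_pairs))
print(pairs_full)
```

Each of those pairwise distances is itself a sum over all variables, which is why both answers apply.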
§ 1.2 Plot the dendrogram of your hierarchical clustering model. Just looking at the dendrogram, which of the following seem like good choices for the number of clusters? Select all that apply.
plot(kosclust)
print("2,3")
[1] "2,3"
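One way to check a candidate number of clusters against the dendrogram is to outline the groups that a given cut would produce. The sketch below uses hypothetical toy data with three well-separated groups rather than the dailykos data. `toy`, `toyclust`, and `toygroups` are names invented for illustration; `rect.hclust` and `cutree` are the standard functions from the stats package.

```r
# Hypothetical toy data: three well-separated groups of 10 points in 2 dimensions.
set.seed(42)
centers = rep(c(0, 10, 20), each = 10)
toy = cbind(x = rnorm(30, mean = centers, sd = 0.5),
            y = rnorm(30, mean = centers, sd = 0.5))

toyclust = hclust(dist(toy, method = "euclidean"), method = "ward.D")

# Draw the dendrogram and outline the clusters obtained by cutting into k = 3 groups.
plot(toyclust)
rect.hclust(toyclust, k = 3, border = "red")

# The same cut expressed as cluster labels.
toygroups = cutree(toyclust, k = 3)
print(table(toygroups))
```

A good choice of k is one where the outlined boxes sit below a long vertical stretch of the tree, meaning the cut line has room to move up or down without changing the number of clusters.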
§ 1.3 In this problem, we are trying to cluster news articles or blog posts into groups. This can be used to show readers categories to choose from when trying to decide what to read. Just thinking about this application, what are good choices for the number of clusters? Select all that apply.
print("7,8")
[1] "7,8"
§ 1.4 How many observations are in cluster 3? Which cluster has the most observations? Which cluster has the fewest observations?
hiergroup = cutree(kosclust, k = 7)
HierCluster = split(dailykos, hiergroup)
sapply(HierCluster, function(p) nrow(p))
1 2 3 4 5 6 7
1266 321 374 139 407 714 209
print("374")
[1] "374"
print("Cluster 1")
[1] "Cluster 1"
print("Cluster 4")
[1] "Cluster 4"
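The three answers can also be read off programmatically rather than by eye. This is a small sketch that copies the cluster sizes from the output above into a named vector. The names `sizes`, `largest`, `smallest`, and `toylabels` are introduced here for illustration only.

```r
# Cluster sizes copied from the output above (clusters 1 to 7).
sizes = c("1" = 1266, "2" = 321, "3" = 374, "4" = 139,
          "5" = 407, "6" = 714, "7" = 209)

largest = names(which.max(sizes))    # cluster with the most observations
smallest = names(which.min(sizes))   # cluster with the fewest observations
total = sum(sizes)                   # total number of observations

print(c(largest = largest, smallest = smallest))
print(total)

# table() gives the same kind of counts directly from a vector of labels,
# shown here on hypothetical labels.
toylabels = c(1, 1, 2, 3, 3, 3)
print(table(toylabels))
```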
§ 1.5 What is the most frequent word in this cluster, in terms of average value?
tail(sort(colMeans(HierCluster[[1]])))
state republican poll democrat kerry bush
0.7575 0.7591 0.9036 0.9194 1.0624 1.7054
print("bush")
[1] "bush"
§ 1.6 Which words best describe cluster 2?
tail(sort(colMeans(HierCluster[[2]])))
bush democrat challenge vote poll november
2.847 2.850 4.097 4.399 4.847 10.340
print("november, poll, vote, challenge")
[1] "november, poll, vote, challenge"
Which cluster could best be described as the cluster related to the Iraq war?
tail(sort(colMeans(HierCluster[[3]])))
elect parties state republican democrat bush
1.647 1.666 2.321 2.524 3.824 4.406
tail(sort(colMeans(HierCluster[[4]])))
campaign voter presided poll bush kerry
1.432 1.540 1.626 3.590 7.835 8.439
tail(sort(colMeans(HierCluster[[5]])))
american presided administration war iraq bush
1.091 1.120 1.231 1.776 2.428 3.941
tail(sort(colMeans(HierCluster[[6]])))
race bush kerry elect democrat poll
0.4580 0.4888 0.5168 0.5350 0.5644 0.5812
tail(sort(colMeans(HierCluster[[7]])))
democrat clark edward poll kerry dean
2.148 2.498 2.608 2.766 3.952 5.804
print("Cluster 5")
[1] "Cluster 5"
In 2004, Howard Dean was one of the candidates for the Democratic nomination for President of the United States, John Kerry was the candidate who won the Democratic nomination, and John Edwards was Kerry's running mate (the vice-presidential nominee). Given this information, which cluster best corresponds to the Democratic Party?
print("Cluster 7")
[1] "Cluster 7"
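The per-cluster calls to tail(sort(colMeans(...))) above can be collapsed into a single pass over all clusters. The sketch below shows the idea on a hypothetical toy word-count table. The data frame `toy`, the labels `toygroups`, and the result `topwords` are invented for illustration and are not the dailykos data.

```r
# Hypothetical toy word-count data: 6 documents, 4 word columns (not the dailykos data).
toy = data.frame(bush = c(5, 4, 0, 1, 0, 0),
                 iraq = c(3, 4, 0, 0, 1, 0),
                 dean = c(0, 0, 6, 5, 0, 2),
                 poll = c(1, 0, 1, 2, 4, 5))
toygroups = c(1, 1, 2, 2, 3, 3)   # hypothetical cluster labels

# For each cluster, keep the two words with the highest average count,
# ordered so that the most frequent word comes last (as tail(sort(...)) does).
topwords = lapply(split(toy, toygroups),
                  function(cl) names(tail(sort(colMeans(cl)), 2)))
print(topwords)
```

With the real data, the same pattern applied to HierCluster would list the top words of all seven clusters at once.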
§ 2.1 How many observations are in Cluster 3? Which cluster has the most observations? Which cluster has the fewest observations?
set.seed(1000)
kmc = kmeans(dailykos, centers=7)
KmeansCluster = split(dailykos, kmc$cluster)
sapply(KmeansCluster, function(p) nrow(p))
1 2 3 4 5 6 7
146 144 277 2063 163 329 308
print("277")
[1] "277"
print("Cluster 4")
[1] "Cluster 4"
print("Cluster 2")
[1] "Cluster 2"
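The call to set.seed matters because kmeans picks its starting centers at random, so cluster numbering and sizes can change from run to run without it. The sketch below illustrates this on hypothetical toy data (`toy`, `km1`, and `km2` are invented names). It also shows that the `size` component returned by kmeans holds the same counts that the split/sapply approach computes.

```r
# Hypothetical toy data: three well-separated groups of 10 points in 2 dimensions.
set.seed(42)
centers = rep(c(0, 10, 20), each = 10)
toy = cbind(x = rnorm(30, mean = centers, sd = 0.5),
            y = rnorm(30, mean = centers, sd = 0.5))

# k-means starts from random centers, so fix the seed before each call.
set.seed(1000)
km1 = kmeans(toy, centers = 3)
set.seed(1000)
km2 = kmeans(toy, centers = 3)

# Same seed, same data: the cluster assignments are identical.
same_labels = identical(km1$cluster, km2$cluster)

# km1$size reports the number of observations in each cluster.
sizes_match = all(km1$size == as.vector(table(km1$cluster)))

print(same_labels)
print(km1$size)
```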
§ 2.2 Which k-means cluster best corresponds to the Iraq War? Which k-means cluster best corresponds to the Democratic Party?
library(magrittr)  # provides the %>% pipe
split(dailykos, kmc$cluster) %>% sapply(function(x)
  x %>% colMeans %>% sort %>% tail %>% names) %>% t
[,1] [,2] [,3] [,4] [,5] [,6]
1 "state" "iraq" "kerry" "administration" "presided" "bush"
2 "primaries" "democrat" "edward" "clark" "kerry" "dean"
3 "administration" "iraqi" "american" "bush" "war" "iraq"
4 "elect" "republican" "kerry" "poll" "democrat" "bush"
5 "race" "senate" "state" "parties" "republican" "democrat"
6 "democrat" "bush" "challenge" "vote" "poll" "november"
7 "presided" "voter" "campaign" "poll" "bush" "kerry"
print("Cluster 3")
[1] "Cluster 3"
print("Cluster 2")
[1] "Cluster 2"
§ 2.3 Which Hierarchical Cluster best corresponds to K-Means Cluster 2?
table(Hierarchical=hiergroup, KMeans=kmc$cluster)
            KMeans
Hierarchical   1   2   3    4   5   6   7
           1   3  11  64 1045  32   0 111
           2   0   0   0    0   0 320   1
           3  85  10  42   79 126   8  24
           4  10   5   0    0   1   0 123
           5  48   0 171  145   3   1  39
           6   0   2   0  712   0   0   0
           7   0 116   0   82   1   0  10
print("7")
§ 2.4 Which Hierarchical Cluster best corresponds to K-Means Cluster 3?
print("5")
§ 2.5 Which Hierarchical Cluster best corresponds to K-Means Cluster 7?
print("No Hierarchical Cluster contains at least half of the points in K-Means Cluster 7")
§ 2.6 Which Hierarchical Cluster best corresponds to K-Means Cluster 6?
print("2")
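The answers to § 2.3 through § 2.6 can be checked directly from the cross-tabulation. The sketch below copies the counts from the table above into a matrix and, for each k-means cluster, finds the hierarchical cluster with the largest overlap and the share of the k-means cluster it covers. The names `overlap`, `best`, and `share` are introduced here for illustration.

```r
# Counts copied from the table above: rows are hierarchical clusters 1-7,
# columns are k-means clusters 1-7.
overlap = matrix(c(  3,  11,  64, 1045,  32,   0, 111,
                     0,   0,   0,    0,   0, 320,   1,
                    85,  10,  42,   79, 126,   8,  24,
                    10,   5,   0,    0,   1,   0, 123,
                    48,   0, 171,  145,   3,   1,  39,
                     0,   2,   0,  712,   0,   0,   0,
                     0, 116,   0,   82,   1,   0,  10),
                 nrow = 7, byrow = TRUE,
                 dimnames = list(Hierarchical = as.character(1:7),
                                 KMeans = as.character(1:7)))

# For each k-means cluster (column): the hierarchical cluster with the largest
# overlap, and the share of that k-means cluster's points it contains.
best = apply(overlap, 2, which.max)
share = apply(overlap, 2, max) / colSums(overlap)

print(data.frame(KMeans = 1:7,
                 BestHierarchical = unname(best),
                 Share = round(unname(share), 3)))
```

A share below 0.5 means no hierarchical cluster contains at least half of that k-means cluster, which is the situation described for K-Means Cluster 7.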