DOCUMENT CLUSTERING WITH DAILY KOS

Reproducible notes for Document Clustering with Daily Kos

Anil Kumar


PRELIMINARIES

Load the libraries required for the assignment:

library("tm")
library("SnowballC")

library("caTools")
library("rpart")
library("rpart.plot")
library("ROCR")
library("randomForest")

INTRODUCTION

Document clustering, or text clustering, is a very popular application of clustering algorithms. A web search engine, like Google, often returns thousands of results for a simple query. For example, if you type the search term “jaguar” into Google, around 200 million results are returned. This makes it very difficult to browse or find relevant information, especially if the search term has multiple meanings. If we search for “jaguar”, we might be looking for information about the animal, the car, or the Jacksonville Jaguars football team.

Clustering methods can be used to automatically group search results into categories, making it easier to find relevant results. This method is used in the search engines PolyMeta and Helioid, as well as on FirstGov.gov, the official Web portal for the U.S. government. The two most common algorithms used for document clustering are hierarchical and k-means.

We'll be clustering articles published on Daily Kos, an American political blog that publishes news and opinion articles written from a progressive point of view.

HIERARCHICAL CLUSTERING

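Let's start with hierarchical clustering. Compute the pairwise distances between documents with the dist function using the Euclidean method, then build the clustering model with hclust using the ward.D method: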

dailykos = read.csv("dailykos.csv")
kosDist = dist(dailykos, method="euclidean")
kosClust = hclust(kosDist, method="ward.D")
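To see what these calls do on a small scale, here is a minimal sketch on a made-up word-count table. The `toy` data frame and its values are assumptions for illustration and are not taken from dailykos.csv. dist computes the pairwise Euclidean distances between rows, hclust builds the merge tree from them, and cutree cuts that tree into groups.

```r
# Hypothetical toy data: 4 documents, 3 word counts (not from dailykos.csv)
toy = data.frame(bush  = c(5, 4, 0, 0),
                 kerry = c(3, 4, 0, 1),
                 iraq  = c(0, 0, 6, 5))

# Pairwise Euclidean distances between documents (rows)
toyDist = dist(toy, method = "euclidean")

# Agglomerative clustering on those distances, same method as in the notes
toyClust = hclust(toyDist, method = "ward.D")

# Cut the tree into 2 groups; documents with similar counts share a label
toyGroups = cutree(toyClust, k = 2)
print(toyGroups)
```

According to the hclust documentation, "ward.D2" squares the dissimilarities before updating and is the variant that implements Ward's criterion when given Euclidean distances. These notes keep "ward.D" throughout.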

Plot the dendrogram of the hierarchical clustering model.

plot(kosClust)

Use the cutree function to split the data into 7 clusters.

HierarchicalCluster = cutree(kosClust, k = 7)

We don't want to run tapply on every single variable when there are over 1,000 of them. Instead, use the subset function to create 7 new datasets, each containing the observations from one of the clusters.

Cluster1 = subset(dailykos, HierarchicalCluster == 1)
Cluster2 = subset(dailykos, HierarchicalCluster == 2)
Cluster3 = subset(dailykos, HierarchicalCluster == 3)
Cluster4 = subset(dailykos, HierarchicalCluster == 4)
Cluster5 = subset(dailykos, HierarchicalCluster == 5)
Cluster6 = subset(dailykos, HierarchicalCluster == 6)
Cluster7 = subset(dailykos, HierarchicalCluster == 7)

We now have seven datasets, one for each cluster.

To count the observations in each cluster, use the nrow function on each of these new datasets, or tabulate the cluster assignments directly:

table(HierarchicalCluster)
## HierarchicalCluster
##    1    2    3    4    5    6    7 
## 1266  321  374  139  407  714  209
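The counts from table can be cross-checked against nrow on each subset. A short sketch, assuming a small made-up data frame and assignment vector (`toy` and `groups` are hypothetical names, not objects from the notes):

```r
# Hypothetical toy data and cluster labels (not from dailykos.csv)
toy = data.frame(bush = c(5, 4, 0, 0, 2),
                 iraq = c(0, 0, 6, 5, 1))
groups = c(1, 1, 2, 2, 3)

# One data frame per cluster label
toyClusters = split(toy, groups)

# nrow on each subset gives the same counts as table(groups)
clusterSizes = sapply(toyClusters, nrow)
print(clusterSizes)
print(table(groups))
```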

Let's look at the top 6 words in each cluster. colMeans computes the mean frequency of each word within the cluster, sort puts those means in increasing order, and tail returns the last 6 values, which are the most frequent words. The same command works for the other clusters.

tail(sort(colMeans(Cluster1)))
##      state republican       poll   democrat      kerry       bush 
##  0.7575039  0.7590837  0.9036335  0.9194313  1.0624013  1.7053712
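The three nested calls can be unpacked step by step. A minimal sketch on a made-up data frame (`toy` and its counts are assumptions for illustration):

```r
# Hypothetical toy cluster: 3 documents, 4 word counts
toy = data.frame(bush  = c(5, 4, 0),
                 kerry = c(3, 4, 1),
                 iraq  = c(0, 0, 6),
                 poll  = c(1, 0, 0))

# Step 1: mean frequency of each word across the documents
wordMeans = colMeans(toy)

# Step 2: sort in increasing order, so the most frequent words come last
sortedMeans = sort(wordMeans)

# Step 3: keep the last n entries (tail defaults to n = 6; 2 is used here)
topWords = tail(sortedMeans, 2)
print(topWords)
```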

K-MEANS CLUSTERING

Now run k-means clustering with the number of clusters set to 7. Set the random seed first so that the result is reproducible:

set.seed(1000)
KmeansCluster = kmeans(dailykos, centers=7)
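The seed matters because kmeans picks its starting centers at random. The sketch below illustrates this on made-up data (`toy`, `fitA`, and `fitB` are hypothetical names): two runs with the same seed give identical cluster assignments.

```r
# Hypothetical toy data: two groups of 20 points each
set.seed(1)
toy = rbind(matrix(rnorm(40, mean = 0), ncol = 2),
            matrix(rnorm(40, mean = 5), ncol = 2))

# Same seed before each call, so both runs start from the same centers
set.seed(1000)
fitA = kmeans(toy, centers = 2)
set.seed(1000)
fitB = kmeans(toy, centers = 2)

# The cluster assignments are a vector with one label per row
sameResult = identical(fitA$cluster, fitB$cluster)
print(sameResult)
print(table(fitA$cluster))
```

kmeans also accepts an nstart argument that tries several random starts and keeps the best one; the notes use the default of a single start.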

As before, subset the data into seven datasets, one for each k-means cluster:

KmeansCluster1 = subset(dailykos, KmeansCluster$cluster == 1)
KmeansCluster2 = subset(dailykos, KmeansCluster$cluster == 2)
KmeansCluster3 = subset(dailykos, KmeansCluster$cluster == 3)
KmeansCluster4 = subset(dailykos, KmeansCluster$cluster == 4)
KmeansCluster5 = subset(dailykos, KmeansCluster$cluster == 5)
KmeansCluster6 = subset(dailykos, KmeansCluster$cluster == 6)
KmeansCluster7 = subset(dailykos, KmeansCluster$cluster == 7)
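# Equivalent one-step form: a list of seven data frames, stored under a new name
# so that the kmeans result in KmeansCluster is not overwritten
KmeansClusters = split(dailykos, KmeansCluster$cluster)

First cluster:

#KmeansClusters[[1]]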

The six most frequent words in each cluster can be found with the same command used for the hierarchical clusters:

tail(sort(colMeans(KmeansCluster1)))
##          state           iraq          kerry administration       presided 
##       1.609589       1.616438       1.636986       2.664384       2.767123 
##           bush 
##      11.431507
tail(sort(colMeans(KmeansCluster2)))
## primaries  democrat    edward     clark     kerry      dean 
##  2.319444  2.694444  2.798611  3.090278  4.979167  8.277778
tail(sort(colMeans(KmeansCluster3)))
## administration          iraqi       american           bush            war 
##       1.389892       1.610108       1.685921       2.610108       3.025271 
##           iraq 
##       4.093863
tail(sort(colMeans(KmeansCluster4)))
##      elect republican      kerry       poll   democrat       bush 
##  0.6010664  0.6175473  0.6495395  0.7474552  0.7891420  1.1473582
tail(sort(colMeans(KmeansCluster5)))
##       race     senate      state    parties republican   democrat 
##   2.484663   2.650307   3.521472   3.619632   4.638037   6.993865
tail(sort(colMeans(KmeansCluster6)))
##  democrat      bush challenge      vote      poll  november 
##  2.899696  2.960486  4.121581  4.446809  4.872340 10.370821
tail(sort(colMeans(KmeansCluster7)))
## presided    voter campaign     poll     bush    kerry 
## 1.324675 1.334416 1.383117 2.788961 5.970779 6.480519

Which Hierarchical Cluster best corresponds to K-Means Cluster 2?

Which Hierarchical Cluster best corresponds to K-Means Cluster 3?

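For this type of question, cross-tabulate the two sets of cluster assignments with table. Each k-means cluster corresponds best to the hierarchical cluster that shares the most documents with it.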

#table(HierarchicalCluster, KmeansCluster$cluster)
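As an illustration of how to read such a cross-tabulation, here is a sketch on two made-up assignment vectors (`hier` and `km` are hypothetical stand-ins for the real cluster assignments). Rows are hierarchical clusters, columns are k-means clusters, and the largest entry in each column marks the best match.

```r
# Hypothetical cluster labels for 8 documents
hier = c(1, 1, 1, 2, 2, 3, 3, 3)
km   = c(2, 2, 2, 1, 1, 3, 3, 1)

# Rows: hierarchical clusters, columns: k-means clusters
crossTab = table(hier, km)
print(crossTab)

# For each k-means cluster, the row label with the most shared documents
bestMatch = apply(crossTab, 2, function(col) rownames(crossTab)[which.max(col)])
print(bestMatch)
```

Note that KmeansCluster$cluster in the command above only exists while KmeansCluster still holds the object returned by kmeans.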