Motivation

Analyze the underlying structure of documents (text) in a quantitative manner.

The dataset consists of the Wikipedia pages of several animation movies. Clustering was performed to group similar movies together. The output of such an analysis can be used to recommend similar movie titles.

Method

Required Libraries

We start by loading the libraries needed for text mining and clustering, along with the wordcloud library for visualizing the clustered results.

library(tm)
Loading required package: NLP
library(proxy)

Attaching package: ‘proxy’

The following objects are masked from ‘package:stats’:

    as.dist, dist

The following object is masked from ‘package:base’:

    as.matrix
library(RTextTools)
Loading required package: SparseM

Attaching package: ‘SparseM’

The following object is masked from ‘package:base’:

    backsolve
library(fpc)   
library(wordcloud)
Loading required package: RColorBrewer
library(cluster)
library(stringi)

Reading Data

Next we load the data into an R Corpus data structure. In this case the data is a set of plot summaries from various animation movies, obtained from Wikipedia.

path = "/home/saqib/ml_at_berkeley/DocumentClustering/corpus"
dir = DirSource(paste(path,"/Archive/",sep=""), encoding = "UTF-8")
corpus = Corpus(dir)
summary(corpus)
                               Length Class             Mode
Despicable Me                  2      PlainTextDocument list
Despicable Me 2                2      PlainTextDocument list
Finding Dory                   2      PlainTextDocument list
Finding Nemo                   2      PlainTextDocument list
Frozen                         2      PlainTextDocument list
Ice Age                        2      PlainTextDocument list
Ice Age_ Dawn of the Dinosaurs 2      PlainTextDocument list
Kung Fu Panda                  2      PlainTextDocument list
Kung Fu Panda 2                2      PlainTextDocument list
Kung Fu Panda 3                2      PlainTextDocument list
Madagascar                     2      PlainTextDocument list
Madagascar_ Escape 2 Africa    2      PlainTextDocument list
Minions                        2      PlainTextDocument list
Shrek 2                        2      PlainTextDocument list
Shrek Forever After            2      PlainTextDocument list
Shrek the Third                2      PlainTextDocument list

Document Term Matrix (DTM)

The next step is to create a Document-Term Matrix (DTM). A DTM is a matrix that lists all occurrences of words in the corpus. In a DTM, documents are represented by rows and the terms (or words) by columns. If a word occurs in a particular document n times, then the matrix entry corresponding to that row and column is n; if it doesn’t occur at all, the entry is 0.
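
As a toy illustration (a hypothetical two-document corpus, using the tm package loaded above):

toy <- VCorpus(VectorSource(c("the cat sat on the mat",
                              "the dog sat")))
# rows are documents, columns are terms, entries are raw counts;
# with tm's defaults, words shorter than 3 characters (like "on") are dropped
as.matrix(DocumentTermMatrix(toy))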

ndocs <- length(corpus)
# ignore extremely rare words i.e. terms that appear in less than 1% of the documents
minTermFreq <- ndocs * 0.01
# ignore overly common words i.e. terms that appear in more than 50% of the documents
maxTermFreq <- ndocs * .5
dtm = DocumentTermMatrix(corpus,
                         control = list(
                                        stopwords = TRUE, 
                                        wordLengths=c(4, 15),
                                        removePunctuation = T,
                                        removeNumbers = T,
                                        #stemming = T,
                                        bounds = list(global = c(minTermFreq, maxTermFreq))
                                       ))
# save the DTM for offline inspection
write.csv(as.matrix(dtm), "test.csv")
dtm.matrix = as.matrix(dtm)

Let’s inspect the generated Document-Term Matrix.

Notice that the sparsity is fairly high. This is good, as it indicates that the documents (movie synopses) contain distinctive terms that are not present in all the documents. These distinguishing terms are what allow a distance measure to separate the documents.
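
The reported sparsity can also be verified directly from the matrix form computed above:

# fraction of zero entries; should agree with the Sparsity value printed by inspect()
sum(dtm.matrix == 0) / length(dtm.matrix)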

inspect(dtm)
<<DocumentTermMatrix (documents: 16, terms: 2324)>>
Non-/sparse entries: 3687/33497
Sparsity           : 90%
Maximal term length: 15
Weighting          : term frequency (tf)
Sample             :
                                Terms
Docs                             alex charming dory fiona manny marlin nemo shen shifu shrek
  Despicable Me 2                   0        0    0     0     0      0    0    0     0     0
  Finding Nemo                      0        0   14     0     0     18   19    0     0     0
  Frozen                            0        0    0     0     0      0    0    0     0     0
  Ice Age_ Dawn of the Dinosaurs    0        0    0     0    10      0    0    0     0     0
  Kung Fu Panda                     0        0    0     0     0      0    0    0    15     0
  Kung Fu Panda 2                   0        0    0     0     0      0    0   27     4     0
  Kung Fu Panda 3                   0        0    0     0     0      0    0    0     7     0
  Madagascar_ Escape 2 Africa      14        0    0     0     0      0    0    0     0     0
  Minions                           0        0    0     0     0      0    0    0     0     0
  Shrek the Third                   0       19    0     7     0      0    0    0     0    22

Term frequency-inverse document frequency (tf-idf)

In the above DTM you will notice that frequently occurring terms have high values. However, a term’s high frequency within a document means little if that term also appears frequently in the other documents in the corpus. In other words, terms that occur frequently within a document but rarely across the corpus should receive a higher weighting, as these words are assumed to carry more meaning in relation to that document.

To achieve this, we downweight the terms that occur frequently across the documents. This is done by computing the tf-idf statistic.
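
For reference, a common formulation of the statistic (tm’s weightTfIdf computes a variant, using log base 2 and, when normalize = TRUE, term frequencies divided by document length) is:

    tf-idf(t, d) = tf(t, d) × log(N / df(t))

where tf(t, d) is the frequency of term t in document d, N is the number of documents in the corpus, and df(t) is the number of documents containing t.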

dtm <- weightTfIdf(dtm, normalize = TRUE)
dtm.matrix = as.matrix(dtm)
write.csv(as.matrix(dtm), "test.csv")

Below are the most important terms for the movie Finding Dory, as determined using tf-idf weighting.

# top 15 terms for Finding Dory (document row 3 of the corpus)
head(sort(dtm.matrix[3, ], decreasing = TRUE), n = 15)
wordcloud(colnames(dtm.matrix), dtm.matrix[3, ], max.words = 200)

Calculating Distance

Next we calculate the Euclidean distance between the documents. This distance is what the clustering algorithm uses to group the documents.
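
Concretely, each document is a vector of term weights over the same vocabulary, and for two such vectors a and b the Euclidean distance is

    d(a, b) = sqrt( Σᵢ (aᵢ − bᵢ)² )

so documents that share heavily weighted terms end up close together.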

First, the DTM needs to be converted to a standard R matrix that can be consumed by dist.

m <- as.matrix(dtm)
distMatrix <- dist(m, method = "euclidean")
# cosine distance (via the proxy package loaded above) is a common alternative for text:
#distMatrix <- dist(m, method = "cosine")

Clustering

The first algorithm we’ll look at is hierarchical clustering.

The R function we’ll use is hclust, which performs agglomerative hierarchical clustering. Here’s a simplified description of how it works:

1. Assign each document to its own (single-member) cluster.
2. Find the pair of clusters that are closest to each other (according to dist) and merge them. You now have one cluster fewer than before.
3. Compute the distances between the new cluster and each of the old clusters.
4. Repeat steps 2 and 3 until you have a single cluster containing all documents.

groups <- hclust(distMatrix,method="ward.D")
plot(groups, cex=0.9, hang=-1)
rect.hclust(groups, k=5)
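
The dendrogram above shows which plot summaries group together. To extract the memberships programmatically, we can cut the tree into the same number of groups (a minimal sketch; k = 5 matches the rectangles drawn above):

clusters <- cutree(groups, k = 5)
# list the movie titles that fall in each cluster
split(names(clusters), clusters)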
