Synopsis:

The original data, approximately 5,500 e-scanned articles, were harvested from IBM Watson into the E-scanning database (MongoDB) and saved to disk. Standard text processing and cleaning (removing stop words such as “the” and “and”, common word endings such as “ing”, and punctuation such as “;” and “:”) simplified the three types of text analyses that follow, all performed with R text mining packages.
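
A minimal sketch of the harvest-to-disk step, assuming the mongolite package and hypothetical database, collection, and field names ("escanning", "articles", "text"), might look like:

library(mongolite)
con <- mongo(collection = "articles", db = "escanning",      # hypothetical collection and database names
             url = "mongodb://localhost")
articles <- con$find('{}')                                    # pull every harvested article into a data frame
dir.create("articles", showWarnings = FALSE)                  # local folder for the plain-text copies
for (i in seq_len(nrow(articles))) {
  writeLines(articles$text[i], file.path("articles", paste0(i, ".txt")))  # "text" field is an assumption
}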

DATA PROCESSING
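
The cleaning steps below operate on a tm corpus named docs. A minimal sketch of building it from the articles saved on disk (the "articles" directory name is an assumption carried over from the harvest sketch above) is:

library(tm)
docs <- VCorpus(DirSource("articles", encoding = "UTF-8"),    # read the saved plain-text articles into a corpus
                readerControl = list(language = "en"))
docs <- tm_map(docs, content_transformer(tolower))            # lower-case so stop-word removal below catches "The", "And", etc.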

docs <- tm_map(docs, removePunctuation)                  # strip punctuation
docs <- tm_map(docs, removeNumbers)                      # strip digits
docs <- tm_map(docs, removeWords, stopwords("english"))  # drop English stop words ("the", "and", ...)
docs <- tm_map(docs, stripWhitespace)                    # collapse extra whitespace
docs <- tm_map(docs, stemDocument)                       # remove common word endings (e.g., "ing", "es", "s"); requires the SnowballC package
dtm <- DocumentTermMatrix(docs)                          # document-term matrix used in the further analysis
tdm <- TermDocumentMatrix(docs)                          # term-document matrix (transpose of dtm) used in the further analysis
tdmss <- removeSparseTerms(tdm, 0.06)                    # keep only terms with at most 6% sparsity, i.e., terms appearing in at least 94% of documents
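
As a quick sanity check (not part of the original analysis), the effect of the 6% sparsity threshold can be seen by comparing matrix dimensions before and after pruning:

dim(tdm)    # rows = terms, columns = documents, before pruning
dim(tdmss)  # far fewer terms should remain after removing sparse ones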

ANALYSIS

2. Clustering

Creating a hierarchical cluster dendrogram using rafalib

library("rafalib")                                 # provides myplclust()
d <- dist(as.matrix(tdmss), method = "euclidean")  # distance matrix over the sparsity-reduced term-document matrix
hc <- hclust(d, method = "ward.D2")                # Ward's hierarchical clustering
myplclust(hc, labels = hc$labels)                  # plot the dendrogram
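
A possible follow-up, with k = 5 chosen purely for illustration (not from the original analysis), is to cut the dendrogram into groups and outline them on the plot:

k <- 5                                  # illustrative number of clusters
groups <- cutree(hc, k = k)             # assign each term to one of k clusters
table(groups)                           # cluster sizes
rect.hclust(hc, k = k, border = "red")  # draw the k cluster boundaries on the dendrogram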