Synopsis:
Harvested the raw data for approximately 5,500 e-scanned articles from IBM Watson into the E-scanning database (MongoDB) and saved them to disk. Standard text processing and cleaning (removing stop words such as "the" and "and", punctuation such as ";" and ":", and common word endings such as "ing") simplified the three types of text analyses that follow, carried out with R text-mining packages.
DATA PROCESSING
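The cleaning steps below operate on a corpus object named docs; a minimal sketch of building it with the tm package from the articles saved on disk (the directory name is a placeholder):
library(tm)  # text-mining framework used for all processing below
docs <- VCorpus(DirSource("articles/"), readerControl = list(language = "en"))  # "articles/" is a placeholder for the disk location of the saved articles
docs <- tm_map(docs, content_transformer(tolower))  # lower-case the text so stop-word removal matches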
docs <- tm_map(docs, removePunctuation)
docs <- tm_map(docs, removeNumbers)
docs <- tm_map(docs, removeWords, stopwords("english"))
docs <- tm_map(docs, stripWhitespace)
docs <- tm_map(docs, stemDocument) #Removing common word endings (e.g., "ing", "es", "s")
dtm <- DocumentTermMatrix(docs) # document-term matrix used in the analyses below
tdm <- TermDocumentMatrix(docs) # term-document matrix used in the analyses below
tdmss <- removeSparseTerms(tdm, 0.06) # keep only terms with at most 6% sparsity, i.e. terms appearing in at least 94% of documents
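A quick way to verify the matrices before the analyses is to look at their dimensions and most frequent terms; a minimal check using tm functions (the frequency threshold is illustrative):
dim(dtm)  # number of documents x number of terms
findFreqTerms(dtm, lowfreq = 50)  # terms occurring at least 50 times across the corpus (illustrative threshold)
inspect(tdmss)  # summary of the sparsity-reduced term-document matrix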
ANALYSIS
2. Clustering
Creating a hierarchical cluster dendrogram using rafalib
d <- dist(as.matrix(tdmss), method = "euclidean") # distance matrix on the sparsity-reduced term-document matrix created above
hc <- hclust(d, method = "ward.D2") # Ward's minimum-variance clustering
library("rafalib")
myplclust(hc, labels=hc$labels)
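To turn the dendrogram into discrete groups, the tree can be cut at a chosen number of clusters and the labels coloured by cluster; a minimal sketch, with k = 5 purely for illustration:
groups <- cutree(hc, k = 5)  # assign each term to one of 5 clusters (illustrative k)
myplclust(hc, labels = hc$labels, lab.col = groups)  # colour dendrogram labels by cluster membership
table(groups)  # cluster sizes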
