Included in the setup is the file “htmlToText”, written by Tony Breyal, which provides functionality to strip tags from HTML files (the helper is sourced just after the library calls below).
library(tm)
library(RCurl)
library(XML)
library(wordcloud)
library(RColorBrewer)
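The helper function is loaded by sourcing the script before use; the file name below is an assumption, so point it at wherever the script is saved locally.
source("htmlToText.R")  # Tony Breyal's htmlToText() helper (assumed local file name)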
Load the data, remove the tags, and make sure all the text is in readable form. Once this is done, create the corpus.
html2txt <- lapply("1500_49.html", htmlToText)
# Drop characters that cannot be converted to ASCII
html2txtclean <- sapply(html2txt, function(x) iconv(x, "latin1", "ASCII", sub = ""))
corpus <- Corpus(VectorSource(html2txtclean))
Clean the document by removing common stop words (from word lists in the tm library), converting all words to lower case, and removing punctuation, numbers, and extra white space.
skipWords <- function(x) removeWords(x, stopwords("english"))
funcs <- list(tolower, removePunctuation, removeNumbers, stripWhitespace, skipWords)
wd <- tm_map(corpus, PlainTextDocument)
# Apply all cleaning functions to the converted corpus (not the raw one)
wd <- tm_map(wd, FUN = tm_reduce, tmFuns = funcs)
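If this pipeline fails under a newer version of tm, where base functions such as tolower must be wrapped in content_transformer(), the following step-by-step alternative is a sketch of an equivalent cleaning pass on the same corpus:
wd <- tm_map(corpus, content_transformer(tolower))   # lower-case all text
wd <- tm_map(wd, removePunctuation)                  # drop punctuation
wd <- tm_map(wd, removeNumbers)                      # drop digits
wd <- tm_map(wd, removeWords, stopwords("english"))  # drop common stop words
wd <- tm_map(wd, stripWhitespace)                    # collapse extra whitespace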
Look at word frequencies for words that have between 3 and 10 characters, and list all the words that appear at least 10 times.
wd.dtm1 <- TermDocumentMatrix(wd, control = list(wordLengths = c(3,10)))
newstopwords <- findFreqTerms(wd.dtm1, lowfreq = 10)
newstopwords
## [1] "articles" "edge" "edges" "federal" "glass" "inch"
## [7] "mandrel" "metal" "probe" "section" "shall" "sharp"
## [13] "surface" "tape" "test" "the" "used"
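To see the raw counts behind these terms, the rows of the term-document matrix can be summed; a minimal sketch:
freq <- sort(rowSums(as.matrix(wd.dtm1)), decreasing = TRUE)  # total count per term
freq[freq >= 10]                                              # counts of the frequent terms above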
Create a plot to find hierarchical clusters of the most frequent words (association).
# Drop the very frequent terms found above, then remove sparse terms
wd.dtm2 <- wd.dtm1[!(wd.dtm1$dimnames$Terms %in% newstopwords), ]
wd.dtm3 <- removeSparseTerms(wd.dtm2, sparse = 0.7)
inspect(wd.dtm3)
wd.dtm.df <- as.data.frame(as.matrix(wd.dtm3))
## <<TermDocumentMatrix (terms: 480, documents: 1)>>
## Non-/sparse entries: 480/0
## Sparsity : 0%
## Maximal term length: 10
## Weighting : term frequency (tf)
## Sample :
## Docs
## Terms 1
## collar 7
## degrees 7
## dimension 8
## for 7
## force 8
## intended 8
## minor 8
## opening 8
## paragraph 8
## toys 9
wd.dtm.df.scale <- scale(wd.dtm.df)
d <- dist(wd.dtm.df.scale, method = "euclidean")
fit <- hclust(d, method = "ward")
## The "ward" method has been renamed to "ward.D"; note new "ward.D2"
plot(fit)
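To turn the dendrogram into explicit groups, the tree can be cut at a chosen number of clusters; the value k = 5 below is an arbitrary choice for illustration.
groups <- cutree(fit, k = 5)             # assign each term to one of 5 clusters
rect.hclust(fit, k = 5, border = "red")  # outline the clusters on the dendrogram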
Draw a word cloud to see word relevance within the document.
# Term frequencies across the whole document, sorted in decreasing order
m <- as.matrix(t(wd.dtm1))
word_freqs <- sort(colSums(m), decreasing = TRUE)
dm <- data.frame(word = names(word_freqs), freq = word_freqs)
wordcloud(dm$word, dm$freq, random.order = FALSE, colors=brewer.pal(8,"Dark2"))
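If the cloud gets too crowded, wordcloud() can be limited to the most relevant terms through its min.freq and max.words arguments; for example:
# Keep only words appearing at least 5 times, capped at 50 words
wordcloud(dm$word, dm$freq, min.freq = 5, max.words = 50,
          random.order = FALSE, colors = brewer.pal(8, "Dark2"))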
From this analysis, what seems important to check revolves around the following words: edge, sharp, mandrel, metal, glass, tape, federal, and section.
On a second level, we find the following words: toys, intended, surface, minor, collar, opening, and force.
For further analysis and comprehension of the text, I would look at bigram frequencies, and with more time I would do a context analysis with other tools; a small bigram sketch follows below.
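As a starting point for that follow-up, here is a minimal base-R sketch of bigram counting on the cleaned text; it assumes no extra tokenizer package and reuses the html2txtclean object from above.
tokens <- unlist(strsplit(tolower(html2txtclean), "[^a-z]+"))  # split cleaned text into words
tokens <- tokens[nchar(tokens) > 0]                            # drop empty strings
bigrams <- paste(head(tokens, -1), tail(tokens, -1))           # adjacent word pairs
head(sort(table(bigrams), decreasing = TRUE), 10)              # ten most frequent bigrams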