Included in the setup is the file “htmlToText”, written by Tony Breyal, which provides functionality to strip tags from HTML files (the helper is sourced just after the library calls below).
library(tm)
library(RCurl)
library(XML)
library(wordcloud)
library(RColorBrewer)
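The helper function is loaded by sourcing the script before use; the file name below is an assumption, so point it at wherever the script is saved locally.
source("htmlToText.R")  # Tony Breyal's htmlToText() helper (assumed local file name)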
Load the data, remove the tags, and make sure all the text is in readable form. Once this is done, create the corpus.
html2txt <- lapply("1500_49.html", htmlToText)
# Drop characters that cannot be converted to ASCII
html2txtclean <- sapply(html2txt, function(x) iconv(x, "latin1", "ASCII", sub = ""))
corpus <- Corpus(VectorSource(html2txtclean))
Clean the document by removing common stop words (from word lists in the tm library), converting all words to lower case, and removing punctuation, numbers, and extra white space.
skipWords <- function(x) removeWords(x, stopwords("english"))
funcs <- list(tolower, removePunctuation, removeNumbers, stripWhitespace, skipWords)
wd <- tm_map(corpus, PlainTextDocument)
# Apply all cleaning functions to the converted corpus (not the raw one)
wd <- tm_map(wd, FUN = tm_reduce, tmFuns = funcs)
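If this pipeline fails under a newer version of tm, where base functions such as tolower must be wrapped in content_transformer(), the following step-by-step alternative is a sketch of an equivalent cleaning pass on the same corpus:
wd <- tm_map(corpus, content_transformer(tolower))   # lower-case all text
wd <- tm_map(wd, removePunctuation)                  # drop punctuation
wd <- tm_map(wd, removeNumbers)                      # drop digits
wd <- tm_map(wd, removeWords, stopwords("english"))  # drop common stop words
wd <- tm_map(wd, stripWhitespace)                    # collapse extra whitespace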
Look at word frequencies for words that have between 3 and 10 characters, and list all the words that appear at least 10 times.
wd.dtm1 <- TermDocumentMatrix(wd, control = list(wordLengths = c(3,10)))
newstopwords <- findFreqTerms(wd.dtm1, lowfreq = 10)
newstopwords
## [1] "articles" "edge" "edges" "federal" "glass" "inch"
## [7] "mandrel" "metal" "probe" "section" "shall" "sharp"
## [13] "surface" "tape" "test" "the" "used"
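To see the raw counts behind these terms, the rows of the term-document matrix can be summed; a minimal sketch:
freq <- sort(rowSums(as.matrix(wd.dtm1)), decreasing = TRUE)  # total count per term
freq[freq >= 10]                                              # counts of the frequent terms above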
Create a plot to find hierarchical clusters of the most frequent words (association).
# Drop the very frequent terms found above, then remove sparse terms
wd.dtm2 <- wd.dtm1[!(wd.dtm1$dimnames$Terms %in% newstopwords), ]
wd.dtm3 <- removeSparseTerms(wd.dtm2, sparse = 0.7)
inspect(wd.dtm3)
wd.dtm.df <- as.data.frame(as.matrix(wd.dtm3))
## <<TermDocumentMatrix (terms: 480, documents: 1)>>
## Non-/sparse entries: 480/0
## Sparsity : 0%
## Maximal term length: 10
## Weighting : term frequency (tf)
## Sample :
## Docs
## Terms 1
## collar 7
## degrees 7
## dimension 8
## for 7
## force 8
## intended 8
## minor 8
## opening 8
## paragraph 8
## toys 9
wd.dtm.df.scale <- scale(wd.dtm.df)
d <- dist(wd.dtm.df.scale, method = "euclidean")
fit <- hclust(d, method = "ward")
## The "ward" method has been renamed to "ward.D"; note new "ward.D2"
plot(fit)
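To turn the dendrogram into explicit groups, the tree can be cut at a chosen number of clusters; the value k = 5 below is an arbitrary choice for illustration.
groups <- cutree(fit, k = 5)             # assign each term to one of 5 clusters
rect.hclust(fit, k = 5, border = "red")  # outline the clusters on the dendrogram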
Draw a word cloud to see word relevance within the document.
# Term frequencies across the whole document, sorted in decreasing order
m <- as.matrix(t(wd.dtm1))
word_freqs <- sort(colSums(m), decreasing = TRUE)
dm <- data.frame(word = names(word_freqs), freq = word_freqs)
wordcloud(dm$word, dm$freq, random.order = FALSE, colors=brewer.pal(8,"Dark2"))
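If the cloud gets too crowded, wordcloud() can be limited to the most relevant terms through its min.freq and max.words arguments; for example:
# Keep only words appearing at least 5 times, capped at 50 words
wordcloud(dm$word, dm$freq, min.freq = 5, max.words = 50,
          random.order = FALSE, colors = brewer.pal(8, "Dark2"))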
From this analysis, what seems important to check revolves around the following words: edge, sharp, mandrel, metal, glass, tape, federal, and section.
On a second level, we find the following words: toys, intended, surface, minor, collar, opening, and force.
For further analysis and comprehension of the text, I would look at bigram frequencies, and with more time I would do a context analysis with other tools; a small bigram sketch follows below.
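As a starting point for that follow-up, here is a minimal base-R sketch of bigram counting on the cleaned text; it assumes no extra tokenizer package and reuses the html2txtclean object from above.
tokens <- unlist(strsplit(tolower(html2txtclean), "[^a-z]+"))  # split cleaned text into words
tokens <- tokens[nchar(tokens) > 0]                            # drop empty strings
bigrams <- paste(head(tokens, -1), tail(tokens, -1))           # adjacent word pairs
head(sort(table(bigrams), decreasing = TRUE), 10)              # ten most frequent bigrams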