Jose A. Ruiperez Valiente
February 27, 2016
In this project I have developed a word cloud generator based on web scraping of Wikipedia pages. It has the following features:
require(rvest)
wiki_url = "https://en.wikipedia.org/wiki/Statistics"
# Take the body of the page
text <- html_text(read_html(wiki_url) %>% html_nodes("body"))
# Clean
text <- gsub("\n", "", gsub("\t", "", text))
# Substring of the full text scraped from the Wikipedia page
substr(text, 182, 305)
[1] "Statistics used in standardized testing assessment are shown. The scales include standard deviations, cumulative percentages"
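One detail of the cleaning step is worth noting: replacing "\n" and "\t" with an empty string can join the last word of one line to the first word of the next. A minimal sketch of an alternative, assuming it is acceptable for the character offsets used in the substr() call above to shift, is to collapse each run of whitespace into a single space. The `sample_text` string and the `clean_whitespace` helper are hypothetical and only for illustration:

```r
# Hypothetical sample standing in for the scraped page body.
sample_text <- "Statistics\n\tis the study\nof data"

# Approach used above: delete newlines and tabs outright.
glued <- gsub("\n", "", gsub("\t", "", sample_text))

# Alternative sketch: collapse each run of whitespace into one space,
# then trim the ends, so that words on adjacent lines stay separate.
clean_whitespace <- function(x) {
  trimws(gsub("[[:space:]]+", " ", x))
}
spaced <- clean_whitespace(sample_text)

print(glued)   # words on adjacent lines are joined
print(spaced)  # words remain separated by single spaces
```

Keeping the words apart matters for the later term counts, since two joined words would be counted as one unknown term.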
Next, I use the Corpus representation from the tm package:
require(tm)
myCorpus = Corpus(VectorSource(text))
myDTM = TermDocumentMatrix(myCorpus,control = list(minWordLength = 1))
m = as.matrix(myDTM)
# Show the 5 most frequent words on the Statistics Wikipedia page
sort(rowSums(m), decreasing = TRUE)[1:5]
 statistics statistical        data  hypothesis probability
         87          79          77          44          35
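A named frequency vector like this one is what a word cloud is drawn from. As a rough sketch, assuming the wordcloud package is installed (it is not loaded anywhere in the code shown here), the five counts could be plotted as follows. The `freq` vector is typed in by hand from the output above, and the `scale` and `colors` values are arbitrary choices:

```r
# Top-5 term counts copied by hand from the output above.
freq <- c(statistics = 87, statistical = 79, data = 77,
          hypothesis = 44, probability = 35)

# Assumption: the wordcloud package is available; skip plotting otherwise.
if (requireNamespace("wordcloud", quietly = TRUE)) {
  set.seed(1234)  # word placement is random, so fix the seed
  wordcloud::wordcloud(words = names(freq), freq = freq,
                       min.freq = 1, scale = c(4, 0.8),
                       colors = c("grey40", "steelblue", "darkred"))
}
```

In the full pipeline, `freq` would instead be the complete result of sort(rowSums(m), decreasing = TRUE), with `max.words` used to limit how many terms are drawn.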