Veronika N
26.20.2014
Calculating top words in a webpage
The application calculates the most frequently used words on a web page provided by the user.
The application can be used for:
The major steps in developing the code for this application are:
Extract the content of the web page properly. There are multiple ways to extract content; I implemented it with functions from the RCurl and XML libraries.
Exclude non-informative words. For this I used the stopwords_en list from the lsa library and also removed words shorter than 2 letters or longer than 15 letters (these are usually leftover HTML code).
Calculate the most frequently used words on the page. This is easily done by converting the words into a factor and calling the summary function on it.
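The three steps above can be sketched roughly as follows. This is a minimal illustration, not the application's actual code: the function names (`fetch_page_text`, `tokenize`, `clean_words`, `top_words`) are hypothetical, and the XPath filter that skips script and style blocks and the letters-only tokenizer are assumptions. The small example at the end uses an inline string and a tiny stop-word list so it runs without network access.

```r
# Step 1: download a page and pull out its text.
# Needs the RCurl and XML packages. The XPath expression (skip <script>
# and <style> content) is an assumption, not the original implementation.
fetch_page_text <- function(url) {
  html <- RCurl::getURL(url, followlocation = TRUE)
  doc <- XML::htmlParse(html, asText = TRUE)
  nodes <- XML::xpathSApply(
    doc,
    "//text()[not(ancestor::script)][not(ancestor::style)]",
    XML::xmlValue
  )
  paste(unlist(nodes), collapse = " ")
}

# Helper (assumed): split text into lower-case words made of a-z only.
tokenize <- function(text) {
  words <- unlist(strsplit(tolower(text), "[^a-z]+"))
  words[nzchar(words)]
}

# Step 2: drop stop words and words shorter than 2 or longer than 15 letters.
clean_words <- function(words, stopwords, min_len = 2, max_len = 15) {
  keep <- !(words %in% stopwords) &
    nchar(words) >= min_len & nchar(words) <= max_len
  words[keep]
}

# Step 3: count words via factor + summary and keep the n most frequent.
# maxsum is raised so summary() does not lump rare words into "(Other)".
top_words <- function(words, n = 15) {
  f <- factor(words)
  counts <- summary(f, maxsum = nlevels(f))
  top <- head(sort(counts, decreasing = TRUE), n)
  data.frame(count = as.integer(top), row.names = names(top))
}

# Offline example with made-up text and a made-up stop-word list.
sample_text <- "The code is a code of the Web; x web CODE!"
sample_stop <- c("the", "a", "is", "of")
print(top_words(clean_words(tokenize(sample_text), sample_stop), n = 2))

# With a real page (requires network access and the lsa package):
# data(stopwords_en, package = "lsa")
# page_text <- fetch_page_text(url)  # url: the address supplied by the user
# top_words(clean_words(tokenize(page_text), stopwords_en))
```

Passing the stop-word list as an argument keeps the filtering step independent of lsa, so another list can be swapped in for other languages.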
As an example, I use a magazine article to demonstrate how the top words give an idea of what the article is about.
Article: “Ghosts in the machine language” http://www.economist.com/news/science-and-technology/21627868-latest-hacks-and-exploits-result-benign-neglect-and-wont-be-last-ghosts-machine
word count
economist 26
software 10
technology 10
devices 8
economics 8
science 8
october 7
world 7
americas 6
business 6
view 6
web 6
exploit 5
code 4
culture 4