topwords

Veronika N
26.20.2014

Calculating top words in a webpage

Application purpose

The application finds the most frequently used words on a web page supplied by the user.

The application can be used for:

  • choosing the right keywords
  • semantic analysis of web page content, especially when the user needs to process the content of multiple web pages automatically

Code development: extract website content

The major steps in developing the code for this application are:

  1. Extract the content of the web page correctly. There are multiple ways to extract content; I implemented it with functions from the RCurl and XML libraries.

  2. Exclude non-informative words. For this I used the stopwords_en list from the lsa library and also removed words shorter than 2 letters or longer than 15 letters (these are usually remnants of HTML code).

  3. Count the most frequently used words on the page. This is easily done by converting the words to a factor and calling the summary function on it.
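
The three steps above can be sketched roughly as follows. This is a minimal illustration, not the application's actual code: the function names extract_text, clean_words and top_words are hypothetical, the XPath expression is just one possible way to pick out visible text, and the short stopword vector in the usage lines is a stand-in for the stopwords_en list from lsa. Step 1 assumes the RCurl and XML packages are installed; steps 2 and 3 use base R only.

```r
# Step 1 (assumes RCurl and XML are installed): download a page and
# return the text inside <body>, skipping <script> and <style> nodes.
extract_text <- function(url) {
  html <- RCurl::getURL(url, followlocation = TRUE)
  doc <- XML::htmlParse(html, asText = TRUE)
  nodes <- XML::xpathSApply(
    doc,
    "//body//text()[not(ancestor::script)][not(ancestor::style)]",
    XML::xmlValue
  )
  paste(nodes, collapse = " ")
}

# Step 2: lower-case the text, split it on anything that is not a letter,
# then drop stopwords and words shorter than 2 or longer than 15 letters.
clean_words <- function(text, stopwords) {
  words <- unlist(strsplit(tolower(text), "[^a-z]+"))
  words <- words[nchar(words) >= 2 & nchar(words) <= 15]
  words[!words %in% stopwords]
}

# Step 3: count the words by turning them into a factor and summarising it,
# then keep the n most frequent ones in a one-column data frame.
top_words <- function(words, n = 15) {
  counts <- sort(summary(factor(words), maxsum = Inf), decreasing = TRUE)
  data.frame(count = head(counts, n))
}

# Usage on a small in-memory text (no network access needed).
# The vector below is a tiny stand-in for lsa's stopwords_en.
stopwords <- c("the", "and", "of", "in", "is", "to")
text <- "The software and the devices: software exploits in the code of devices. Software is code."
result <- top_words(clean_words(text, stopwords), n = 3)
print(result)
```

For a real page, the text would come from extract_text(url) instead of the literal string, and the stopword vector could be replaced by stopwords_en after loading it with data(stopwords_en) from lsa.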

Application examples

As an example, I use a magazine article to demonstrate how the top words convey what the article is about.

Article: “Ghosts in the machine language” http://www.economist.com/news/science-and-technology/21627868-latest-hacks-and-exploits-result-benign-neglect-and-wont-be-last-ghosts-machine

Output

           count
economist     26
software      10
technology    10
devices        8
economics      8
science        8
october        7
world          7
americas       6
business       6
view           6
web            6
exploit        5
code           4
culture        4