Introduction to word clouds and ways to create them

The word cloud is a popular visual device in the advertising and marketing world. It is a picture of many words arranged randomly, where the most important aspects of a product or service appear in the largest fonts and less important words are shown in progressively smaller fonts. Word clouds are used on websites, blogs and social media posts and advertisements, as well as in agencies and seminars. They are also useful for highlighting the prominent words in a piece of text. In recent years, word clouds have often been referred to as "tag clouds", because tag clouds are widely used in search engine optimization.

Word clouds are also used to visualize word frequency in a given text. The process of forming a word cloud includes:

  1. Cleaning the text
  2. Extracting tag words from the cleaned text
  3. Counting the frequency of tag words
  4. Arranging the tag words in a wordcloud

There are several libraries that can be used to form a word cloud. This document explains the process of forming a word cloud by parsing a blog article whose URL is given. It assumes that the blog is written in well-structured HTML5, with the blog text inside the article tag.

Introduction of the packages/libraries used and their installation

A number of packages are used to extract, transform and form the word cloud. The packages, each with a general introduction, are as follows:

rvest

This package makes it easy to scrape web pages in R. Its description, according to RStudio's documentation, is "Wrappers around the 'xml2' and 'httr' packages to make it easy to download, then manipulate, HTML and XML".

This package is not part of base R, so it needs to be installed:

install.packages("rvest", repos = "http://cran.us.r-project.org")
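
As a quick illustration of rvest (a minimal sketch; example.com is just a placeholder URL), the following reads a page and extracts the text of its title tag:

  library(rvest)
  # Read a page and pull the text of its <title> tag
  page <- read_html("https://example.com")
  html_text(html_nodes(page, "title"))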

dplyr

This package aims to provide the following (from RStudio's documentation):

  • Identify the most important data manipulation verbs and make them easy to use from R.
  • Provide blazing fast performance for in-memory data by writing key pieces in C++ (using Rcpp).
  • Use the same interface to work with data no matter where it's stored, whether in a data frame, a data table or a database.

Like rvest, dplyr is not part of base R, so it needs to be installed:

install.packages("dplyr", repos = "http://cran.us.r-project.org")
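
A small illustration of the dplyr verbs on the built-in mtcars data frame (a minimal sketch, not part of the word cloud pipeline):

  library(dplyr)
  mtcars %>%
    filter(cyl == 6) %>%               # keep only 6-cylinder cars
    summarise(avg_mpg = mean(mpg))     # average fuel efficiency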

tm

tm is a framework for text mining applications in R. It has a number of functions that help in text mining.

install.packages("tm", repos = "http://cran.us.r-project.org")

SnowballC

CRAN explains SnowballC as "An R interface to the C 'libstemmer' library that implements Porter's word stemming algorithm for collapsing words to a common root to aid comparison of vocabulary".

install.packages("SnowballC", repos = "http://cran.us.r-project.org")

wordcloud

This is the main package used to create the word cloud. It is explained on CRAN as "Functionality to create pretty word clouds, visualize differences and similarity between documents, and avoid over-plotting in scatter plots with text".

install.packages("wordcloud", repos = "http://cran.us.r-project.org")

RColorBrewer

RColorBrewer is an R package that provides pre-made color palettes, allowing users to create colourful graphs that present data in a clear and distinguishable manner.

install.packages("RColorBrewer", repos = "http://cran.us.r-project.org")

Using the packages to create a wordcloud (step by step guide)

Loading the libraries

With all the packages ready, the next step is to load them one by one so that the functions they contain can be used in your code. The following code loads the libraries/packages mentioned above.

library(tm)
library(SnowballC)
library(wordcloud)
library(RColorBrewer)
library(rvest)
library(dplyr)

Grabbing the text of the blog article

The next step is to grab the URL of the blog that you want to make a word cloud of and save it in a variable. Then save the HTML contents of that page using the read_html function. We will be using the article: https://blog.feedspot.com/australia_blogs/

  web_address <- 'https://blog.feedspot.com/australia_blogs/'
  webpage_code <- read_html(web_address)

Now you have the entire HTML of the page. You need to extract only the blog article and leave the rest behind. The following function takes care of that.

  full_article <- html_nodes(webpage_code,'article')

The above code searches for the article tag and saves its contents to the variable full_article.

The introduction of HTML5 enabled front-end developers to explicitly mark up the article of a blog or news page inside this tag, which improves the article's visibility to search engines. Most modern blogs use this mechanism to hold the text of the blog, and hence we are going to take advantage of it.
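
Not every site marks its content up this way; as a hedged fallback sketch (assuming the text lives in ordinary paragraph tags instead; fallback_article is a name introduced here for illustration), you could select p nodes:

  # Fallback if a page has no article tag: grab paragraph nodes instead
  fallback_article <- html_nodes(webpage_code, 'p')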

The next step is to clean the article that we have inside our variable full_article. To do this, we go through a series of steps mentioned in the code and comments below.

The first transformation step is to strip out any HTML tags present in the article.

  article_text <- html_text(full_article)

This is followed by creating a corpus object from the text. Converting the text to a corpus makes it easy to perform text manipulations.

Converting the text to corpus and using text mining tools on it

  docs <- Corpus(VectorSource(article_text))

Let's see what type of object docs is:

  inspect(docs)

We can see that our docs variable is now a SimpleCorpus data type with one document in its content. We proceed to modify it by first creating a function to replace the text.

  toSpace <- content_transformer(function (x , pattern ) gsub(pattern, " ", x))

The variable toSpace above holds a transformer, created with content_transformer, that wraps a function which takes a pattern and replaces every match with a space " ", i.e. it removes the characters in the corpus that match the given pattern. We will use toSpace, passing our text and the pattern to be removed, as follows.

Removing unnecessary characters

  docs <- tm_map(docs, toSpace, "/")    #replace / with space
  docs <- tm_map(docs, toSpace, "@")    #replace @ with space
  docs <- tm_map(docs, toSpace, "\\|")  #replace | with space

Now we will clean the text: convert it to lower case, remove unnecessary whitespace, and remove common stopwords like 'to' and 'is', as such words are not useful for analysis. Note that stopwords is only supported for a limited number of languages. We shall remove numbers and punctuation as well.
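
To see what will actually be removed, you can peek at the built-in English stopword list (a quick check, shown here for illustration):

  head(stopwords('english'))   # first few entries, e.g. "i", "me", "my", ...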

Cleaning the text

# Convert the text to lower case
docs <- tm_map(docs, content_transformer(tolower))
# Remove numbers
docs <- tm_map(docs, removeNumbers) 
# Remove punctuations 
docs <- tm_map(docs, removePunctuation)
# Eliminate extra white spaces
docs <- tm_map(docs, stripWhitespace)
# Remove common English stopwords
docs <- tm_map(docs, removeWords, stopwords('english'))

After cleaning the text, we also need to stem the words to their roots; for example, stemming will reduce words like juggles and juggled to a common stem (the Porter stemmer produces juggl). Stemming the words makes text analysis more accurate.

  # Text stemming
  docs <- tm_map(docs, stemDocument)
## Warning in tm_map.SimpleCorpus(docs, stemDocument): transformation drops
## documents

This warning is common when tm_map is applied to a SimpleCorpus and can generally be ignored; the text is still stemmed as intended.

Getting the frequencies of the words

At this point, our text is ready for counting. We will now build a term-document matrix, which holds the frequency of each cleaned word. The function TermDocumentMatrix creates a matrix of terms and their frequencies, as displayed below.

dtm <- TermDocumentMatrix(docs)            # term-document matrix of the corpus
m <- as.matrix(dtm)                        # convert to a plain matrix
v <- sort(rowSums(m), decreasing = TRUE)   # total frequency of each term, sorted
d <- data.frame(word = names(v), freq = v) # data frame of words and frequencies
head(d, 10)
##                  word freq
## blog             blog  168
## australia   australia  116
## follow         follow   74
## post             post   72
## twitter       twitter   69
## fan               fan   68
## per               per   68
## monthblog   monthblog   30
## australian australian   29
## travel         travel   29

Generating the wordcloud

Finally, we can generate our word cloud using the code below. The code starts with set.seed(500), which fixes the random number generator so that the same layout is produced on every run. Tip: use the set.seed function when running simulations to ensure all results, figures, etc. are reproducible.

The wordcloud function takes the following parameters:

  1. words: the words to be plotted,
  2. freq: the frequency of each word,
  3. min.freq: the minimum frequency a word must have to be included in the word cloud,
  4. max.words: the maximum number of words to include in the word cloud,
  5. random.order: if set to FALSE, words are plotted in decreasing frequency,
  6. rot.per: the proportion of words rotated by 90 degrees,
  7. colors: the colours of the words, from least to most frequent.
set.seed(500)
wordcloud(words = d$word, freq = d$freq, min.freq = 1,
          max.words=200, random.order=FALSE, rot.per=0.25, 
          colors=brewer.pal(8, "Dark2"))
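
If you want to keep the figure, one way (a sketch using base R's png graphics device; the file name wordcloud.png is just an example) is to wrap the call in a device:

  # Save the word cloud to a PNG file (file name is just an example)
  png("wordcloud.png", width = 800, height = 800)
  wordcloud(words = d$word, freq = d$freq, min.freq = 1,
            max.words = 200, random.order = FALSE, rot.per = 0.25,
            colors = brewer.pal(8, "Dark2"))
  dev.off()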