Word Clouds

I'm always looking for ways to visualise data, so in the spirit of fun I thought I would share my first vignette on creating word clouds in R. It looks at how often words are used in the Subject Outline for STDS, and whether any inference can be made from their frequency.

There are actually two word cloud packages for R: “wordcloud” by Ian Fellows and “wordcloud2” by Dawei Lang.

To get started, make sure you have the following packages installed from CRAN: “NLP”, “tm” and “SnowballC” for the text mining and cleaning, and “RColorBrewer”, “wordcloud” and “wordcloud2” for making the word clouds. For the text, I went to https://online.uts.edu.au/webapps/login/, copied the Subject Outline for STDS into a text document and saved it as a temporary file in its own folder in my working directory. Now we are ready to let R do its magic.
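
If you are not sure which of these packages you already have, a minimal sketch like the one below may help. The helper name missing_packages is my own invention for illustration, and the install step is deliberately guarded so it only runs in an interactive session.

```r
# Packages used in this post
pkgs <- c("NLP", "tm", "SnowballC", "RColorBrewer", "wordcloud", "wordcloud2")

# Hypothetical helper: returns the packages from `pkgs` that are not installed yet
missing_packages <- function(pkgs) {
  setdiff(pkgs, rownames(installed.packages()))
}

to_install <- missing_packages(pkgs)

# Only install when running interactively, so a script never installs silently
if (interactive() && length(to_install) > 0) {
  install.packages(to_install)
}
```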

library(NLP)
library(tm)
library(SnowballC)
library(RColorBrewer)
library(wordcloud)
library(wordcloud2)
# Create a corpus from the documents in a folder
# Of course, you will need to insert your own folder path in the command below
# Make sure the folder contains only the documents you want included; in this case there is only one
Docs <- Corpus(DirSource("C:/Users/jmcintosh/Documents/textmining1"))

So this first chunk of code loaded our libraries and the document into R. Using inspect(Docs) will provide an overview of the corpus. Now we need to clean the documents. The “tm” and “SnowballC” packages have everything you need to do this.
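
If you do not have a file handy and just want to follow along, a small corpus can also be built in memory. This is only a sketch: the two sentences are made up and stand in for the Subject Outline, and the object name demo is my own.

```r
library(tm)

# Made-up sample text, standing in for the Subject Outline
sample_text <- c("Data science is fun.",
                 "Students will assess data in 2 projects!")

# VectorSource treats each element of the character vector as one document
demo <- VCorpus(VectorSource(sample_text))

length(demo)        # number of documents in the corpus
inspect(demo)       # overview of the corpus and its metadata
content(demo[[1]])  # raw text of the first document
```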

# Remove punctuation
Docs <- tm_map(Docs, removePunctuation)
# Change to lower case
Docs <- tm_map(Docs, content_transformer(tolower))
# Remove numbers
Docs <- tm_map(Docs, removeNumbers)
# Remove extra white space
Docs <- tm_map(Docs, stripWhitespace)
# Stem the document
Docs <- tm_map(Docs, stemDocument)
# Remove stop words
Docs <- tm_map(Docs, removeWords, stopwords("english"))
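
One thing that may be worth checking is the order of the last two steps. Because the code above stems before it removes stop words, a stop word whose stem differs from the original (for example “very”, which I believe stems to “veri”) is no longer matched by the stop word list. The sketch below compares both orders on a made-up sentence; it is an illustration of the effect, not a claim about which order is right for your text.

```r
library(tm)
library(SnowballC)

# Made-up sample sentence, not taken from the Subject Outline
demo <- VCorpus(VectorSource("students only work because data is very important"))

# Order used in this post: stem first, then remove stop words
stem_first <- tm_map(demo, stemDocument)
stem_first <- tm_map(stem_first, removeWords, stopwords("english"))

# Alternative order: remove stop words first, then stem
stop_first <- tm_map(demo, removeWords, stopwords("english"))
stop_first <- tm_map(stop_first, stemDocument)

# Compare the two results side by side
content(stem_first[[1]])
content(stop_first[[1]])
```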

Great. If you are impatient like me, you just want to get into the fun of creating a word cloud. If not, you can persist with further cleaning: fix up words using content_transformer, and map any “other” characters that were missed onto a space with a “toSpace” helper so they are removed from the data. Remember, you can always go to the help pages to check all the parameters and arguments for these tools.
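
For the “toSpace” idea, here is a minimal sketch. Note that toSpace is a helper you define yourself, not a function that ships with tm, and the sample text is made up. It is useful for characters such as “/” or “@”, because removePunctuation deletes them outright and would glue the neighbouring words together.

```r
library(tm)

# Made-up sample text with characters we would rather turn into spaces
demo <- VCorpus(VectorSource("data/science and e-mail@uts"))

# toSpace is a user-defined helper (not part of tm): it replaces a pattern with a space
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))

# Extra arguments to tm_map are passed on to the helper as the pattern
demo <- tm_map(demo, toSpace, "/")
demo <- tm_map(demo, toSpace, "@")

content(demo[[1]])
```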

Below is the code I used to create my first word cloud; again, check the function's help page for all the arguments. Depending on the scale, you may get warnings if you do not limit the maximum number of words, as R will advise that some words could not be fitted on the page. I have also not made this a data.frame (yet!), so wordcloud takes the corpus as it is and works out the word frequencies itself, without me setting them.

wordcloud(Docs, scale = c(5, 0.5), max.words = 70, random.order = FALSE, rot.per = 0.35,
          use.r.layout = FALSE, colors = brewer.pal(6, "Dark2"))
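
If you would rather set the frequencies yourself, wordcloud also accepts separate words and freq vectors. The sketch below uses the six most frequent words and their counts from the head(Docs2) output shown later in this post. The seed value is an arbitrary choice of mine; it just makes the layout repeatable. Setting min.freq explicitly also matters, because its default of 3 silently drops rarer words.

```r
library(wordcloud)
library(RColorBrewer)

# Top six words and counts, copied from the frequency table later in this post
words <- c("data", "will", "assess", "student", "work", "project")
freq  <- c(51, 41, 35, 29, 24, 23)

# Arbitrary seed so the word placement is the same on every run
set.seed(42)

# A smaller scale than in the post leaves more room, which helps avoid the
# "could not be fit on page" warnings
wordcloud(words = words, freq = freq, min.freq = 1, scale = c(3, 0.5),
          random.order = FALSE, rot.per = 0.35,
          colors = brewer.pal(6, "Dark2"))
```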

It looks OK, and you can certainly get an insight into the words that occur the most, but…

Word Cloud 2

As noted above, I did not set my frequencies, so a few more steps are needed to sort the terms and convert them to a data.frame to really get the benefits of “wordcloud2”. A quick look at the data.frame using head(Docs2) gives me the top words used and their frequencies.

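# Build a term-document matrix and total each term's count across the corpus
tdm <- TermDocumentMatrix(Docs)
m <- as.matrix(tdm)
v <- sort(rowSums(m), decreasing = TRUE)
# wordcloud2 expects a data.frame with a word column and a frequency column
Docs2 <- data.frame(word = names(v), freq = v)
head(Docs2)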
##            word freq
## data       data   51
## will       will   41
## assess   assess   35
## student student   29
## work       work   24
## project project   23
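
To see what these steps do without the full Subject Outline, here is the same idea on a tiny made-up corpus. The object names are my own, and findFreqTerms (from tm) is included as one possible way to list only the terms that reach a minimum count.

```r
library(tm)

# Two made-up documents, small enough to count by hand
demo <- VCorpus(VectorSource(c("data data project", "data student project")))

# Rows are terms, columns are documents
tdm <- TermDocumentMatrix(demo)

# Total each term across all documents and sort from most to least frequent
freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
demo_df <- data.frame(word = names(freq), freq = freq, row.names = NULL)
demo_df

# Terms that occur at least twice across the corpus
findFreqTerms(tdm, lowfreq = 2)
```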

Yep, no surprises here: “data” was the most frequently occurring word (51 times). This is the same corpus, but now that the frequencies are included in the data.frame we have an interactive word cloud. Hover your mouse over the words and it will highlight them and show the word count. How cool is that?

wordcloud2(Docs2, color = "random-dark", size = 0.8, shape = "circle", backgroundColor = "white")
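
Since the result is an HTML widget, one way to share it is to save it as a web page. This is only a sketch, assuming the htmlwidgets package is available (wordcloud2 is built on it). The data.frame repeats the six rows printed above, and the output path is a placeholder of mine.

```r
library(wordcloud2)
library(htmlwidgets)

# Top six rows of the frequency table printed above
top_words <- data.frame(
  word = c("data", "will", "assess", "student", "work", "project"),
  freq = c(51, 41, 35, 29, 24, 23)
)

cloud <- wordcloud2(top_words, color = "random-dark", size = 0.8,
                    shape = "circle", backgroundColor = "white")

# Placeholder output location; selfcontained = FALSE avoids needing pandoc
out_file <- file.path(tempdir(), "wordcloud.html")
saveWidget(cloud, out_file, selfcontained = FALSE)
```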

There are lots of other great features in this package, and I encourage you to play with them. I hope you found this fun, informative and interesting, and I look forward to your comments.