Step 1: Setting Up the R Environment and Importing the Raw Data

This tutorial is adapted from the website: http://www.sthda.com/english/wiki/text-mining-and-word-cloud-fundamentals-in-r-5-simple-steps-you-should-know

In R, word clouds can be generated using the following packages. Here we install and load the packages, then read in a text file containing our collection of words. The text is then transformed into a corpus data structure.

# Install (uncomment on first use)
# install.packages("tm")           # for text mining
# install.packages("SnowballC")    # for text stemming
# install.packages("wordcloud")    # word-cloud generator
# install.packages("RColorBrewer") # color palettes
# Load
library("tm")
## Warning: package 'tm' was built under R version 3.4.3
## Loading required package: NLP
library("SnowballC")
library("wordcloud")
## Loading required package: RColorBrewer
library("RColorBrewer")

# Read a local text file chosen interactively
text <- readLines(file.choose())

# Or read the text file from a known path or URL
# filePath <- "worddata.txt"
# text <- readLines(filePath)

# Load the data as a corpus
docs <- Corpus(VectorSource(text))

#inspect(docs)
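Note that VectorSource() treats each element of the character vector as a separate document, so every line returned by readLines() becomes its own document in the corpus. A quick toy example (the three strings here are made up for illustration) confirms this:

# Toy corpus: each string becomes one document
toy <- Corpus(VectorSource(c("good morning", "good day", "love this day")))
length(toy)  # 3, one document per string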

Step 2: Cleaning and Structuring the Raw Data

Next, we remove punctuation, numbers, and unnecessary words. The cleaned corpus is then transformed into a term-document matrix, and the words are sorted by frequency in decreasing order. The resulting data frame of words and frequencies is ready to use with word clouds and other visualizations.

# Replace special characters with spaces
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
docs <- tm_map(docs, toSpace, "/")
docs <- tm_map(docs, toSpace, "@")
docs <- tm_map(docs, toSpace, "\\|")

# Convert the text to lower case
docs <- tm_map(docs, content_transformer(tolower))
# Remove numbers
docs <- tm_map(docs, removeNumbers)
# Remove common English stopwords
docs <- tm_map(docs, removeWords, stopwords("english"))
# Remove your own stop words, specified as a character vector
# ("blabla1" and "blabla2" are placeholders)
docs <- tm_map(docs, removeWords, c("blabla1", "blabla2"))
# Remove punctuation
docs <- tm_map(docs, removePunctuation)
# Eliminate extra white space
docs <- tm_map(docs, stripWhitespace)
# Text stemming (optional; see the note below)
# docs <- tm_map(docs, stemDocument)
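Stemming is left commented out above because Porter-style stemming can truncate words in ways that look odd in a word cloud. You can preview the effect on a few sample words with SnowballC's wordStem() function (a minimal sketch; the words are arbitrary):

# Preview stemming: "going" -> "go", "morning" -> "morn", "loved" -> "love"
wordStem(c("going", "morning", "loved"), language = "english")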

# Build a term-document matrix and compute word frequencies
dtm <- TermDocumentMatrix(docs)
m <- as.matrix(dtm)
v <- sort(rowSums(m), decreasing = TRUE)  # total count of each word, descending
d <- data.frame(word = names(v), freq = v)
head(d, 10)
##            word freq
## good       good  914
## love       love  761
## haha       haha  624
## just       just  590
## call       call  478
## morning morning  476
## get         get  465
## like       like  390
## day         day  383
## going     going  367
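If you prefer a threshold-based view instead of the top ten, tm's findFreqTerms() lists every term at or above a minimum frequency (a sketch; the cutoff of 100 is arbitrary for this corpus):

# All terms appearing at least 100 times (cutoff is arbitrary)
findFreqTerms(dtm, lowfreq = 100)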

Step 3: Generating the Word Cloud

The word cloud can now be generated with the wordcloud() function. By default the function places words in random order; setting random.order = FALSE plots them in decreasing frequency, so the most frequent words land near the center.

set.seed(1234)  # fix the random seed for a reproducible layout
wordcloud(words = d$word, freq = d$freq, min.freq = 1,
          max.words = 200, random.order = FALSE, rot.per = 0.35,
          colors = brewer.pal(8, "Dark2"))
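The cloud is drawn in the active graphics device. To write it to an image file instead, wrap the call in a device such as png() (a minimal sketch; the filename and dimensions are arbitrary):

# Save the word cloud to a PNG file (filename and size are arbitrary)
png("wordcloud.png", width = 800, height = 800)
wordcloud(words = d$word, freq = d$freq, min.freq = 1,
          max.words = 200, random.order = FALSE, rot.per = 0.35,
          colors = brewer.pal(8, "Dark2"))
dev.off()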

Step 4: Counting and Displaying Term Frequencies

Finally, we look for terms associated with a word of interest using findAssocs(), re-display the ten most frequent words, and draw a bar plot of their frequencies.

# Terms correlated with "freedom" at correlation >= 0.3
findAssocs(dtm, terms = "freedom", corlimit = 0.3)
## $freedom
## numeric(0)
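The empty result (numeric(0)) simply means no term is correlated with "freedom" at or above the 0.3 threshold in this corpus. Querying a word that actually occurs often, such as "good" from the table above, is more likely to return associations (a sketch; the output depends entirely on your text):

# Associations depend on the corpus; "good" is the most frequent term here
findAssocs(dtm, terms = "good", corlimit = 0.3)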
head(d, 10)
##            word freq
## good       good  914
## love       love  761
## haha       haha  624
## just       just  590
## call       call  478
## morning morning  476
## get         get  465
## like       like  390
## day         day  383
## going     going  367
# Bar plot of the ten most frequent words
barplot(d[1:10,]$freq, las = 2, names.arg = d[1:10,]$word,
        col = "lightblue", main = "Most frequent words",
        ylab = "Word frequencies")