This tutorial is adapted from the website: http://www.sthda.com/english/wiki/text-mining-and-word-cloud-fundamentals-in-r-5-simple-steps-you-should-know
In R, word clouds can be generated using the packages below. We first install and load the packages, then read in a text file containing our collection of words and convert it into a corpus, the data structure that the tm package works with.
# Install
# install.packages("tm") # for text mining
# install.packages("SnowballC") # for text stemming
# install.packages("wordcloud") # word-cloud generator
# install.packages("RColorBrewer") # color palettes
# Load
library("tm")
## Warning: package 'tm' was built under R version 3.4.3
## Loading required package: NLP
library("SnowballC")
library("wordcloud")
## Loading required package: RColorBrewer
library("RColorBrewer")
# Read the text file chosen interactively
text <- readLines(file.choose())
# Or read it from a known path; readLines() also accepts a URL
# filePath <- "worddata.txt"
# text <- readLines(filePath)
# Load the data as a corpus
docs <- Corpus(VectorSource(text))
#inspect(docs)
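To see what the corpus step does without needing a file, the sketch below builds a corpus from a small in-memory character vector. The sample sentences and the names `sample_text` and `sample_docs` are made up for illustration; only `Corpus()`, `VectorSource()`, and `content()` come from tm.

```r
library(tm)

# Hypothetical sample lines standing in for the contents of a text file
sample_text <- c("Good morning, love!",
                 "Call me at 9 / 10 am",
                 "haha just going to get it")

# Each element of the character vector becomes one document in the corpus
sample_docs <- Corpus(VectorSource(sample_text))

length(sample_docs)          # number of documents in the corpus
content(sample_docs[[1]])    # text of the first document
```

Because `readLines()` returns one element per line, reading a file this way treats every line of the file as a separate document.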
Next, we remove punctuation, numbers, and unnecessary words (stop words). The cleaned corpus is then converted into a term-document matrix, and the words are sorted by decreasing frequency. The resulting data frame of words and counts is ready to use with word clouds and other visualizations.
# Replace special characters with spaces
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
docs <- tm_map(docs, toSpace, "/")
docs <- tm_map(docs, toSpace, "@")
docs <- tm_map(docs, toSpace, "\\|")
# Convert the text to lower case
docs <- tm_map(docs, content_transformer(tolower))
# Remove numbers
docs <- tm_map(docs, removeNumbers)
# Remove common English stop words
docs <- tm_map(docs, removeWords, stopwords("english"))
# Remove your own stop words,
# specified as a character vector
docs <- tm_map(docs, removeWords, c("blabla1", "blabla2"))
# Remove punctuation
docs <- tm_map(docs, removePunctuation)
# Eliminate extra white spaces
docs <- tm_map(docs, stripWhitespace)
# Text stemming
# docs <- tm_map(docs, stemDocument)
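# Build the term-document matrix (terms in rows, documents in columns)
dtm <- TermDocumentMatrix(docs)
m <- as.matrix(dtm)
# Total count of each term across all documents, most frequent first
v <- sort(rowSums(m), decreasing = TRUE)
d <- data.frame(word = names(v), freq = v)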
head(d, 10)
##            word freq
## good       good  914
## love       love  761
## haha       haha  624
## just       just  590
## call       call  478
## morning morning  476
## get         get  465
## like       like  390
## day         day  383
## going     going  367
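The cleaning and counting steps above can be approximated in base R, which may help clarify what each tm transformation contributes. This is a rough sketch under stated assumptions, not what tm does internally: the input lines, the three-word stop list, and the names `raw_lines`, `stop_words`, and `toy_d` are all hypothetical.

```r
# Hypothetical input lines and a tiny hand-picked stop word list
raw_lines <- c("Good morning! Call me at 9.",
               "Love the morning, love the day")
stop_words <- c("the", "me", "at")

clean <- tolower(raw_lines)                   # convert to lower case
clean <- gsub("[[:digit:]]+", " ", clean)     # replace numbers with spaces
clean <- gsub("[[:punct:]]+", " ", clean)     # replace punctuation with spaces

# Split on whitespace, then discard empty strings and stop words
tokens <- unlist(strsplit(clean, "[[:space:]]+"))
tokens <- tokens[nzchar(tokens) & !(tokens %in% stop_words)]

# Count each word and sort by decreasing frequency
counts <- sort(table(tokens), decreasing = TRUE)
toy_d <- data.frame(word = names(counts), freq = as.vector(counts))
toy_d
```

Results can differ from the tm pipeline in small ways. For example, `TermDocumentMatrix()` by default ignores words shorter than three characters, which this sketch does not do.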
The word cloud is now ready to be generated with the wordcloud() function. By default the function places words in random order; here we set random.order = FALSE so that words are plotted in order of decreasing frequency. Calling set.seed() first makes the layout reproducible.
set.seed(1234)
wordcloud(words = d$word, freq = d$freq, min.freq = 1,
          max.words = 200, random.order = FALSE, rot.per = 0.35,
          colors = brewer.pal(8, "Dark2"))
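To keep the plot rather than only view it, the same call can be wrapped in a graphics device. The sketch below writes a word cloud to a PNG file, using the five most frequent words from the table above as a stand-in data frame. The output path, the image size, and the `min.freq` value of 500 are illustrative assumptions.

```r
library(wordcloud)
library(RColorBrewer)

# Stand-in frequency table in the same shape as d
toy_d <- data.frame(word = c("good", "love", "haha", "just", "call"),
                    freq = c(914, 761, 624, 590, 478))

# Assumed output location; change to any writable path
out_file <- file.path(tempdir(), "wordcloud.png")

png(out_file, width = 800, height = 800)
set.seed(1234)  # fixes the layout so the plot is reproducible
wordcloud(words = toy_d$word, freq = toy_d$freq,
          min.freq = 500,        # words below this frequency are not plotted
          max.words = 200,       # upper limit on the number of words drawn
          random.order = FALSE,  # plot words in decreasing frequency
          rot.per = 0.35,        # proportion of words rotated 90 degrees
          colors = brewer.pal(8, "Dark2"))
dev.off()
```

With `min.freq = 500`, the word "call" (478 occurrences) is left out of the plot, which shows how that argument trims rare words.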
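Finally, we look for terms associated with a given word using findAssocs(), then display the ten most frequent terms and plot them as a bar chart. For the word "freedom", no associated terms are found at a correlation limit of 0.3, so the result is empty (numeric(0)).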
findAssocs(dtm, terms = "freedom", corlimit = 0.3)
## $freedom
## numeric(0)
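As a rough illustration of what an association score means, the sketch below computes the Pearson correlation between one term's counts and every other term's counts across documents, then keeps those at or above a threshold. It assumes this is the measure findAssocs() reports; the matrix `toy_m` and its terms are invented for the example and are not from the data set.

```r
# Hypothetical term-document matrix: rows are terms, columns are documents
toy_m <- matrix(c(1, 0, 2, 0,
                  1, 0, 1, 0,
                  0, 3, 0, 1),
                nrow = 3, byrow = TRUE,
                dimnames = list(c("freedom", "liberty", "coffee"),
                                paste0("doc", 1:4)))

# Correlation of the target term with every other term across documents
target <- "freedom"
others <- setdiff(rownames(toy_m), target)
assoc <- sapply(others, function(term) cor(toy_m[target, ], toy_m[term, ]))

# Keep only associations at or above the threshold, highest first
corlimit <- 0.3
strong <- sort(assoc[assoc >= corlimit], decreasing = TRUE)
strong
```

In this toy matrix only "liberty" passes the threshold, because it tends to appear in the same documents as "freedom", while "coffee" appears in the others and is negatively correlated. When every line of a file is its own short document, as in this tutorial, correlations are often low, which is one plausible reason for an empty result.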
head(d, 10)
##            word freq
## good       good  914
## love       love  761
## haha       haha  624
## just       just  590
## call       call  478
## morning morning  476
## get         get  465
## like       like  390
## day         day  383
## going     going  367
barplot(d[1:10, ]$freq, las = 2, names.arg = d[1:10, ]$word,
        col = "lightblue", main = "Most frequent words",
        ylab = "Word frequencies")
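Long words can be hard to read as rotated axis labels, so a horizontal layout is sometimes easier. The sketch below is one possible variation, using a stand-in data frame built from the top five rows of the table above; the margin setting and the names `toy_d`, `top`, and `mids` are illustrative choices.

```r
# Stand-in frequency table in the same shape as d
toy_d <- data.frame(word = c("good", "love", "haha", "just", "call"),
                    freq = c(914, 761, 624, 590, 478))

# head() returns all rows when fewer than 10 exist, so this is safe on small data
top <- head(toy_d, 10)

# Widen the left margin so the word labels fit, and restore it afterwards
op <- par(mar = c(5, 7, 4, 2))
# Reverse the order so the most frequent word appears at the top
mids <- barplot(rev(top$freq), names.arg = rev(top$word),
                horiz = TRUE, las = 1, col = "lightblue",
                main = "Most frequent words", xlab = "Word frequencies")
par(op)
```

`barplot()` returns the bar midpoints, stored here in `mids`, which can be passed to `text()` to print each count next to its bar.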