Wordcloud Diarium D9

Text Mining and Word Cloud of Diarium Survey

Text mining methods allow us to highlight the most frequently used keywords in a body of text. One can create a word cloud, also referred to as a text cloud or tag cloud, which is a visual representation of text data.

3 reasons you should use word clouds to present your text data

Word clouds add simplicity and clarity. The most used keywords stand out better in a word cloud.
Word clouds are a potent communication tool. They are easy to understand, easy to share, and impactful.
Word clouds are more visually engaging than a table of data.

Load Library

library("tm")

## Warning: package 'tm' was built under R version 3.5.2

## Loading required package: NLP

## Warning: package 'NLP' was built under R version 3.5.2

library("SnowballC")

## Warning: package 'SnowballC' was built under R version 3.5.2

library("wordcloud")

## Warning: package 'wordcloud' was built under R version 3.5.2

## Loading required package: RColorBrewer

library("RColorBrewer")
library("plyr")
library("class")

Load the data as a corpus

text <- readLines("D:/Hasil Diarium UX Survey text D-9.txt")
# Load the data as a corpus
docs <- Corpus(VectorSource(text))
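To see what the corpus step does without the survey file, here is a minimal sketch that builds a corpus from an in-memory character vector. The three responses and the names `responses` and `toy_docs` are made up for illustration and are not taken from the survey.

```r
library(tm)

# Hypothetical responses standing in for the lines returned by readLines()
responses <- c("Aplikasi sering lambat saat absensi",
               "Tampilan menu lebih mudah",
               "Update fitur notifikasi")

# Each element of the character vector becomes one document in the corpus
toy_docs <- Corpus(VectorSource(responses))

# Number of documents in the corpus
length(toy_docs)

# Text of the first document
as.character(toy_docs[[1]])
```

Because readLines() returns one element per line, the survey corpus should contain one document per line of the input file.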

Text transformation

Transformations are applied with the tm_map() function, for example to replace special characters in the text.

Replacing “/”, “@” and “|” with space:

toSpace <- content_transformer(function (x , pattern ) gsub(pattern, " ", x))
docs <- tm_map(docs, toSpace, "/")

## Warning in tm_map.SimpleCorpus(docs, toSpace, "/"): transformation drops
## documents

docs <- tm_map(docs, toSpace, "@")

## Warning in tm_map.SimpleCorpus(docs, toSpace, "@"): transformation drops
## documents

docs <- tm_map(docs, toSpace, "\\|")

## Warning in tm_map.SimpleCorpus(docs, toSpace, "\\|"): transformation drops
## documents
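The toSpace helper is a thin wrapper around gsub(). The sketch below applies the same replacements to a plain string in base R, which makes the effect easy to inspect; the sample string and the name `to_space` are illustrative assumptions.

```r
# Same replacement logic as toSpace, without the tm wrapper
to_space <- function(x, pattern) gsub(pattern, " ", x)

# Hypothetical input containing the three special characters
sample_text <- "login/logout error @home menu|profil"

cleaned <- to_space(sample_text, "/")
cleaned <- to_space(cleaned, "@")
# "|" means alternation in a regular expression, so it must be escaped
cleaned <- to_space(cleaned, "\\|")
cleaned

# The three calls can also be collapsed into one character class
cleaned_once <- gsub("[/@|]", " ", sample_text)
identical(cleaned, cleaned_once)
```

Each special character is replaced by a single space, so the string keeps its length; any doubled spaces this produces are removed later by stripWhitespace.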

Cleaning the text

The tm_map() function is used to remove unnecessary white space, convert the text to lower case, and remove common stopwords such as “the” and “we”.

You can also remove numbers and punctuation with the removeNumbers and removePunctuation transformations.

Another important preprocessing step is text stemming, which reduces words to their root form. In other words, this process removes suffixes from words to simplify them and recover their common origin. For example, stemming reduces the words “moving” and “moved” to the root word “move”.

# Convert the text to lower case
docs <- tm_map(docs, content_transformer(tolower))

## Warning in tm_map.SimpleCorpus(docs, content_transformer(tolower)):
## transformation drops documents

# Remove numbers
docs <- tm_map(docs, removeNumbers)

## Warning in tm_map.SimpleCorpus(docs, removeNumbers): transformation drops
## documents

# Remove common English stopwords
docs <- tm_map(docs, removeWords, stopwords("english"))

## Warning in tm_map.SimpleCorpus(docs, removeWords, stopwords("english")):
## transformation drops documents

# Remove your own stop words
# Specify your stopwords as a character vector
docs <- tm_map(docs, removeWords, c("blabla1", "blabla2", "tidak","ada","dan","saja"))

## Warning in tm_map.SimpleCorpus(docs, removeWords, c("blabla1", "blabla2", :
## transformation drops documents

# Remove punctuation
docs <- tm_map(docs, removePunctuation)

## Warning in tm_map.SimpleCorpus(docs, removePunctuation): transformation
## drops documents

# Eliminate extra white spaces
docs <- tm_map(docs, stripWhitespace)

## Warning in tm_map.SimpleCorpus(docs, stripWhitespace): transformation drops
## documents

# Text stemming
# docs <- tm_map(docs, stemDocument)
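Stemming is left commented out above. As a minimal sketch of what it would do, the snippet below calls wordStem() from SnowballC directly on a few English words; the word list is an illustrative assumption.

```r
library(SnowballC)

# Hypothetical English words in different inflected forms
words <- c("moving", "moved", "updates", "updated")

# Reduce each word to its stem with the English Snowball stemmer
stems <- wordStem(words, language = "english")

# Show each word next to its stem
data.frame(word = words, stem = stems)
```

An English stemmer is unlikely to help much with survey responses that are mostly Indonesian, so leaving stemDocument disabled for this data set is a reasonable choice.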

Build a term-document matrix

A term-document matrix is a table containing the frequency of the words. Row names are words (terms) and column names are documents, so summing each row gives the total frequency of a word. The function TermDocumentMatrix() from the tm text mining package can be used as follows:

dtm <- TermDocumentMatrix(docs)
m <- as.matrix(dtm)
v <- sort(rowSums(m),decreasing=TRUE)
d <- data.frame(word = names(v),freq=v)
head(d, 10)

##              word freq
## diarium   diarium  205
## lebih       lebih  199
## aplikasi aplikasi  183
## yang         yang  123
## bisa         bisa  119
## untuk       untuk  107
## update     update   99
## fitur       fitur   98
## dengan     dengan   98
## belum       belum   81
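The layout of the matrix is easier to see on a tiny example. The sketch below builds a term-document matrix from two made-up documents (the texts and the `toy_` names are assumptions for illustration) and computes word totals in the same way as above.

```r
library(tm)

# Two hypothetical documents
toy <- Corpus(VectorSource(c("fitur update aplikasi",
                             "update aplikasi lebih mudah")))

toy_tdm <- TermDocumentMatrix(toy)
toy_m <- as.matrix(toy_tdm)

# Rows are terms, columns are documents
dim(toy_m)
rownames(toy_m)

# Summing across each row gives the total count of each term
toy_freq <- sort(rowSums(toy_m), decreasing = TRUE)
toy_freq
```

Because terms are in the rows, rowSums() is the right aggregation here; with a DocumentTermMatrix, where the layout is transposed, colSums() would be used instead.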

Generate the Word cloud

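The importance of words can be illustrated as a word cloud as follows:

# Fix the random seed so the layout is reproducible
set.seed(1234)
# min.freq drops rare words; random.order = FALSE plots words in decreasing
# frequency; rot.per is the proportion of words rotated by 90 degrees
wordcloud(words = d$word, freq = d$freq, min.freq = 10,
          max.words = 1000, random.order = FALSE, rot.per = 0.35,
          colors = brewer.pal(8, "Dark2"))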

Go further

Explore frequent terms and their associations

You can look at the frequent terms in the term-document matrix as follows. In the example below we want to find words that occur at least 50 times:

findFreqTerms(dtm, lowfreq = 50)

##  [1] "fitur"    "untuk"    "aplikasi" "dengan"   "diarium"  "sudah"   
##  [7] "update"   "belum"    "agar"     "lebih"    "bisa"     "yang"    
## [13] "mudah"

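You can analyze the association between frequent terms (i.e., terms which correlate) using the findAssocs() function. The R code below identifies which words are associated with “ada” and “update” in the text data. The query for “ada” returns no associations because that word was removed earlier as a custom stop word: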

findAssocs(dtm, terms = "ada", corlimit = 0.3)

## $ada
## numeric(0)

findAssocs(dtm, terms = "update", corlimit = 0.3)

## $update
##    versi download 
##     0.31     0.31
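The association scores can be read as correlations between the rows of the term-document matrix. The base R sketch below reproduces that idea by hand on a toy matrix; the matrix values and the names `toy_m` and `assoc` are made up, and this is meant as an illustration of the concept rather than a copy of the tm implementation.

```r
# Toy term-document matrix: rows are terms, columns are documents
toy_m <- matrix(c(1, 0, 1, 1, 0,
                  1, 0, 1, 0, 0,
                  0, 1, 0, 0, 1),
                nrow = 3, byrow = TRUE,
                dimnames = list(c("update", "versi", "menu"),
                                paste0("doc", 1:5)))

# Pearson correlation of "update" with every term across documents
assoc <- cor(t(toy_m))["update", ]

# Drop the term itself and keep correlations at or above the limit
assoc <- assoc[names(assoc) != "update"]
result <- round(assoc[assoc >= 0.3], 2)
result
```

In this toy data "versi" tends to appear in the same documents as "update", so it passes the 0.3 threshold, while "menu" never co-occurs with "update" and is dropped.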

The frequency table of words

head(d, 20)

##              word freq
## diarium   diarium  205
## lebih       lebih  199
## aplikasi aplikasi  183
## yang         yang  123
## bisa         bisa  119
## untuk       untuk  107
## update     update   99
## fitur       fitur   98
## dengan     dengan   98
## belum       belum   81
## agar         agar   74
## sudah       sudah   64
## mudah       mudah   64
## tampilan tampilan   48
## menu         menu   47
## saat         saat   44
## semua       semua   44
## ini           ini   43
## sering     sering   43
## absensi   absensi   43
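The table still contains Indonesian function words such as "yang", "untuk" and "dengan", because only the English stopword list and a few custom words were removed. The sketch below shows one way to filter such words from a frequency table; the choice of extra stopwords is an assumption, and the data frame simply copies the first ten rows printed above.

```r
# First ten rows of the frequency table shown above
d_top <- data.frame(
  word = c("diarium", "lebih", "aplikasi", "yang", "bisa",
           "untuk", "update", "fitur", "dengan", "belum"),
  freq = c(205, 199, 183, 123, 119, 107, 99, 98, 98, 81)
)

# Assumed additional stopwords; adjust the list to the analysis at hand
extra_stopwords <- c("yang", "untuk", "dengan")

# Keep only the rows whose word is not in the extra stopword list
d_filtered <- d_top[!d_top$word %in% extra_stopwords, ]
d_filtered
```

In the full pipeline the same words could instead be added to the custom vector passed to removeWords, so they are dropped before the term-document matrix is built.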

Plot word frequencies

The frequencies of the 20 most frequent words are plotted:

barplot(d[1:20,]$freq, las = 2, names.arg = d[1:20,]$word,
        col ="lightblue", main ="Most frequent words",
        ylab = "Word frequencies")