Word Cloud formation for text analytics using R

A word cloud is a common way to summarise textual data in text analytics: each word is drawn at a size that reflects how often it occurs.

The major steps in making a word cloud are:

1. Creation of a corpus. A corpus is a collection of documents containing natural language text.
2. Cleaning of the corpus by removing punctuation, numbers, unnecessary words, etc. (as required).
3. Calculation of word frequencies.
4. Creation of the word cloud.

In this example, the airports table from the nycflights13 package is used to create a word cloud of airport names. If the text to be analysed is stored in a text file instead, it can be read into R and turned into a corpus with the same Corpus() function.
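As a minimal sketch of the text-file route mentioned above, the lines of a file can be read with readLines() and passed to Corpus() through VectorSource(), so that each line becomes one document. The temporary file and its two sample lines are made up for illustration; in practice txt_path would point to your own file.

```r
library(tm)

# Write two sample lines to a temporary file (a stand-in for a real text file)
txt_path = tempfile(fileext = ".txt")
writeLines(c("John F Kennedy Intl", "La Guardia"), txt_path)

# Read the file; each line becomes one document in the corpus
file_lines = readLines(txt_path)
file_corpus = Corpus(VectorSource(file_lines))

# Number of documents in the corpus
length(file_corpus)
```

The rest of the workflow (cleaning, term frequencies, word cloud) would then be applied to file_corpus in the same way as shown for the airport names.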

Load libraries

library(tidyverse)

## -- Attaching packages ------------------

## v ggplot2 3.1.1     v purrr   0.3.2
## v tibble  2.1.3     v dplyr   0.8.3
## v tidyr   0.8.3     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.4.0

## -- Conflicts -- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library(wordcloud)

## Loading required package: RColorBrewer

library(nycflights13)
library("tm")

## Loading required package: NLP

## 
## Attaching package: 'NLP'

## The following object is masked from 'package:ggplot2':
## 
##     annotate

Read data

dat = nycflights13::airports

Standardise column names (this uses the epitrix package, which must be installed)

names(dat) = epitrix::clean_labels(names(dat))

The text vector used for the word cloud should be of class character

class(dat$name)

## [1] "character"
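If the text column turns out to be a factor rather than character, it can be converted before building the corpus. The sketch below uses a made-up vector, airport_names, purely for illustration.

```r
# Hypothetical example: text that was read in as a factor
airport_names = factor(c("Sample Field Airport", "Example Regional Airport"))
class(airport_names)

# Convert to character before passing it to VectorSource()
airport_names = as.character(airport_names)
class(airport_names)
```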

Creating a corpus.

word_corpus = Corpus(VectorSource(dat$name))

Cleaning the corpus

word_clean_corpus = word_corpus %>% 
  tm_map(removePunctuation) %>% # remove punctuation, if any
  tm_map(removeNumbers) %>% # remove numbers, if any
  tm_map(stripWhitespace) %>% # collapse multiple whitespaces into a single space
  tm_map(tolower) # make all words lowercase

## Warning in tm_map.SimpleCorpus(., removePunctuation): transformation drops
## documents

## Warning in tm_map.SimpleCorpus(., removeNumbers): transformation drops
## documents

## Warning in tm_map.SimpleCorpus(., stripWhitespace): transformation drops
## documents

## Warning in tm_map.SimpleCorpus(., tolower): transformation drops documents

# To remove specific words, if required:
# word_clean_corpus = tm_map(word_clean_corpus, removeWords, c("the", "and", "for", "this", "that", "with", "will", "also", "i'm"))
# Stemming can also be done to focus on base words rather than their variants, if required.
# For example, "airports" and "airport" both reduce to the stem "airport".
# word_clean_corpus = tm_map(word_clean_corpus, stemDocument)
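The word-removal step that is commented out above can be tried on a small example. This is only a sketch: the three documents and the list of words to drop are made up for illustration, and content_transformer() is used to wrap tolower(), which is the wrapper tm provides for applying ordinary string functions to a corpus.

```r
library(tm)

# Toy corpus of three made-up names (illustrative only)
toy_corpus = Corpus(VectorSource(c("The Municipal Airport",
                                   "Airport of the County",
                                   "Regional Airports and Fields")))

# Lowercase first so that "The" and "the" are treated alike
toy_clean = tm_map(toy_corpus, content_transformer(tolower))

# Drop the unwanted words, then tidy up the gaps they leave behind
toy_clean = tm_map(toy_clean, removeWords, c("the", "and", "of"))
toy_clean = tm_map(toy_clean, stripWhitespace)

# Inspect the cleaned text
content(toy_clean)
```

Lowercasing before removeWords matters here because the match is case-sensitive; otherwise "The" would survive the removal.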

Calculating term frequencies

word_count = as.matrix(TermDocumentMatrix(word_clean_corpus))
word_frequencies = sort(rowSums(word_count), decreasing = TRUE)
head(word_frequencies)

##   airport      intl  regional municipal    county     field 
##       629       145       123       117       111        72
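To see what the term-document matrix holds, the same calculation can be run on a toy corpus. In the sketch below (the three documents are made up), each row of the matrix is a term, each column is a document, and rowSums() adds up the counts of a term across all documents.

```r
library(tm)

# Toy corpus: three made-up documents, already lowercase
toy_corpus = Corpus(VectorSource(c("kennedy intl airport",
                                   "laguardia airport",
                                   "newark liberty intl airport")))

# Rows are terms, columns are documents, cells are counts
toy_matrix = as.matrix(TermDocumentMatrix(toy_corpus))
toy_matrix

# Total count of each term across all documents, most frequent first
toy_frequencies = sort(rowSums(toy_matrix), decreasing = TRUE)
toy_frequencies
```

Here "airport" appears once in each of the three documents and "intl" in two of them, so they head the sorted vector, just as "airport" heads the real output above.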

Set seed for reproducibility

set.seed(5197)

Making wordcloud

wordcloud(words = names(word_frequencies), 
          freq = word_frequencies,
          min.freq = 3, # words occurring fewer than 3 times are not plotted
          scale = c(4,.5), # range of word sizes, from most to least frequent
          max.words = 200, # maximum number of words to be displayed
          random.order = TRUE, # plot words in random order
          colors = brewer.pal(12,"Paired")) #to add colors
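To keep the figure, the word cloud can be drawn into a graphics device that writes to a file. The sketch below is one possible way of doing this, with made-up frequencies and a temporary PDF file as assumptions; it also sets random.order = FALSE, which plots words in decreasing order of frequency so that the most frequent words tend to sit near the centre.

```r
library(wordcloud)

# Made-up frequencies, for illustration only
toy_freq = c(airport = 30, intl = 12, regional = 9, municipal = 7, county = 5)

# Open a PDF device; everything plotted until dev.off() goes into this file
out_file = tempfile(fileext = ".pdf")
pdf(out_file, width = 6, height = 6)

set.seed(5197)
wordcloud(words = names(toy_freq),
          freq = toy_freq,
          min.freq = 1, # keep every word in this small example
          scale = c(4, .5),
          random.order = FALSE, # most frequent words are plotted first
          colors = brewer.pal(8, "Dark2"))

# Close the device so the file is written to disk
dev.off()
file.exists(out_file)
```

In practice out_file would be a path of your choosing, and png() can be used in place of pdf() if an image file is preferred.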

Dr Gurpreet Singh, Dr Biju Soman

21/12/2019