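Word clouds are a common text-analytics technique for summarising textual data: the more often a word occurs, the larger it is drawn.
To make a word cloud, the major steps are: load the libraries, read the data, create a corpus, clean the corpus, calculate term frequencies, and draw the word cloud.
This example uses the nycflights13 dataset to create a word cloud of airport names. If the text to analyse is stored in a text file instead, it can be read in and turned into a corpus with the Corpus() function in the same way.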
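As a minimal sketch of the text-file route, the snippet below reads a file line by line and builds a corpus from it. The file and its contents are made up for illustration (a temporary file stands in for a real one), and treating each line as one document is an assumption, not something the original analysis does.

```r
library(tm)

# Hypothetical input: write a few lines to a temporary file so the example is self-contained
txt_path <- tempfile(fileext = ".txt")
writeLines(c("John F Kennedy Intl", "La Guardia", "Newark Liberty Intl"), txt_path)

# Read the file; each line becomes one element of a character vector
txt_lines <- readLines(txt_path)

# Build the corpus: each line is treated as one document
txt_corpus <- Corpus(VectorSource(txt_lines))

length(txt_corpus)  # number of documents in the corpus
```

From here on, txt_corpus could be cleaned and counted in the same way as the airport names.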
Load library
library(tidyverse)
## -- Attaching packages ------------------
## v ggplot2 3.1.1 v purrr 0.3.2
## v tibble 2.1.3 v dplyr 0.8.3
## v tidyr 0.8.3 v stringr 1.4.0
## v readr 1.3.1 v forcats 0.4.0
## -- Conflicts -- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(wordcloud)
## Loading required package: RColorBrewer
library(nycflights13)
library("tm")
## Loading required package: NLP
##
## Attaching package: 'NLP'
## The following object is masked from 'package:ggplot2':
##
## annotate
Read data
dat = nycflights13::airports
Standardise labels
names(dat) = epitrix::clean_labels(names(dat))
The text vector used to build the word cloud must be of class character.
class(dat$name)
## [1] "character"
Creating a corpus
word_corpus = Corpus(VectorSource(dat$name))
Cleaning the corpus
word_clean_corpus = word_corpus %>%
  tm_map(removePunctuation) %>% # remove punctuation, if any
  tm_map(removeNumbers) %>%     # remove numbers, if any
  tm_map(stripWhitespace) %>%   # collapse runs of whitespace into a single space
  tm_map(tolower)               # convert all words to lowercase (for a VCorpus, use content_transformer(tolower))
## Warning in tm_map.SimpleCorpus(., removePunctuation): transformation drops
## documents
## Warning in tm_map.SimpleCorpus(., removeNumbers): transformation drops
## documents
## Warning in tm_map.SimpleCorpus(., stripWhitespace): transformation drops
## documents
## Warning in tm_map.SimpleCorpus(., tolower): transformation drops documents
# To remove specific words, if required:
# tm_map(word_clean_corpus, removeWords, c("the", "and", "for", "this", "that", "with", "will", "also", "i'm"))
# Stemming can also be done, if required, to focus on base words rather than their variants.
# For example, "airport" and "airports" share the stem "airport". This can be done using:
# tm_map(word_clean_corpus, stemDocument)
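The two optional steps mentioned in the comments above, removing specific words and stemming, can be sketched on a toy corpus as follows. The two example sentences and the names demo_corpus and demo_clean are assumptions for illustration only. stemDocument() relies on the SnowballC package, so the sketch stems only when that package is installed.

```r
library(tm)

# Toy documents (assumed for illustration, not part of the airports data)
demo_corpus <- Corpus(VectorSource(c("the airports and the airport",
                                     "flights for the flight")))

# Drop the listed words, then tidy up the gaps they leave behind
demo_clean <- tm_map(demo_corpus, removeWords, c("the", "and", "for"))
demo_clean <- tm_map(demo_clean, stripWhitespace)

# stemDocument() needs SnowballC, so only stem if it is available
if (requireNamespace("SnowballC", quietly = TRUE)) {
  demo_clean <- tm_map(demo_clean, stemDocument)
}

content(demo_clean)  # inspect the cleaned text
```

Word removal is case-sensitive, so it is usually applied after the text has been converted to lowercase.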
Calculating term frequencies
word_count = as.matrix(TermDocumentMatrix(word_clean_corpus))
word_frequencies = sort(rowSums(word_count), decreasing = TRUE)
head(word_frequencies)
##   airport      intl  regional municipal    county     field
##       629       145       123       117       111        72
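To make the counting step less of a black box, here is a rough base-R equivalent on a toy vector: split the text into words, tabulate them, and sort by count. The vector toy_names is made up for illustration. This is only an approximation of what TermDocumentMatrix() does; by default tm also drops terms shorter than three characters, which this sketch does not.

```r
# Toy input standing in for the cleaned airport names (assumed for illustration)
toy_names <- c("lansdowne airport", "jekyll island airport", "county airport")

# Split each name on whitespace and flatten into a single vector of words
toy_words <- unlist(strsplit(toy_names, " +"))

# Count how often each word occurs and sort from most to least frequent
toy_freq <- sort(table(toy_words), decreasing = TRUE)

head(toy_freq)
```

The names of toy_freq play the same role as names(word_frequencies), and its values the same role as the frequencies themselves.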
Set seed for reproducibility
set.seed(5197)
Making wordcloud
wordcloud(words = names(word_frequencies),
          freq = word_frequencies,
          min.freq = 3,         # minimum frequency for a word to be plotted
          scale = c(4, .5),     # font size range, from the most to the least frequent word
          max.words = 200,      # maximum number of words to plot
          random.order = TRUE,  # plot words in random order rather than by decreasing frequency
          colors = brewer.pal(12, "Paired")) # colour palette, applied from least to most frequent