Word clouds are used to visualize language and word frequency. A word cloud uses size, color, and position to emphasize the words that appear most often in a given text. They are a qualitative visualization that can be produced from text analysis.
I will go through the steps used to make a word cloud in R. The data used here come from open-ended comments in a survey, but the same approach works with any text.
words<-read.csv('words.csv')$comments #read in the data and pull out the 'comments' column
In this file, the comments of interest are stored in a column called ‘comments’. Because read.csv() imports them as a factor, they should first be converted to a character vector for text mining.
str(words)
## Factor w/ 962 levels "","\"...A choice, right now, between fear and love. The eyes of fear want you to put bigger locks on your door, buy guns, close yo"| __truncated__,..: 1 1 1 1 1 1 1 356 1 1 ...
words<-as.character(words)
The ‘tm’ package (text mining) can be used to clean up the text. First, the text should be converted to a ‘Corpus’, a collection of the text we would like to use. Within the Corpus, each comment is treated as a separate document made up of terms.
library(tm) #text mining
## Loading required package: NLP
library(dplyr)
## Warning: package 'dplyr' was built under R version 3.2.5
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
word.corpus<-Corpus(VectorSource(words)) #convert the character vector into a Corpus
From here, the terms need to be cleaned up. This can be done using ‘tm_map()’ with various calls that will remove different elements, like punctuation, whitespace and numbers:
word.corpus<-word.corpus%>%
tm_map(removePunctuation)%>% ##eliminate punctuation
tm_map(removeNumbers)%>% #no numbers
tm_map(stripWhitespace)#white spaces
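To check that the cleaning worked, a single comment can be printed out. A quick sketch (document 8 here is just an arbitrary index):
writeLines(as.character(word.corpus[[8]])) #print one cleaned comment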
It may also be desirable to remove common words like “and” or “the” and other stopwords.
word.corpus<-word.corpus%>%
tm_map(content_transformer(tolower))%>% ##make all words lowercase (content_transformer keeps the corpus structure intact)
tm_map(removeWords, stopwords("english")) ##remove common English stopwords
Any other specific words can be removed as well:
word.corpus <- tm_map(word.corpus, removeWords, c("the", "and","for","this","that","with","will","also","i'm"))
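The built-in list can be viewed with ‘stopwords()’, and it can be combined with extra terms into a single vector for one ‘removeWords’ call. A small sketch (the added words are only examples):
head(stopwords("english")) #peek at the built-in stopword list
my.stopwords<-c(stopwords("english"), "also", "im") #one combined vector to pass to removeWords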
Text stemming can be used to reduce different forms of the same core word (for example, “change”, “changes”, and “changing”) to a single stem so they are counted together.
word.corpus<-tm_map(word.corpus, stemDocument) #stem each document
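To see what stemming does, the stemmer can be run on a few words directly. A quick sketch using ‘wordStem()’ from the SnowballC package, which ‘stemDocument()’ relies on:
library(SnowballC)
wordStem(c("change", "changes", "changing")) #all three reduce to the stem "chang"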
Next, word frequencies can be calculated from the cleaned text. Sorting the counts ranks the terms, showing which were used most.
word.counts<-as.matrix(TermDocumentMatrix(word.corpus)) #term-by-document matrix of counts
word.freq<-sort(rowSums(word.counts), decreasing=TRUE) #total count per term, sorted from most to least frequent
head(word.freq)##what are the top words?
## science climate women change support funding
## 272 220 166 141 87 64
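tm can also pull out every term above a chosen frequency straight from the term-document matrix with ‘findFreqTerms()’. A sketch (the cutoff of 50 is arbitrary):
findFreqTerms(TermDocumentMatrix(word.corpus), lowfreq=50) #terms appearing at least 50 times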
library(wordcloud) #wordcloud
set.seed(32) #set a seed so the same layout can be reproduced
wordcloud(words=names(word.freq), freq=word.freq, scale=c(3,.5),max.words = 100, random.order = TRUE)
Word clouds can be customized in many ways. The ‘scale’ argument changes the range of text sizes, and the ‘colors’ argument changes the color palette.
library(wesanderson)
wordcloud(words=names(word.freq), freq=word.freq, scale=c(4,.3),max.words = 100,
random.order = TRUE, colors=wes_palette("Darjeeling"))
‘random.order’ controls whether words are placed in random order or with the most frequent words toward the center, ‘max.words’ sets how many words to include, and ‘rot.per’ sets the proportion of words that are rotated 90 degrees.
wordcloud(words=names(word.freq), freq=word.freq, scale=c(4,.3),max.words = 150,
random.order = FALSE, colors=wes_palette("Darjeeling"),rot.per=.7)
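The ‘colors’ argument also works with RColorBrewer palettes, and ‘min.freq’ drops rare terms entirely. A sketch with an arbitrary palette and cutoff:
library(RColorBrewer)
wordcloud(words=names(word.freq), freq=word.freq, scale=c(4,.3), min.freq=5, max.words=100,
random.order=FALSE, colors=brewer.pal(8, "Dark2"), rot.per=.2)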