Word clouds are used to visualize language and word frequency. A word cloud uses size, color, and position to emphasize the words that appear most often in a given text. They are a qualitative visualization that can be produced from text analysis.
I will go through the steps used to make a word cloud in R. The data used here come from open-ended comments in a survey, but the same approach works with any text.
words<-read.csv('words.csv')$comments #read in the data and pull out the 'comments' column
In this file, the comments of interest are stored in a column called ‘comments’. Because read.csv() imports them as a factor, they should first be converted to a character vector for text mining.
str(words)
## Factor w/ 962 levels "","\"...A choice, right now, between fear and love. The eyes of fear want you to put bigger locks on your door, buy guns, close yo"| __truncated__,..: 1 1 1 1 1 1 1 356 1 1 ...
words<-as.character(words)
The ‘tm’ package (text mining) can be used to clean up the text. First, the text should be converted to a ‘Corpus’, a collection of the text we would like to use. Within the Corpus, each comment is treated as a separate document made up of terms.
library(tm) #text mining
## Loading required package: NLP
library(dplyr)
## Warning: package 'dplyr' was built under R version 3.2.5
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
word.corpus<-Corpus(VectorSource(words)) #convert the character vector into a Corpus
From here, the terms need to be cleaned up. This can be done using ‘tm_map()’ with various calls that will remove different elements, like punctuation, whitespace and numbers:
word.corpus<-word.corpus%>%
tm_map(removePunctuation)%>% ##eliminate punctuation
tm_map(removeNumbers)%>% #no numbers
tm_map(stripWhitespace)#white spaces
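To check that the cleaning worked, a single comment can be printed out. A quick sketch (document 8 here is just an arbitrary index):
writeLines(as.character(word.corpus[[8]])) #print one cleaned comment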
It may also be desirable to remove common words like “and” or “the” and other stopwords.
word.corpus<-word.corpus%>%
tm_map(content_transformer(tolower))%>% ##make all words lowercase (content_transformer keeps the corpus structure intact)
tm_map(removeWords, stopwords("english")) ##remove common English stopwords
Any other specific words can be removed as well:
word.corpus <- tm_map(word.corpus, removeWords, c("the", "and","for","this","that","with","will","also","i'm"))
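The built-in list can be viewed with ‘stopwords()’, and it can be combined with extra terms into a single vector for one ‘removeWords’ call. A small sketch (the added words are only examples):
head(stopwords("english")) #peek at the built-in stopword list
my.stopwords<-c(stopwords("english"), "also", "im") #one combined vector to pass to removeWords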
Text stemming can be used to reduce different forms of the same core word (for example, “change”, “changes”, and “changing”) to a single stem so they are counted together.
word.corpus<-tm_map(word.corpus, stemDocument) #stem each document
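To see what stemming does, the stemmer can be run on a few words directly. A quick sketch using ‘wordStem()’ from the SnowballC package, which ‘stemDocument()’ relies on:
library(SnowballC)
wordStem(c("change", "changes", "changing")) #all three reduce to the stem "chang"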
Next, word frequencies can be calculated from the cleaned text. Sorting the counts ranks the terms, showing which were used most.
word.counts<-as.matrix(TermDocumentMatrix(word.corpus)) #term-by-document matrix of counts
word.freq<-sort(rowSums(word.counts), decreasing=TRUE) #total count per term, sorted from most to least frequent
head(word.freq)##what are the top words?
## science climate women change support funding
## 272 220 166 141 87 64
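tm can also pull out every term above a chosen frequency straight from the term-document matrix with ‘findFreqTerms()’. A sketch (the cutoff of 50 is arbitrary):
findFreqTerms(TermDocumentMatrix(word.corpus), lowfreq=50) #terms appearing at least 50 times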
library(wordcloud) #wordcloud
set.seed(32) #set a seed so the same layout can be reproduced
wordcloud(words=names(word.freq), freq=word.freq, scale=c(3,.5),max.words = 100, random.order = TRUE)
Word clouds can be customized in many ways. The ‘scale’ argument changes the range of text sizes, and the ‘colors’ argument changes the color palette.
library(wesanderson)
wordcloud(words=names(word.freq), freq=word.freq, scale=c(4,.3),max.words = 100,
random.order = TRUE, colors=wes_palette("Darjeeling"))
‘random.order’ controls whether words are placed in random order or with the most frequent words toward the center, ‘max.words’ sets how many words to include, and ‘rot.per’ sets the proportion of words that are rotated 90 degrees.
wordcloud(words=names(word.freq), freq=word.freq, scale=c(4,.3),max.words = 150,
random.order = FALSE, colors=wes_palette("Darjeeling"),rot.per=.7)
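The ‘colors’ argument also works with RColorBrewer palettes, and ‘min.freq’ drops rare terms entirely. A sketch with an arbitrary palette and cutoff:
library(RColorBrewer)
wordcloud(words=names(word.freq), freq=word.freq, scale=c(4,.3), min.freq=5, max.words=100,
random.order=FALSE, colors=brewer.pal(8, "Dark2"), rot.per=.2)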