Word clouds are used to visualize language and word frequency. A word cloud uses size, color, and position to emphasize the words that appear most often in a given text. It is a qualitative visualization produced from text analysis.
We will walk through the steps to make a word cloud in R.
library(readr)
concrete_maxset <- read_csv("~/concrete_maxset_csv.csv")
head(concrete_maxset)
## # A tibble: 6 x 8
##   Position Requirements Subjects         Actio~ Resou~ Proce~ Conne~ X8
##      <int> <chr>        <chr>            <chr>  <chr>  <chr>  <chr>  <chr>
## 1        1 <NA>         The Project Gut~ <NA>   <NA>   <NA>   <NA>   <NA>
## 2        2 of           Concrete Constr~ <NA>   <NA>   <NA>   <NA>   <NA>
## 3        3 <NA>         <NA>             <NA>   <NA>   <NA>   [','] <NA>
## 4        4 by           Halbert P. Gill~ <NA>   <NA>   <NA>   <NA>   <NA>
## 5        4 <NA>         This eBook is    <NA>   <NA>   <NA>   <NA>   <NA>
## 6        5 for          the use          <NA>   <NA>   <NA>   <NA>   <NA>
Our data consists of 8 variables: Position, Requirements, Subjects, Actions, Resources, Processes, Connections, and X8.
In this file, the comments of interest are stored in the ‘Subjects’ column. First, this column should be extracted and converted to a character vector for text mining.
concrete_maxset <- as.character(concrete_maxset$Subjects)
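Many rows of ‘Subjects’ are empty, so it can help to drop the missing entries before building the corpus. A minimal optional sketch, assuming the NA rows carry no text we want to mine:
concrete_maxset <- concrete_maxset[!is.na(concrete_maxset)] # drop missing comments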
The ‘tm’ (text mining) package can be used to clean up the text. First, the text should be converted to a ‘Corpus’, a collection of the text we would like to mine. Within each corpus, the separate comments are treated as different documents full of terms.
library(tm)    # text mining
library(dplyr) # for the %>% pipe
concrete_maxset.corpus <- Corpus(VectorSource(concrete_maxset))
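To confirm the corpus was built as expected, you can peek at the first couple of documents (a quick optional check):
inspect(concrete_maxset.corpus[1:2]) # print the first two documents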
From here, the terms need to be cleaned up. This can be done using ‘tm_map()’ with various transformations that remove different elements, such as punctuation, numbers, and extra whitespace:
concrete_maxset.corpus <- concrete_maxset.corpus %>%
  tm_map(removePunctuation) %>% # eliminate punctuation
  tm_map(removeNumbers) %>%     # remove numbers
  tm_map(stripWhitespace)       # collapse extra white space
Next, all words can be converted to lowercase and common English stop words removed:
concrete_maxset.corpus <- concrete_maxset.corpus %>%
  tm_map(content_transformer(tolower)) %>%  # make all words lowercase
  tm_map(removeWords, stopwords("english")) # drop common English stop words
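If you are curious which words the built-in English list removes, you can inspect it directly:
head(stopwords("english"), 10) # first ten of the default stop words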
Any other custom words can be removed as well:
concrete_maxset.corpus <- tm_map(concrete_maxset.corpus, removeWords,
                                 c("the", "and", "for", "this", "that",
                                   "with", "will", "also", "i'm"))
Text stemming can be used to collapse inflected forms of the same core word to a common stem (e.g. ‘forms’, ‘formed’, and ‘forming’ all become ‘form’).
concrete_maxset.corpus <- tm_map(concrete_maxset.corpus, stemDocument)
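To see what the stemmer does, try it on a few related words (a quick check; tm uses the Porter stemmer from SnowballC):
stemDocument(c("forms", "formed", "forming"))
## [1] "form" "form" "form"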
Next, word frequencies can be calculated from the cleaned text and sorted so the most frequently used terms appear first.
concrete_maxset.counts <- as.matrix(TermDocumentMatrix(concrete_maxset.corpus)) # rows are terms, columns are documents
concrete_maxset.freq <- sort(rowSums(concrete_maxset.counts), decreasing = TRUE) # total count per term
head(concrete_maxset.freq) # what are the top words?
## concret per form cost work use
## 2714 2052 1683 1486 1333 1039
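For a tabular view of the full ranking, the named frequency vector can be converted to a data frame (a minimal sketch; the ‘concrete_maxset.df’ name is ours):
concrete_maxset.df <- data.frame(word = names(concrete_maxset.freq),
                                 freq = concrete_maxset.freq)
head(concrete_maxset.df, 10) # ten most frequent stems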
‘Concret’ is the most used word, appearing 2,714 times, followed by ‘per’ (2,052 times) and ‘form’ (1,683 times). Note that stemming reduced ‘concrete’ to ‘concret’.
Let’s build the word cloud.
library(wordcloud) #wordcloud
## Loading required package: RColorBrewer
set.seed(32) # set a seed so the word cloud layout is reproducible
wordcloud(words = names(concrete_maxset.freq), freq = concrete_maxset.freq,
          scale = c(3, .5), max.words = 100, random.order = TRUE)
As we can see, ‘concret’, ‘form’, ‘use’, ‘cost’, and ‘work’ were commonly used. ‘Sand’, ‘stone’, ‘cement’, ‘place’, and ‘mixer’ also appear, suggesting our document deals extensively with building and construction.
We can add color to the word cloud using the wesanderson palette library.
library(wesanderson)
wordcloud(words = names(concrete_maxset.freq), freq = concrete_maxset.freq,
          scale = c(4, .3), max.words = 100, random.order = TRUE,
          colors = wes_palette("Darjeeling"))
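To see which palettes wesanderson provides, list the exported ‘wes_palettes’ object. Note that newer versions of the package renamed ‘Darjeeling’ to ‘Darjeeling1’, so adjust the palette name if the call above errors:
names(wes_palettes) # available palette names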
‘random.order = FALSE’ places the most frequent words at the center of the cloud rather than at random positions, ‘max.words’ caps how many words are included, and ‘rot.per’ sets the proportion of words rotated 90 degrees.
wordcloud(words = names(concrete_maxset.freq), freq = concrete_maxset.freq,
          scale = c(4, .3), max.words = 150, random.order = FALSE,
          colors = wes_palette("Darjeeling"), rot.per = .7)