Word clouds are used to visualize language and word frequency. A word cloud uses size, color, and position to emphasize the words that appear most often in a given text. It is a qualitative visualization produced from text analysis.
We will walk through the steps to make a word cloud in R.
library(readr)
concrete_maxset <- read_csv("~/concrete_maxset_csv.csv")
head(concrete_maxset)
## # A tibble: 6 x 8
##   Position Requirements Subjects         Actio~ Resou~ Proce~ Conne~ X8
##      <int> <chr>        <chr>            <chr>  <chr>  <chr>  <chr>  <chr>
## 1        1 <NA>         The Project Gut~ <NA>   <NA>   <NA>   <NA>   <NA>
## 2        2 of           Concrete Constr~ <NA>   <NA>   <NA>   <NA>   <NA>
## 3        3 <NA>         <NA>             <NA>   <NA>   <NA>   [','] <NA>
## 4        4 by           Halbert P. Gill~ <NA>   <NA>   <NA>   <NA>   <NA>
## 5        4 <NA>         This eBook is    <NA>   <NA>   <NA>   <NA>   <NA>
## 6        5 for          the use          <NA>   <NA>   <NA>   <NA>   <NA>
Our data consists of 8 variables: Position, Requirements, Subjects, Actions, Resources, Processes, Connections, and X8.
In this file, the comments of interest are stored in the ‘Subjects’ column. First, this column should be extracted and converted to a character vector for text mining.
concrete_maxset <- as.character(concrete_maxset$Subjects)
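Many rows of ‘Subjects’ are empty, so it can help to drop the missing entries before building the corpus. A minimal optional sketch, assuming the NA rows carry no text we want to mine:
concrete_maxset <- concrete_maxset[!is.na(concrete_maxset)] # drop missing comments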
The ‘tm’ (text mining) package can be used to clean up the text. First, the text should be converted to a ‘Corpus’, a collection of the text we would like to mine. Within each corpus, the separate comments are treated as different documents full of terms.
library(tm)    # text mining
library(dplyr) # for the %>% pipe
concrete_maxset.corpus <- Corpus(VectorSource(concrete_maxset))
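To confirm the corpus was built as expected, you can peek at the first couple of documents (a quick optional check):
inspect(concrete_maxset.corpus[1:2]) # print the first two documents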
From here, the terms need to be cleaned up. This can be done using ‘tm_map()’ with various transformations that remove different elements, such as punctuation, numbers, and extra whitespace:
concrete_maxset.corpus <- concrete_maxset.corpus %>%
  tm_map(removePunctuation) %>% # eliminate punctuation
  tm_map(removeNumbers) %>%     # remove numbers
  tm_map(stripWhitespace)       # collapse extra white space
Next, all words can be converted to lowercase and common English stop words removed:
concrete_maxset.corpus <- concrete_maxset.corpus %>%
  tm_map(content_transformer(tolower)) %>%  # make all words lowercase
  tm_map(removeWords, stopwords("english")) # drop common English stop words
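If you are curious which words the built-in English list removes, you can inspect it directly:
head(stopwords("english"), 10) # first ten of the default stop words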
Any other custom words can be removed as well:
concrete_maxset.corpus <- tm_map(concrete_maxset.corpus, removeWords,
                                 c("the", "and", "for", "this", "that",
                                   "with", "will", "also", "i'm"))
Text stemming can be used to collapse inflected forms of the same core word to a common stem (e.g. ‘forms’, ‘formed’, and ‘forming’ all become ‘form’).
concrete_maxset.corpus <- tm_map(concrete_maxset.corpus, stemDocument)
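To see what the stemmer does, try it on a few related words (a quick check; tm uses the Porter stemmer from SnowballC):
stemDocument(c("forms", "formed", "forming"))
## [1] "form" "form" "form"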
Next, word frequencies can be calculated from the cleaned text and sorted so the most frequently used terms appear first.
concrete_maxset.counts <- as.matrix(TermDocumentMatrix(concrete_maxset.corpus)) # rows are terms, columns are documents
concrete_maxset.freq <- sort(rowSums(concrete_maxset.counts), decreasing = TRUE) # total count per term
head(concrete_maxset.freq) # what are the top words?
## concret per form cost work use
## 2714 2052 1683 1486 1333 1039
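For a tabular view of the full ranking, the named frequency vector can be converted to a data frame (a minimal sketch; the ‘concrete_maxset.df’ name is ours):
concrete_maxset.df <- data.frame(word = names(concrete_maxset.freq),
                                 freq = concrete_maxset.freq)
head(concrete_maxset.df, 10) # ten most frequent stems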
‘Concret’ is the most used word, appearing 2,714 times, followed by ‘per’ (2,052 times) and ‘form’ (1,683 times). Note that stemming reduced ‘concrete’ to ‘concret’.
Let’s build the word cloud.
library(wordcloud) #wordcloud
## Loading required package: RColorBrewer
set.seed(32) # set a seed so the word cloud layout is reproducible
wordcloud(words = names(concrete_maxset.freq), freq = concrete_maxset.freq,
          scale = c(3, .5), max.words = 100, random.order = TRUE)
As we can see, ‘concret’, ‘form’, ‘use’, ‘cost’, and ‘work’ were commonly used. ‘Sand’, ‘stone’, ‘cement’, ‘place’, and ‘mixer’ also appear, suggesting our document deals extensively with building and construction.
We can add color to the word cloud using the wesanderson palette library.
library(wesanderson)
wordcloud(words = names(concrete_maxset.freq), freq = concrete_maxset.freq,
          scale = c(4, .3), max.words = 100, random.order = TRUE,
          colors = wes_palette("Darjeeling"))
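To see which palettes wesanderson provides, list the exported ‘wes_palettes’ object. Note that newer versions of the package renamed ‘Darjeeling’ to ‘Darjeeling1’, so adjust the palette name if the call above errors:
names(wes_palettes) # available palette names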
‘random.order = FALSE’ places the most frequent words at the center of the cloud rather than at random positions, ‘max.words’ caps how many words are included, and ‘rot.per’ sets the proportion of words rotated 90 degrees.
wordcloud(words = names(concrete_maxset.freq), freq = concrete_maxset.freq,
          scale = c(4, .3), max.words = 150, random.order = FALSE,
          colors = wes_palette("Darjeeling"), rot.per = .7)