April 26, 2019

Getting started

Creating a wordcloud with R is easy.

But first, you will need a few packages:

  • tm
  • NLP
  • wordcloud

and

  • readr to read the data
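
If you are not sure which of these are already on your machine, a quick check helps. This is a minimal sketch; the helper name `missing_packages` is made up for this example, and the install line is left commented out so nothing is downloaded by accident.

```r
# Packages used in this walkthrough
pkgs <- c("tm", "NLP", "wordcloud", "readr")

# Hypothetical helper: return the packages from `pkgs` that are not installed
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(pkgs)
if (length(to_install) > 0) {
  message("Missing: ", paste(to_install, collapse = ", "))
  # install.packages(to_install)  # uncomment to install from CRAN
}
```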

Tools

Once you have those packages installed, load them:

library(tm)
library(NLP)
library(wordcloud)
library(readr)

Text

nms <- read_lines(
  "data/webdb_niagara_movement_speech.txt")

Cleaning & Processing

An important step is the text cleaning and transformation process:

# Load the text as corpus
nmsCorpus <- Corpus(VectorSource(nms))

# Clean the text with tm_map
# (content_transformer() wraps base functions such as tolower for tm)
nmsCorpus <- tm_map(nmsCorpus, content_transformer(tolower))
nmsCorpus <- tm_map(nmsCorpus, removePunctuation)
nmsCorpus <- tm_map(nmsCorpus, removeWords,
                    stopwords('english'))
nmsCorpus <- tm_map(nmsCorpus, removeNumbers)

# Build a term document matrix
nms_dtm <- TermDocumentMatrix(nmsCorpus)
nms_dtm_matrix <- as.matrix(nms_dtm)

# Find the word frequencies by summing each term's counts
# across the documents (the rows of the tdm)
v <- sort(rowSums(nms_dtm_matrix), decreasing = TRUE)

# Turn the sorted frequency vector into a data frame
d <- data.frame(word = names(v), freq = v)
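
To see what each cleaning step actually does, it can help to run the same pipeline on a tiny input. The sketch below uses two made-up sentences rather than the speech. It also differs from the code above in three ways, all assumptions for illustration: it builds a `VCorpus`, it wraps `tolower` in `content_transformer()`, and it adds a `stripWhitespace` step to tidy the gaps left by removed words.

```r
library(tm)

# Toy input: two short "documents" (not taken from the speech)
toy <- c("The Quick brown fox, in 1905!", "A quick fox and 2 dogs.")
toyCorpus <- VCorpus(VectorSource(toy))

# Same cleaning steps as in the walkthrough
toyCorpus <- tm_map(toyCorpus, content_transformer(tolower))
toyCorpus <- tm_map(toyCorpus, removePunctuation)
toyCorpus <- tm_map(toyCorpus, removeWords, stopwords("english"))
toyCorpus <- tm_map(toyCorpus, removeNumbers)
toyCorpus <- tm_map(toyCorpus, stripWhitespace)  # extra step, not in the original

# Inspect the cleaned text of each document
sapply(toyCorpus, content)

# Term frequencies across both documents, most frequent first
toy_tdm <- TermDocumentMatrix(toyCorpus)
toy_freq <- sort(rowSums(as.matrix(toy_tdm)), decreasing = TRUE)
toy_freq
```

Each element of the input vector becomes one document, so a text file read line by line gives one document per line.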

Coding the wordcloud

Now, we’re ready to create our wordcloud:

set.seed(1234)
wordcloud(words = d$word, freq = d$freq, min.freq = 2,
          max.words=200, random.order=FALSE, rot.per=0.35, 
          colors=brewer.pal(8, "Dark2"))
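
If you want the wordcloud as an image file instead of an on-screen plot, one option is to wrap the call in a graphics device. This is a sketch under assumptions: `toy_d` is a stand-in for the `d` data frame, and the output path in `tempdir()` is just a placeholder.

```r
library(wordcloud)  # also attaches RColorBrewer, which provides brewer.pal()

# Stand-in frequency table; in the walkthrough this would be `d`
toy_d <- data.frame(
  word = c("work", "right", "men", "race", "education"),
  freq = c(9, 7, 6, 5, 4)
)

out_file <- file.path(tempdir(), "wordcloud.png")  # hypothetical output path

png(out_file, width = 800, height = 800)
set.seed(1234)  # fixes the random layout so the image is reproducible
wordcloud(words = toy_d$word, freq = toy_d$freq, min.freq = 2,
          random.order = FALSE, rot.per = 0.35,
          colors = brewer.pal(8, "Dark2"))
dev.off()  # close the device so the file is written
```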

Plotting the wordcloud

Now let’s take a look:

Wordcloud of W.E.B. Du Bois’s Niagara Movement Speech

Analyzing

Let’s find the most frequent words:

findFreqTerms(nms_dtm, lowfreq = 4)
##  [1] "black"          "discrimination" "men"            "simply"        
##  [5] "work"           "manhood"        "right"          "white"         
##  [9] "will"           "want"           "race"           "south"         
## [13] "education"      "john"           "violence"

Analyzing

And the most common associations:

findAssocs(nms_dtm, terms = "work", corlimit = 0.65)
## $work
##        actually          afraid             ask           bread        brethren 
##            0.77            0.77            0.77            0.77            0.77 
##         capital        citizens          coming           daily       decencies 
##            0.77            0.77            0.77            0.77            0.77 
##       defenders         earning           fifty      flourished            hard 
##            0.77            0.77            0.77            0.77            0.77 
##           hater         hearing          moment            name        nation’s 
##            0.77            0.77            0.77            0.77            0.77 
##        ordinary         pausing      progressed representatives       retreated 
##            0.77            0.77            0.77            0.77            0.77 
##          spread        stealing          stolen         thunder            toil 
##            0.77            0.77            0.77            0.77            0.77 
##          travel            turn          weaker      whispering            year 
##            0.77            0.77            0.77            0.77            0.77 
##          year’s            step 
##            0.77            0.67
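
The association scores are correlations between the per-document counts of two terms. Terms with identical count patterns (for example, words that each appear once, in the same document) get identical scores, which is one plausible reason for the long run of 0.77 values above. Here is a small sketch on a made-up corpus that compares `findAssocs()` with a direct call to `cor()`; the toy sentences and variable names are assumptions for illustration.

```r
library(tm)

# Toy corpus of four one-line "documents" (not taken from the speech)
toy <- c("work hard toil", "work hard", "rest", "work rest")
toy_tdm <- TermDocumentMatrix(VCorpus(VectorSource(toy)))

# Terms whose correlation with "work" is at least corlimit
toy_assoc <- findAssocs(toy_tdm, terms = "work", corlimit = 0.5)
toy_assoc$work

# The same quantity by hand: correlate the counts of two terms across documents
m <- as.matrix(toy_tdm)
by_hand <- cor(m["work", ], m["hard", ])
by_hand
```

Because every line of the input file is treated as a separate document, these scores describe which words tend to share a line with "work", not which words are close in meaning.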

Plotting

And create a barplot: