Module 3 Assignment

For this project, we analyzed Ambrose Bierce’s The Devil’s Dictionary, a satirical lexicon containing over 1,000 definitions. We obtained the text from Project Gutenberg (https://www.gutenberg.org), downloading it and saving it locally as a plain-text file (TDD.txt).

Analysis Using RStudio

1. Install & Load Necessary Packages

If you don’t have the packages installed yet, first install “readr”, “wordcloud”, “tm”, and “SnowballC”.
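
For example, all four can be installed in one call:

install.packages(c("readr", "wordcloud", "tm", "SnowballC"))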

Then load these packages with library().

library(readr)
library(wordcloud)
library(tm)
library(SnowballC)

2. Read Text File

Read the text file into RStudio by first setting the working directory and then reading the lines into R.

setwd("~/R")                  # set the working directory to the folder that holds TDD.txt
text <- readLines("TDD.txt")  # read the file into a character vector, one element per line

3. Load Data as a VCorpus

Converting the data to a VCorpus allows us to use the tm package to clean the text.

tddcorp <- VCorpus(VectorSource(text))  # each line of the file becomes its own document

You can use the inspect() function to view the text and make sure it loaded properly. We used inspect(tddcorp).
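
Because the corpus holds one document per line of the file, inspecting the whole object prints a great deal of output; subsetting first keeps it manageable, e.g.:

inspect(tddcorp[1:3])  # show only the first three documents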

4. Clean the Text

Replace special characters with a space (" ") using the tm_map() function.

toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
tddcorp <- tm_map(tddcorp, toSpace, "/")
tddcorp <- tm_map(tddcorp, toSpace, "@")
tddcorp <- tm_map(tddcorp, toSpace, "\\|")  # "|" is escaped because gsub() reads the pattern as a regex

Use the tm_map() function to:

  • convert text to lower case
  • remove numbers
  • remove common English stopwords
  • remove your own specified stopword(s)
  • remove punctuation
  • strip extra whitespace

tddcorp <- tm_map(tddcorp, content_transformer(tolower))       # convert to lower case
tddcorp <- tm_map(tddcorp, removeNumbers)                      # remove numbers
tddcorp <- tm_map(tddcorp, removeWords, stopwords("english"))  # remove common English stopwords
# Remove our own stopwords. This step runs before removePunctuation, which is
# why tokens such as "it,", "adj.", and "said," still carry punctuation here.
tddcorp <- tm_map(tddcorp, removeWords,
                  c("thy", "electronic", "literary", "first", "now", "laws", "saw", "gutenbergtm",
                    "give", "away", "still", "say", "forth", "every", "used", "till", "two",
                    "another", "known", "ear", "we", "whole", "persons", "several", "might",
                    "erms", "hair", "twas", "says", "however", "something", "things", "shall",
                    "sometimes", "way", "many", "adj", "and", "that", "his", "with", "for",
                    "was", "not", "one", "have", "but", "from", "who", "are", "which", "this",
                    "you", "all", "they", "their", "has", "had", "when", "been", "were", "your",
                    "long", "him", "than", "upon", "most", "person", "some", "our", "other",
                    "great", "work", "there", "word", "adj.", "may", "use", "more", "without",
                    "read", "hear", "ones", "make", "said", "what", "into", "mani", "would",
                    "part", "out", "gutenbergtm", "whose", "them", "onli", "made", "where",
                    "can", "like", "about", "her", "know", "it,", "set", "anoth", "these",
                    "should", "everi", "then", "natur", "it.", "could", "those", "the", "onc",
                    "must", "did", "come", "take", "through", "provid", "someth", "came", "see",
                    "though", "well", "words", "little", "means", "how", "set", "ever", "less",
                    "line", "and,", "much", "even", "look", "nor", "get", "veri", "said,",
                    "let", "thi", "term", "sens", "however,", "littl", "same", "serv", "tis",
                    "yet", "too", "to"))
tddcorp <- tm_map(tddcorp, removePunctuation)                  # remove punctuation
tddcorp <- tm_map(tddcorp, stripWhitespace)                    # collapse extra whitespace
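
SnowballC, loaded in step 1, is only needed if you also stem the corpus. We did not stem in the pipeline above, but the stemmed-looking entries in our custom stopword list (e.g., "onli", "everi", "littl") show what a stemming pass produces; if you want to try it, a minimal sketch is:

tddcorp <- tm_map(tddcorp, stemDocument)  # Porter stemmer, e.g. "every" -> "everi"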

5. Create a Term Document Matrix (TDM)

A TDM is a table with one row per term and one column per document; summing across each row gives every word's total frequency in the text.

tdm <- TermDocumentMatrix(tddcorp)        # rows are terms, columns are documents
m <- as.matrix(tdm)
v <- sort(rowSums(m), decreasing = TRUE)  # total count per word, highest first
d <- data.frame(word = names(v), freq = v)
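
As a quick cross-check that doesn't require building the data frame, tm's findFreqTerms() lists every term at or above a chosen frequency (the threshold of 50 here is just an example):

findFreqTerms(tdm, lowfreq = 50)  # all terms appearing 50 or more times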

Use the head() function to view the top 10 most frequent words.

head(d, 10)
##                    word freq
## man                 man  137
## project         project   87
## will               will   87
## good               good   77
## day                 day   58
## dead               dead   57
## gutenbergtm gutenbergtm   56
## god                 god   52
## old                 old   51
## time               time   51
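
Note that "gutenbergtm" survives even though it is in our custom stopword list: removeWords() runs before removePunctuation(), so at that stage the text still reads "gutenberg-tm" and the word is never matched. A simple fix, assuming you also treat "project" as Project Gutenberg boilerplate, is a second removal pass at the end of step 4:

tddcorp <- tm_map(tddcorp, removeWords, c("gutenbergtm", "project"))  # after removePunctuation
tddcorp <- tm_map(tddcorp, stripWhitespace)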

6. Create Word Cloud

Create a word cloud based on the importance/frequency of words.

set.seed(1234)
wordcloud(words = d$word, freq = d$freq, min.freq = 1,
          max.words = 40, random.order = FALSE, rot.per = 0.35,
          colors = brewer.pal(8, "Dark2"))
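
Here set.seed(1234) makes the randomized layout reproducible, and brewer.pal() comes from RColorBrewer, which loads along with wordcloud. To save the cloud to a file rather than the plot viewer, one option is to wrap the call in a graphics device (the filename and dimensions below are just placeholders):

png("tdd_wordcloud.png", width = 800, height = 800)
wordcloud(words = d$word, freq = d$freq, min.freq = 1,
          max.words = 40, random.order = FALSE, rot.per = 0.35,
          colors = brewer.pal(8, "Dark2"))
dev.off()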

7. Create Bar Plot

To view the information differently, create a bar plot. This plot shows the 30 most frequent words.

barplot(d[1:30,]$freq, las = 2, names.arg = d[1:30,]$word,  # las = 2 rotates the axis labels
        col = "darkred", main = "Most Frequent Words",
        ylab = "Word frequencies", xlab = "Words")
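
If the rotated word labels get clipped at the bottom of the plot, widening the bottom margin before calling barplot() usually fixes it; the values below are a starting point to adjust to taste:

par(mar = c(7, 4, 4, 2))  # margins in lines of text: bottom, left, top, right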