For this project, we decided to analyze Ambrose Bierce’s The Devil’s Dictionary, a satirical lexicon with over 1,000 definitions. We obtained the text from Project Gutenberg (https://www.gutenberg.org) and saved it locally as a plain-text file (TDD.txt).
If you don’t have the packages installed yet, first install “readr”, “wordcloud”, “tm”, and “SnowballC”.
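You can install them all in one call, for example:
install.packages(c("readr", "wordcloud", "tm", "SnowballC"))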
Then load these packages with library().
library(readr)
library(wordcloud)
library(tm)
library(SnowballC)
Read the text file into RStudio by first setting a working directory and then reading the lines into R.
setwd("~/R")
text <- readLines("TDD.txt")
Converting the data to a VCorpus allows us to use the tm package to clean the text.
tddcorp <- VCorpus(VectorSource(text))
You can use the inspect() function to view the corpus and make sure the text loaded properly. We used inspect(tddcorp).
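inspect(tddcorp)
If the full output is too long, inspecting a subset, e.g. inspect(tddcorp[1:5]), prints just the first few documents.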
Replace special characters with a space (" ") using the tm_map() function and a custom content transformer.
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
tddcorp <- tm_map(tddcorp, toSpace, "/")
tddcorp <- tm_map(tddcorp, toSpace, "@")
tddcorp <- tm_map(tddcorp, toSpace, "\\|")
Use the tm_map() function to convert the text to lowercase, remove numbers, remove standard English stopwords and a custom list of additional stopwords, remove punctuation, and strip extra whitespace:
tddcorp <- tm_map(tddcorp, content_transformer(tolower))
tddcorp <- tm_map(tddcorp, removeNumbers)
tddcorp <- tm_map(tddcorp, removeWords, stopwords("english"))
tddcorp <- tm_map(tddcorp, removeWords, c("thy", "electronic", "literary", "first", "now", "laws", "saw", "gutenbergtm", "give", "away", "still", "say", "forth", "every", "used", "till", "two", "another", "known", "ear", "we", "whole", "persons", "several", "might", "erms", "hair", "twas", "says", "however", "something", "things", "shall", "sometimes", "way", "many", "adj", "and", "that", "his", "with", "for", "was", "not", "one", "have", "but", "from", "who", "are", "which", "this", "you", "all", "they", "their", "has", "had", "when", "been", "were", "your", "long", "him", "than", "upon", "most", "person", "some", "our", "other", "great", "work", "there", "word", "adj.", "may", "use", "more", "without", "read", "hear", "ones", "make", "said", "what", "into", "mani", "would", "part", "out", "gutenbergtm", "whose", "them", "onli", "made", "where", "can", "like", "about", "her", "know", "it,", "set", "anoth", "these", "should", "everi", "then", "natur", "it.", "could", "those", "the", "onc", "must", "did", "come", "take", "through", "provid", "someth", "came", "see", "though", "well", "words", "little", "means", "how", "set", "ever", "less", "line", "and,", "much", "even", "look", "nor", "get", "veri", "said,", "let", "thi", "term", "sens", "however,", "littl", "same", "serv", "tis", "yet", "too", "to"))
tddcorp <- tm_map(tddcorp, removePunctuation)
tddcorp <- tm_map(tddcorp, stripWhitespace)
A term-document matrix (TDM) provides a table listing each word in the text and its frequency. Convert it to a matrix, sum the counts for each word, and sort the totals into a data frame:
tdm <- TermDocumentMatrix(tddcorp)
m <- as.matrix(tdm)
v <- sort(rowSums(m), decreasing = TRUE)
d <- data.frame(word = names(v), freq = v)
Use the head() function to view the 10 most frequent words.
head(d, 10)
## word freq
## man man 137
## project project 87
## will will 87
## good good 77
## day day 58
## dead dead 57
## gutenbergtm gutenbergtm 56
## god god 52
## old old 51
## time time 51
Create a word cloud that sizes words by their frequency in the text.
set.seed(1234)
wordcloud(words = d$word, freq = d$freq, min.freq = 1,
max.words=40, random.order=FALSE, rot.per=0.35,
colors=brewer.pal(8, "Dark2"))
To view the information differently, create a bar plot of the 30 most frequent words.
barplot(d[1:30,]$freq, las = 2, names.arg = d[1:30,]$word,
col ="darkred", main ="Most Frequent Words",
ylab = "Word frequencies", xlab = "Words")