Load the libraries required for this assignment:
library("tm")
library("SnowballC")
library("wordcloud")
This report creates a word cloud from a set of PDF documents. In a word cloud, words with a higher frequency are drawn larger and words with a lower frequency are drawn smaller.
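The idea of mapping frequency to size can be sketched in base R. This is only an illustration: the sample sentence and the size range are made up, and the linear rescaling shown here is an assumption, not necessarily how the wordcloud package computes its sizes.

```r
# Toy text, invented for illustration only.
text <- "data mining finds patterns in data and data drives models"
words <- strsplit(text, " ", fixed = TRUE)[[1]]

# Count how often each word occurs, most frequent first.
freq <- sort(table(words), decreasing = TRUE)

# Linearly rescale the counts to a hypothetical size range (smallest, largest).
size_range <- c(0.7, 6)
counts <- as.numeric(freq)
sizes <- size_range[1] +
  (counts - min(counts)) / (max(counts) - min(counts)) * diff(size_range)
names(sizes) <- names(freq)

print(round(sizes, 2))
```

The most frequent word ends up at the top of the size range and the words that occur once end up at the bottom.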
Our corpus contains only PDF files, so their text has to be extracted before the corpus can be processed. Create a directory named corpus, and inside it create another directory named pdf. The .R file and the corpus directory are in the same directory, so the PDF documents are found at corpus/pdf/*.pdf
file_path = file.path(".","corpus","pdf")
You can list all the files in the directory with the dir function:
dir(file_path)
## [1] "ausdm07.pdf"
## [2] "eJHI06.pdf"
## [3] "hdm05.pdf"
## [4] "jeff.pdf"
## [5] "miningmodels.pdf"
## [6] "performance-modeling-message.pdf"
## [7] "probability_cheatsheet.pdf"
## [8] "RJournal_2009-1_Guazzelli+et+al.pdf"
## [9] "RJournal_2009-2_Williams.pdf"
## [10] "story.pdf"
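Before building the corpus, the directory layout described above can be checked from R. The following is a minimal sketch in base R; the variable name pdf_files is introduced here for illustration and is not part of the original script.

```r
# Same path as in the report: ./corpus/pdf
file_path <- file.path(".", "corpus", "pdf")

# Create corpus/pdf if it does not exist yet (does nothing if it is already there).
dir.create(file_path, recursive = TRUE, showWarnings = FALSE)

# Keep only the files whose names end in .pdf (case-insensitive).
pdf_files <- list.files(file_path, pattern = "\\.pdf$", ignore.case = TRUE)

if (length(pdf_files) == 0) {
  message("No PDF files found in ", file_path)
} else {
  print(pdf_files)
}
```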
Load the files from the directory and build the corpus. Because all of the files are PDF files, we use readPDF as the reader:
myCorpus <- Corpus(DirSource(file_path), readerControl = list(reader = readPDF))
Let's view our corpus:
myCorpus[1]
## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 0
## Content: documents: 1
Next, pre-process the corpus so that it is ready for the word cloud. Text data needs the following standard steps:
1) Build a new corpus variable called myCorpus.
2) Using tm_map, convert the text to lowercase.
3) Using tm_map, remove all punctuation from the corpus.
4) Using tm_map, remove all English stopwords from the corpus.
5) Using tm_map, stem the words in the corpus.
6) Build a document term matrix from the corpus, called myCorpusDTM.
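To make the steps above concrete, here is a rough base R sketch on two made-up sentences. It is not the tm implementation: the documents and the short stop word list are hypothetical, and stemming (step 5) is left out because base R has no stemmer.

```r
# Two invented documents standing in for the corpus (step 1).
docs <- c(doc1 = "The Data, the MODEL and the data!",
          doc2 = "A model of the graph.")

# Hypothetical, very short stop word list.
stop_words <- c("the", "a", "and", "of")

clean <- tolower(docs)                         # step 2: lowercase
clean <- gsub("[[:punct:]]", "", clean)        # step 3: remove punctuation
tokens <- strsplit(clean, "[[:space:]]+")      # split each document into words
names(tokens) <- names(docs)
tokens <- lapply(tokens, function(w) w[nzchar(w) & !(w %in% stop_words)])  # step 4

# Step 6: count each term per document (rows = documents, columns = terms).
terms <- sort(unique(unlist(tokens)))
dtm <- t(vapply(tokens,
                function(w) tabulate(match(w, terms), nbins = length(terms)),
                integer(length(terms))))
colnames(dtm) <- terms

print(dtm)
```

Each row of the resulting matrix is a document and each column is a term, which is the same shape that a document term matrix has.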
Each operation, like stemming or removing stop words, can be done with one line in R,
where we use the tm_map() function, which takes as its arguments the corpus and the transformation to apply to each document.
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
myCorpus <- tm_map(myCorpus, toSpace, "/|@|\\|")
myCorpus <- tm_map(myCorpus,content_transformer(tolower))
myCorpus <- tm_map(myCorpus, removeWords, stopwords("english"))
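To see what the pattern passed to toSpace above matches, the same gsub call can be tried on a plain string. The sample string is made up for illustration.

```r
# Same pattern as above: a slash, an at sign, or a literal pipe.
# In the R string "\\|" the backslash escapes the pipe, so it is not read as "or".
pattern <- "/|@|\\|"

sample_text <- "user@example.com|docs/readme"
cleaned <- gsub(pattern, " ", sample_text)

print(cleaned)
```

Every "/", "@" and "|" in the string is replaced by a space, and all other characters are left unchanged.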
Some words are not in the stop word list, but we still need to remove them because they carry no meaning in our documents:
myCorpus <- tm_map(myCorpus, removeWords, c("one","let","set","prove","path","use","case","follow","number"))
myCorpus <- tm_map(myCorpus, removePunctuation)
myCorpus <- tm_map(myCorpus, stripWhitespace)
myCorpus <- tm_map(myCorpus, removeNumbers)
Stem the documents. For stemming we use the SnowballC package from CRAN:
myCorpus <- tm_map(myCorpus, stemDocument)
Now our corpus is ready for building the word cloud. We first convert it to a document term matrix, using the DocumentTermMatrix function from the tm package:
myCorpusDTM <- DocumentTermMatrix(myCorpus)
Let's inspect the DTM:
inspect(myCorpusDTM[1:4,100:106])
## <<DocumentTermMatrix (documents: 4, terms: 7)>>
## Non-/sparse entries: 6/22
## Sparsity : 79%
## Maximal term length: 13
## Weighting : term frequency (tf)
##
## Terms
## Docs advoc affect aerophobesrus aex afa afflatus affect
## ausdm07.pdf 0 0 0 0 0 0 0
## eJHI06.pdf 0 0 0 0 0 0 1
## hdm05.pdf 0 1 0 0 1 0 0
## jeff.pdf 0 0 1 0 0 1 4
findFreqTerms(myCorpusDTM, lowfreq=100)
## [1] "actual" "algorithm" "also" "alway" "amort"
## [6] "analysi" "analyz" "anoth" "answer" "approxim"
## [11] "arbitrari" "array" "assign" "assum" "base"
## [16] "binari" "bit" "bolt" "bound" "call"
## [21] "can" "capac" "case" "chang" "choos"
## [26] "class" "cluster" "color" "common" "compon"
## [31] "comput" "connect" "consid" "constant" "contain"
## [36] "correct" "cost" "cover" "cut" "cycl"
## [41] "data" "dataset" "decis" "defin" "definit"
## [46] "denot" "depth" "describ" "determin" "develop"
## [51] "differ" "direct" "distanc" "distribut" "dont"
## [56] "dynam" "edg" "edit" "efficient" "either"
## [61] "element" "els" "end" "equal" "even"
## [66] "event" "everi" "exact" "exampl" "expect"
## [71] "fact" "feasibl" "follow" "form" "formula"
## [76] "function" "game" "general" "get" "give"
## [81] "given" "graph" "greedi" "hash" "hint"
## [86] "http" "impli" "includ" "increas" "independ"
## [91] "indic" "induct" "input" "insert" "integ"
## [96] "interest" "item" "jeff" "just" "key"
## [101] "know" "larg" "largest" "least" "lectur"
## [106] "length" "level" "licens" "like" "line"
## [111] "linear" "list" "log" "look" "make"
## [116] "mani" "map" "match" "maximum" "may"
## [121] "mean" "method" "might" "mine" "minimum"
## [126] "model" "move" "multipl" "must" "find"
## [131] "need" "network" "new" "node" "note"
## [136] "now" "nphard" "number" "oper" "optim"
## [141] "order" "origin" "flow" "pair" "particular"
## [146] "path" "pattern" "perform" "pmml" "point"
## [151] "polynomi" "popul" "posit" "possibl" "prioriti"
## [156] "probabl" "problem" "program" "proof" "random"
## [161] "rank" "rattl" "recurr" "recurs" "reduct"
## [166] "repres" "requir" "result" "return" "right"
## [171] "root" "first" "run" "search" "see"
## [176] "sequenc" "set" "shortest" "show" "simpl"
## [181] "sinc" "singl" "size" "small" "smallest"
## [186] "solut" "solv" "sort" "span" "start"
## [191] "step" "strong" "structur" "subset" "sum"
## [196] "suppos" "system" "tabl" "take" "target"
## [201] "term" "theorem" "three" "thus" "time"
## [206] "total" "transform" "tree" "true" "two"
## [211] "use" "valu" "variabl" "vector" "vertex"
## [216] "vertic" "want" "way" "weight" "whether"
## [221] "will" "word" "work" "worst"
Find the terms associated with a given term, here "data":
# findAssocs(myCorpusDTM, "data", corlimit=0.6)
freq <- sort(colSums(as.matrix(myCorpusDTM)), decreasing=TRUE)
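The line above sums each term's column over all documents and sorts the totals. A small sketch on a hand-made matrix shows the effect; the matrix values and the threshold of 2 are invented, and the final filter is only comparable in spirit to the lowfreq argument used earlier, not a copy of how tm implements it.

```r
# Hand-made document term matrix: rows are documents, columns are terms.
m <- matrix(c(2, 0, 1,
              0, 1, 1,
              3, 0, 0),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("doc1", "doc2", "doc3"),
                            c("data", "graph", "model")))

# Total count of each term across all documents, largest first.
freq <- sort(colSums(m), decreasing = TRUE)
print(freq)

# Keep only the terms whose total count reaches a hypothetical threshold.
frequent <- names(freq)[freq >= 2]
print(frequent)
```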
Now convert myCorpusDTM to a matrix:
fmatrixtdm <- as.matrix(myCorpusDTM)
Write the matrix to a CSV file:
# write.csv(fmatrixtdm,file = "myCorpusDTM.csv")
wordcloud(names(freq), freq, min.freq=100, colors=brewer.pal(6, "Dark2"))
wordcloud(names(freq), freq, scale=c(6,0.7), max.words=150, random.order=FALSE, rot.per=0.35, colors=brewer.pal(8, "Dark2"))