library(tm)
library(SnowballC)
?mean
# named book_source so that base R's source() function is not shadowed
book_source <- DirSource("../../data/vonnegut_books")
vonnegut_books <- VCorpus(x = book_source,
                          readerControl =
                            list(reader = readPDF(),
                                 language = "eng"))
str(vonnegut_books) # output can be huge
str(vonnegut_books[[1]][["content"]])
str(vonnegut_books[[1]]$content)
vonnegut_books
length(vonnegut_books)
class(vonnegut_books)
typeof(vonnegut_books)
The tm package includes functions for frequently used text transformations.
E.g.: eliminating extra whitespace, then converting to lower case:
vonnegut_books <- tm_map(vonnegut_books, stripWhitespace)
vonnegut_books <- tm_map(vonnegut_books,
                         content_transformer(tolower))
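The same two transformations can be tried on a tiny in-memory corpus, which makes the effect easy to see. This is a minimal sketch: the corpus `toy` and its two strings are made up for illustration and are not part of the Vonnegut data.

```r
library(tm)

# hypothetical two-document corpus built from a character vector
toy <- VCorpus(VectorSource(c("So   it  GOES.", "Poo-tee-weet?")))

# collapse runs of whitespace into a single blank
toy <- tm_map(toy, stripWhitespace)
# tolower is a plain base R function, so it is wrapped in content_transformer()
toy <- tm_map(toy, content_transformer(tolower))

toy[[1]]$content   # "so it goes."
```

Functions that are not tm transformations, such as `tolower`, are wrapped in `content_transformer()` so that the result stays a text document instead of a bare character vector.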
Document-level metadata pertains to single documents. Not only the values but even the set of metadata names can differ between documents. E.g. one document can have a “rating” entry while the others do not.
meta(vonnegut_books[[2]], "opinion") <- "Very good"
meta(vonnegut_books[[2]])
## author : Andrew Conley
## datetimestamp: 2002-07-28 11:21:06
## description : character(0)
## heading : Microsoft Word - Document1
## id : Kurt_Vonnegut-Slaughterhouse_Five.pdf
## language : eng
## origin : PScript5.dll Version 5.2
## opinion : Very good
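To see that the set of metadata names really can differ per document, the same idea can be sketched on a small made-up corpus. The corpus `toy` and the tag name `rating` are assumptions chosen for illustration.

```r
library(tm)

# hypothetical corpus with two short documents
toy <- VCorpus(VectorSource(c("first text", "second text")))

# add a "rating" entry to the second document only
meta(toy[[2]], "rating") <- 5

"rating" %in% names(meta(toy[[1]]))   # FALSE
"rating" %in% names(meta(toy[[2]]))   # TRUE
meta(toy[[2]], "rating")              # 5
```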
docTermMx <- DocumentTermMatrix(vonnegut_books)
inspect(docTermMx[,1000:1001])
## <<DocumentTermMatrix (documents: 3, terms: 2)>>
## Non-/sparse entries: 2/4
## Sparsity : 67%
## Maximal term length: 7
## Weighting : term frequency (tf)
##
## Terms
## Docs 'you're 'you've
## Kurt_Vonnegut-Bluebeard.pdf 0 0
## Kurt_Vonnegut-Slaughterhouse_Five.pdf 7 1
## Kurt_Vonnegut-The_Sirens_of_Titan.pdf 0 0
findFreqTerms(docTermMx, 50)
findAssocs(docTermMx, "time", 0.95)
docTermMxReduced <- removeSparseTerms(docTermMx, 0.4)
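What the thresholds mean is easier to follow on a small example. In the sketch below (the three toy sentences are made up), `findFreqTerms(x, 2)` returns the terms whose total count is at least 2, and `removeSparseTerms(x, 0.4)` keeps only the terms whose share of documents with a zero count is below 0.4. With three documents that means a term has to occur in at least two of them.

```r
library(tm)

# hypothetical corpus: "time" occurs in documents 1 and 2, "tide" in 1 and 3
toy <- VCorpus(VectorSource(c("time time tide",
                              "time flies",
                              "tide turns slowly")))
toyDtm <- DocumentTermMatrix(toy)

# terms with a total frequency of at least 2
frequent <- findFreqTerms(toyDtm, 2)

# drop terms that are missing from 40% or more of the documents
toyReduced <- removeSparseTerms(toyDtm, 0.4)

frequent            # "tide" and "time"
Terms(toyReduced)   # "tide" and "time"
```

`findAssocs()` works in a similar spirit: it lists the terms whose correlation with the given term reaches the given limit.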
Dictionary: a set of terms; when one is supplied, only those terms are counted
Creating a document-term matrix from a dictionary:
dictionary <- c("she", "you")
docTermMxWithDictionary <- DocumentTermMatrix(vonnegut_books,
                                              control = list(dictionary = dictionary))
as.matrix(docTermMxWithDictionary)
## Terms
## Docs she you
## Kurt_Vonnegut-Bluebeard.pdf 804 414
## Kurt_Vonnegut-Slaughterhouse_Five.pdf 198 141
## Kurt_Vonnegut-The_Sirens_of_Titan.pdf 196 507
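The counts can be checked by hand on a corpus small enough to read. A minimal sketch, assuming two made-up sentences:

```r
library(tm)

# hypothetical corpus with two short documents
toy <- VCorpus(VectorSource(c("she said you know",
                              "you and you")))

# only the dictionary terms become columns; all other words are ignored
toyDtm <- DocumentTermMatrix(toy,
                             control = list(dictionary = c("she", "you")))
m <- as.matrix(toyDtm)

m[1, "she"]   # 1
m[2, "she"]   # 0
m[2, "you"]   # 2
```

A dictionary term that does not occur in a document still gets a column, with a count of 0.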
Data frame: the most frequently used data structure in R
Not a base type; stored as a list of equal-length columns
Matrix-like object, but its columns can have different types
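These properties can be verified in base R. The sketch below uses a made-up data frame `df`; its values are illustrative only.

```r
# hypothetical data frame with one character and one integer column
df <- data.frame(term  = c("time", "war"),
                 count = c(3L, 1L),
                 stringsAsFactors = FALSE)

class(df)          # "data.frame"
typeof(df)         # "list": the columns are the list elements
sapply(df, class)  # term: "character", count: "integer"
dim(df)            # 2 2, indexed like a matrix
df[2, "count"]     # 1
```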
Converting a document-term matrix to a data frame:
docTermDf <- as.data.frame(as.matrix(docTermMxReduced))
ncol(docTermDf) # number of columns
nrow(docTermDf) # number of rows
names(docTermDf) # names of columns
head(docTermDf) # names of columns and first rows
# which row pertains to Slaughterhouse Five?
rownames(docTermDf)
## [1] "Kurt_Vonnegut-Bluebeard.pdf"
## [2] "Kurt_Vonnegut-Slaughterhouse_Five.pdf"
## [3] "Kurt_Vonnegut-The_Sirens_of_Titan.pdf"
# selecting the row and converting to numeric vector
tmp <- as.numeric(docTermDf[2, ])
plot(density(tmp), xlab="Frequency")
barplot(docTermDf[,"time"], col=rainbow(3))
legend("topright", rownames(docTermDf), fill=rainbow(3))
Tip: collect the final R commands in a text file so that the analysis can be reproduced later.
R commands can be executed from a text file with the source() function. E.g.:
source("file_path/to_source_file/solutions.r")
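To try this without an existing script, a throwaway file can be written to a temporary location first. A minimal sketch in base R; the script contents and the variable names are made up for illustration.

```r
# write a hypothetical two-line script to a temporary file
script_path <- tempfile(fileext = ".R")
writeLines(c("x <- 1:10",
             "x_mean <- mean(x)"),
           script_path)

# run the script; by default the objects it creates land in the global environment
source(script_path)

x_mean   # 5.5
```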