About R

Starting R

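library(tm)        # text mining framework
library(SnowballC) # Snowball stemmers, used by tm for stemming
?mean              # open the help page of a function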

Data Import

# named books_source to avoid reusing the name of base R's source() function
books_source <- DirSource("../../data/vonnegut_books")
vonnegut_books <- VCorpus(x = books_source,
        readerControl =
                list(reader = readPDF(),
                language = "eng"))

Viewing the Data

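str(vonnegut_books) # structure of the whole corpus; output can be huge
str(vonnegut_books[[1]][["content"]]) # text of the first document
str(vonnegut_books[[1]]$content)      # same, using $ instead of [[ ]]
vonnegut_books         # short summary of the corpus
length(vonnegut_books) # number of documents
class(vonnegut_books)  # class(es) of the object
typeof(vonnegut_books) # underlying storage type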

Data Transformation

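# collapse runs of whitespace into a single space
vonnegut_books <- tm_map(vonnegut_books, stripWhitespace)
# tolower() is not a tm transformation, so it is wrapped in content_transformer()
vonnegut_books <- tm_map(vonnegut_books,
                       content_transformer(tolower))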

Metadata I.

  1. Document-level metadata: pertains to a single document. Not only the values but also the set of metadata names can differ between documents. E.g. one of the documents can have a “rating” metadata item while the others do not.

  2. Corpus-level metadata
    • Metadata stored with the corpus that can have a different value for each document.
    • Metadata holding a single value that pertains to the whole corpus, stored as name-value pairs.
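
The metadata levels listed above can be tried out on a toy corpus. The following is only a sketch: the two texts and the metadata names ("rating", "year", "collection") are invented for illustration, and it assumes that the tm package is installed.

```r
library(tm)

# Toy corpus built from two in-memory strings (made-up stand-ins for the books)
texts <- c("so it goes", "listen billy pilgrim has come unstuck in time")
corpus <- VCorpus(VectorSource(texts))

# 1. Document-level metadata: only the first document gets a "rating"
meta(corpus[[1]], "rating") <- "Very good"

# 2a. Corpus-level metadata with one value per document
meta(corpus, "year", type = "indexed") <- c(1969, 1959)

# 2b. Corpus-level metadata with a single value for the whole corpus
meta(corpus, "collection", type = "corpus") <- "toy examples"

meta(corpus[[1]], "rating")                # set for the first document
meta(corpus[[2]], "rating")                # NULL: the second document has no rating
meta(corpus, type = "indexed")             # data frame with one row per document
meta(corpus, "collection", type = "corpus")
```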

Metadata II.

meta(vonnegut_books[[2]], "opinion") <-
        "Very good"
meta(vonnegut_books[[2]])
##   author       : Andrew Conley
##   datetimestamp: 2002-07-28 11:21:06
##   description  : character(0)
##   heading      : Microsoft Word - Document1
##   id           : Kurt_Vonnegut-Slaughterhouse_Five.pdf
##   language     : eng
##   origin       : PScript5.dll Version 5.2
##   opinion      : Very good

Document-Term Matrix I. - Creation

docTermMx <- DocumentTermMatrix(vonnegut_books) # rows: documents, columns: terms
inspect(docTermMx[, 1000:1001])                 # show only terms no. 1000 and 1001
## <<DocumentTermMatrix (documents: 3, terms: 2)>>
## Non-/sparse entries: 2/4
## Sparsity           : 67%
## Maximal term length: 7
## Weighting          : term frequency (tf)
## 
##                                        Terms
## Docs                                    'you're 'you've
##   Kurt_Vonnegut-Bluebeard.pdf                 0       0
##   Kurt_Vonnegut-Slaughterhouse_Five.pdf       7       1
##   Kurt_Vonnegut-The_Sirens_of_Titan.pdf       0       0
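
To see what such a matrix contains, the counting can be imitated in base R on two made-up sentences. This is a simplified sketch and not how tm works internally: it splits on whitespace only, whereas DocumentTermMatrix() uses its own tokenisation and, by default, skips terms shorter than three characters. The names docs, tokens, terms and dtm are hypothetical.

```r
# Two toy "documents" (invented text)
docs <- c(doc1 = "so it goes so it goes",
          doc2 = "listen billy pilgrim goes")

# Split each document into lower-case words
tokens <- strsplit(tolower(docs), "[[:space:]]+")

# The vocabulary: every distinct word in the collection
terms <- sort(unique(unlist(tokens)))

# Count each vocabulary word in each document (one column per document)
counts <- vapply(tokens,
                 function(words) as.integer(table(factor(words, levels = terms))),
                 integer(length(terms)))
rownames(counts) <- terms

# Transpose so that rows are documents and columns are terms
dtm <- t(counts)
dtm
```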

Document-Term Matrix II. - Operations

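findFreqTerms(docTermMx, 50)        # terms that occur at least 50 times
findAssocs(docTermMx, "time", 0.95) # terms whose correlation with "time" is at least 0.95
# drop terms that are missing from 40% or more of the documents
docTermMxReduced <- removeSparseTerms(docTermMx, 0.4)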
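
The effect of the sparsity threshold can be illustrated in base R. The sketch below uses an invented 3 x 4 count matrix (the term names and numbers are hypothetical) and mimics the rule that a term is kept only if the share of documents in which it is absent is below the threshold.

```r
# Made-up counts: 3 documents (rows) x 4 terms (columns)
counts <- matrix(c(2, 0, 0,   # "rare": present in 1 of 3 documents
                   1, 3, 0,   # "common": present in 2 of 3 documents
                   4, 1, 2,   # "everywhere": present in all documents
                   0, 0, 5),  # "single": present in 1 of 3 documents
                 nrow = 3,
                 dimnames = list(Docs = c("a", "b", "c"),
                                 Terms = c("rare", "common", "everywhere", "single")))

# Share of documents in which each term does not occur
term_sparsity <- colMeans(counts == 0)

# Keep the terms whose sparsity is below the threshold of 0.4
kept <- counts[, term_sparsity < 0.4, drop = FALSE]
colnames(kept)
```

With three documents and a threshold of 0.4, only terms that occur in at least two documents survive.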

Dictionaries

dictionary <- c("she", "you")
docTermMxWithDictionary <- DocumentTermMatrix(vonnegut_books,
                        control = list(dictionary = dictionary))
as.matrix(docTermMxWithDictionary)
##                                        Terms
## Docs                                    she you
##   Kurt_Vonnegut-Bluebeard.pdf           804 414
##   Kurt_Vonnegut-Slaughterhouse_Five.pdf 198 141
##   Kurt_Vonnegut-The_Sirens_of_Titan.pdf 196 507

Converting to Data Frame

docTermDf <- as.data.frame(as.matrix(docTermMxReduced))
ncol(docTermDf)  # number of columns
nrow(docTermDf)  # number of rows
names(docTermDf) # names of columns
head(docTermDf)  # names of columns and first rows

Simple Operations with Data Frames I.

# which row pertains to Slaughterhouse Five?
rownames(docTermDf)
## [1] "Kurt_Vonnegut-Bluebeard.pdf"          
## [2] "Kurt_Vonnegut-Slaughterhouse_Five.pdf"
## [3] "Kurt_Vonnegut-The_Sirens_of_Titan.pdf"

Simple Operations with Data Frames II.

# select the Slaughterhouse Five row and convert it to a numeric vector
tmp <- as.numeric(docTermDf[2, ])
plot(density(tmp), xlab="Frequency")

Simple Operations with Data Frames III.

# frequency of the term "time" in each book
barplot(docTermDf[, "time"], col = rainbow(3))
legend("topright", rownames(docTermDf), fill = rainbow(3))

Tasks

Tip: collect the final R commands in a text file so that the analysis can be reproduced later.

R commands stored in a text file can be executed with the source() function, e.g.:

source("file_path/to_source_file/solutions.r")
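
A minimal way to try this out without an existing solutions file is sketched below. The script is written to a temporary file first; the file contents and the variable names x and mean_x are made up for illustration.

```r
# Write a tiny two-line script to a temporary file
script_path <- tempfile(fileext = ".R")
writeLines(c("x <- 1:10",
             "mean_x <- mean(x)"),
           script_path)

# Execute the commands stored in the file
source(script_path)

# Objects created by the script are now available in the session
mean_x  # 5.5
```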