Text Mining Class

About R

Functional programming language and environment
Originally for statistical computations and graphical visualizations
Easily extensible, tons of packages
Command line interface
Free. Can be downloaded from *http://www.r-project.org/*

Starting R

Create a working directory
Start a terminal, change to the working directory and start R
Once R has started, load the necessary packages:

library(tm)
library(SnowballC)

We will mostly use the tm package
Help: e.g. ?mean

?mean

Data Import

tm works on Corpus variables
- Corpus is an abstract base class
- A Corpus consists of documents
- Documents ~ entities (here: Vonnegut books)
Constructor e.g.: VCorpus(x, readerControl)
The books are stored as PDF files, so the Corpus can be created like this:

# named book_source so that the variable does not mask base R's source()
book_source <- DirSource("../../data/vonnegut_books")
vonnegut_books <- VCorpus(x = book_source,
        readerControl =
                list(reader = readPDF(),
                language = "eng"))

If we want to use sources other than local directories, the getSources() function lists the available Source types.
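
As a minimal sketch of a non-directory source, a corpus can also be built from an in-memory character vector with VectorSource. The toy sentences and the names toy_texts and toy_corpus are made up for illustration; they are not part of the class data.

```r
library(tm)

# List the Source types that tm knows about
available_sources <- getSources()
available_sources

# Hypothetical toy texts standing in for real documents
toy_texts <- c("So it goes.", "Everything was beautiful and nothing hurt.")

# VectorSource treats every element of the vector as one document
toy_corpus <- VCorpus(VectorSource(toy_texts))
length(toy_corpus)
```

A small corpus like this is convenient for trying out commands before running them on whole books.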

Viewing the Data

What was imported?
The variable contains 3 whole books, which is a huge amount of data
A few ways to explore the data:

str(vonnegut_books) # output can be huge
str(vonnegut_books[[1]][["content"]])
str(vonnegut_books[[1]]$content)
vonnegut_books
length(vonnegut_books)
class(vonnegut_books)
typeof(vonnegut_books)

Additionally: the inspect function of the tm package
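
A hedged sketch of inspect on a tiny corpus, where the output stays readable. The two-document corpus is an assumption made for illustration; on the book corpus the same calls print far more text.

```r
library(tm)

# Hypothetical two-document corpus, used only to keep the output small
toy_corpus <- VCorpus(VectorSource(c("So it goes.", "Poo-tee-weet?")))

inspect(toy_corpus)        # overview of the whole corpus
inspect(toy_corpus[[1]])   # a single document, including its text

# content() returns the raw text of a document as a character vector
first_text <- content(toy_corpus[[1]])
first_text
```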

Data Transformation

Transformations are applied with the tm_map function, which takes two main arguments:
1. corpus
2. transformation function
The tm package includes functions for frequently used text transformations
E.g.: eliminating extra whitespace and then converting to lower case

vonnegut_books <- tm_map(vonnegut_books, stripWhitespace)
vonnegut_books <- tm_map(vonnegut_books,
                       content_transformer(tolower))
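
The same pattern extends to other transformations. The sketch below chains a few built-in ones on a toy corpus and adds a custom one; replace_pattern is a hypothetical helper defined here, not a tm function, and the texts are made up.

```r
library(tm)

toy_corpus <- VCorpus(VectorSource(c("So  it   GOES.", "Poo-tee-weet?")))

# Built-in transformations shipped with tm
toy_corpus <- tm_map(toy_corpus, stripWhitespace)
toy_corpus <- tm_map(toy_corpus, content_transformer(tolower))
toy_corpus <- tm_map(toy_corpus, removePunctuation)

# Hypothetical custom transformation: content_transformer() wraps an
# ordinary character function so that it operates on document content
replace_pattern <- content_transformer(
  function(x, pattern, replacement) gsub(pattern, replacement, x)
)
# Extra arguments of tm_map are passed on to the transformation
toy_corpus <- tm_map(toy_corpus, replace_pattern, "goes", "went")

cleaned <- unname(sapply(toy_corpus, as.character))
cleaned
```

Functions that work on plain character vectors, such as tolower or gsub, need the content_transformer wrapper; functions written for tm documents, such as stripWhitespace, do not.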

Metadata I.

Different types of metadata exist in the tm package
Types:

Document-level metadata: pertains to single documents. Not just the values but even the set of metadata names can differ between documents. E.g. one of the documents can have a “rating” metadata entry while the others do not
Corpus-level metadata
- Metadata that can have different values for each document.
- Metadata containing a single value that pertains to the whole corpus. Name-value pairs.
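
A minimal sketch of the three kinds of metadata on a toy corpus, assuming the type argument of meta as provided by recent tm versions ("indexed" for one value per document, "corpus" for a single corpus-wide value). The tag names rating and collection and the toy texts are made up for illustration.

```r
library(tm)

toy_corpus <- VCorpus(VectorSource(c("So it goes.", "Poo-tee-weet?")))

# Document-level metadata: attached to one document only
meta(toy_corpus[[1]], "opinion") <- "Very good"

# Corpus-level metadata with one value per document
meta(toy_corpus, "rating", type = "indexed") <- c(5, 4)

# Corpus-level metadata with a single value for the whole corpus
meta(toy_corpus, "collection", type = "corpus") <- "toy examples"

meta(toy_corpus[[1]], "opinion")
meta(toy_corpus, type = "indexed")
meta(toy_corpus, "collection", type = "corpus")
```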

Metadata II.

Handling metadata: via the meta function
E.g.: the document-level metadata of the second document:

meta(vonnegut_books[[2]], "opinion") <-
        "Very good"
meta(vonnegut_books[[2]])

##   author       : Andrew Conley
##   datetimestamp: 2002-07-28 11:21:06
##   description  : character(0)
##   heading      : Microsoft Word - Document1
##   id           : Kurt_Vonnegut-Slaughterhouse_Five.pdf
##   language     : eng
##   origin       : PScript5.dll Version 5.2
##   opinion      : Very good

Document-Term Matrix I. - Creation

docTermMx <- DocumentTermMatrix(vonnegut_books)
inspect(docTermMx[,1000:1001])

## <<DocumentTermMatrix (documents: 3, terms: 2)>>
## Non-/sparse entries: 2/4
## Sparsity           : 67%
## Maximal term length: 7
## Weighting          : term frequency (tf)
## 
##                                        Terms
## Docs                                    'you're 'you've
##   Kurt_Vonnegut-Bluebeard.pdf                 0       0
##   Kurt_Vonnegut-Slaughterhouse_Five.pdf       7       1
##   Kurt_Vonnegut-The_Sirens_of_Titan.pdf       0       0
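
To see the structure of a document-term matrix without the size of the book data, here is a sketch on a made-up two-document corpus: rows are documents, columns are terms, and cells are term counts.

```r
library(tm)

# Hypothetical toy documents; the words are already lower case
toy_corpus <- VCorpus(VectorSource(c("time time billy", "billy pilgrim time")))
toy_dtm <- DocumentTermMatrix(toy_corpus)

Terms(toy_dtm)              # the column names of the matrix
counts <- as.matrix(toy_dtm)
counts                      # dense version; only sensible for small matrices
```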

Document-Term Matrix II. - Operations

Finding terms that occurred at least 50 times:

findFreqTerms(docTermMx, 50)

Finding terms that have a correlation with a specified term (“time”) greater than or equal to 0.95:

findAssocs(docTermMx, "time", 0.95)

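Eliminating rare terms: (removing terms whose sparsity is 0.4 or higher, i.e. keeping terms that occur in more than 60% of the documents; with 3 documents, that means at least 2 of them)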

docTermMxReduced <- removeSparseTerms(docTermMx, 0.4)
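
A small sketch of the frequency and sparsity operations on made-up data. The sparse argument of removeSparseTerms is an upper bound on the share of documents a term may be missing from, so a lower value removes more terms. The toy texts are assumptions for illustration.

```r
library(tm)

# Hypothetical corpus: "time" and "war" appear in 2 of 3 documents,
# "peace" and "planet" in only 1
toy_corpus <- VCorpus(VectorSource(c("time war time",
                                     "time peace",
                                     "war planet")))
toy_dtm <- DocumentTermMatrix(toy_corpus)

# Terms with a total count of at least 2 across all documents
frequent_terms <- findFreqTerms(toy_dtm, 2)
frequent_terms

# Terms missing from 40% or more of the documents are dropped
toy_dtm_reduced <- removeSparseTerms(toy_dtm, 0.4)
Terms(toy_dtm_reduced)
```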

Dictionaries

Dictionary: set of words
Creating document-term matrix from dictionary:

dictionary <- c("she", "you")
docTermMxWithDictionary <- DocumentTermMatrix(vonnegut_books,
                        list(dictionary = dictionary))
as.matrix(docTermMxWithDictionary)

##                                        Terms
## Docs                                    she you
##   Kurt_Vonnegut-Bluebeard.pdf           804 414
##   Kurt_Vonnegut-Slaughterhouse_Five.pdf 198 141
##   Kurt_Vonnegut-The_Sirens_of_Titan.pdf 196 507
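
The same call on a made-up corpus, as a sketch of how the dictionary restricts the columns. Note that by default DocumentTermMatrix only counts words of at least 3 characters, so shorter dictionary entries would additionally need the wordLengths control option.

```r
library(tm)

# Hypothetical toy documents
toy_corpus <- VCorpus(VectorSource(c("she said you", "you and you")))

toy_dictionary <- c("she", "you")
toy_dtm <- DocumentTermMatrix(toy_corpus,
                              control = list(dictionary = toy_dictionary))

# Only the dictionary words become columns; "said" and "and" are ignored
dictionary_counts <- as.matrix(toy_dtm)
dictionary_counts
```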

Converting to Data Frame

Data frame: the most frequently used data structure in R
Not a base type, stored as list
Matrix-like object, however the types of columns can be different
Converting a document-term matrix to a data frame:

docTermDf <- as.data.frame(as.matrix(docTermMxReduced))
ncol(docTermDf)  # number of columns
nrow(docTermDf)  # number of rows
names(docTermDf) # names of columns
head(docTermDf)  # names of columns and first rows
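
The points about data frames can be checked with base R alone. The sketch below uses a made-up count matrix (the file names, terms and the genre column are hypothetical) and shows that a data frame is stored as a list whose columns may have different types.

```r
# Hypothetical term counts for two documents
counts <- matrix(c(3, 0,
                   1, 2),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("book_a.pdf", "book_b.pdf"),
                                 c("time", "war")))

toy_df <- as.data.frame(counts)

# Unlike a matrix, a data frame can mix column types
toy_df$genre <- c("novel", "satire")

column_types <- sapply(toy_df, class)
column_types
is.list(toy_df)   # a data frame is a list of equal-length columns
```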

Simple Operations with Data Frames I.

The distribution of the collected terms in Slaughterhouse Five:

# which row pertains to Slaughterhouse Five?
rownames(docTermDf)

## [1] "Kurt_Vonnegut-Bluebeard.pdf"          
## [2] "Kurt_Vonnegut-Slaughterhouse_Five.pdf"
## [3] "Kurt_Vonnegut-The_Sirens_of_Titan.pdf"

Simple Operations with Data Frames II.

The distribution of the collected terms in Slaughterhouse Five:

# selecting the row and converting to numeric vector
tmp <- as.numeric(docTermDf[2, ])
plot(density(tmp), xlab="Frequency")
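
As a numeric complement to the density plot, the row can also be summarised and sorted. This sketch uses a made-up data frame; unlist is used instead of as.numeric because it keeps the term names attached to the counts.

```r
# Hypothetical term counts for three documents
toy_df <- data.frame(time   = c(10, 40, 5),
                     war    = c(2, 25, 1),
                     planet = c(0, 3, 30),
                     row.names = c("book_a.pdf", "book_b.pdf", "book_c.pdf"))

# Select one row as a named numeric vector
row_counts <- unlist(toy_df["book_b.pdf", ])

summary(row_counts)                            # quartiles and mean of the counts
top_terms <- sort(row_counts, decreasing = TRUE)
head(top_terms, 2)                             # the two most frequent terms
```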

Simple Operations with Data Frames III.

Create a bar plot to represent the frequency of the word “time” in the books of our corpus:

barplot(docTermDf[,"time"], col=rainbow(3))
legend("topright", rownames(docTermDf), fill=rainbow(3))

Tasks

Tip: it is useful to collect the final R commands into a text file to be able to reproduce the analysis later.

R commands can be executed from a text file by the source() function. E.g.:

source("file_path/to_source_file/solutions.r")
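
A self-contained sketch of the same idea: it writes a two-line script to a temporary file and runs it with source(). The script content and the variable name word_count are made up for illustration.

```r
# Write a small, hypothetical analysis script to a temporary file
script_path <- tempfile(fileext = ".R")
writeLines(c("word_count <- 2 + 3",
             "message('analysis finished')"),
           script_path)

# Run every command in the file; objects it creates end up in the workspace
source(script_path)
word_count
```

Calling source() with echo = TRUE additionally prints each command as it runs, which helps when checking a longer script.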

Gergely Pósfai

15-11-2015