During the class we are going to cover a few basic text mining tools in R. Let's create a directory somewhere on our machine that we will use for our work. Enter the newly created directory and start R from a terminal. (If you are using Windows, simply start R and set the working directory of the R session to the newly created directory.)
During the class we are going to rely mostly on the tm package, but we are going to need a few additional packages too. We can load the necessary packages with the following commands:
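(tm is required throughout the class; the exact set of additional packages depends on the exercises, so SnowballC is listed below only as an assumption, because the stemming exercise later relies on the Porter stemmer.)

library(tm)        # the text mining framework used throughout the class
library(SnowballC) # assumption: provides the Porter stemmer used by stemDocument()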
(If one of the packages is not installed, it has to be installed first with the install.packages command, e.g. install.packages("tm", dependencies=TRUE). In order to install new packages, R might need to be started with sudo privileges.)
Note: sometimes when PDF files are read by the functions of the tm package, an error occurs claiming that the pdftotext command cannot be found. To fix this, we have to ensure that the Xpdf program is installed. (On Windows, adding the program to the PATH environment variable is also necessary if it is not done automatically.)
Tip: R has a built-in help facility similar to the man command in UNIX. E.g. to get help for the mean function, the following command can be used:
?mean

The main structure for managing documents in the tm package is called a Corpus, which represents a collection of text documents. A corpus is an abstract concept with several implementations. During the class we are going to use the VCorpus (Volatile Corpus) type, which holds the corpus data fully in memory: once the corresponding R objects are deleted, the corpus data is no longer available.
VCorpus variables can be created with the VCorpus(x, readerControl) constructor function, where x denotes a Source variable specifying the location of the input files. There are several types of Source variables; we are going to use the DirSource type, which can be used to specify a directory on the local file system that contains the input files.
The second argument of the VCorpus constructor function is the readerControl argument, which controls the data import. The value of the argument is a list, which has two elements in our setup. The reader element of the list is a function that converts the input files into text; since we use PDF files, we are going to use the readPDF() function of the tm package. The other element of the list is named language and it specifies the language of the imported text. We set this element to "eng".
Now we can import the data with the following commands (make sure the path to the directory containing the input files is set correctly):
source <- DirSource("../../data/vonnegut_books")
vonnegut_books <- VCorpus(x = source, readerControl = list(reader = readPDF(), language = "eng"))

We imported the data into the vonnegut_books variable; however, there was no output after the import, so we do not know what is in that variable. str is a useful function for quickly viewing the contents of variables:
str(vonnegut_books) # the output can be huge if there are many input documents

It can be seen that the variable is a list with 3 elements. All the elements are nested lists with 2 elements called content and meta. The content element contains the text of the input files, while the meta element contains the metadata of the inputs. We can select an element of a list with the [[index]] operator, where index is a positive integer or a character string specifying the name of the list element to be selected. Another way to select a list element by its name is to use the $ operator. Therefore, to get information about the first of our imported books, we can use the following commands:
str(vonnegut_books[[1]][["content"]])
str(vonnegut_books[[1]]$content)

Here are a few additional functions to get various information about variables:
vonnegut_books
length(vonnegut_books)
class(vonnegut_books)
typeof(vonnegut_books)

Transformations are done via the tm_map() function in the tm package. The function executes the specified transformation on every element of the corpus, which consists of 3 books in our case. The function is mostly used with two arguments: the first argument specifies the corpus and the second specifies the transformation.
The tm package contains built-in functions for many frequently used text transformations, e.g.:
vonnegut_books <- tm_map(vonnegut_books, stripWhitespace)
vonnegut_books <- tm_map(vonnegut_books, content_transformer(tolower))

The transformation argument can be any function that returns a TextDocument variable. The content_transformer function is a wrapper that enables many ordinary text manipulation functions to be used as the transformation argument of tm_map.
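To illustrate the wrapper, here is a minimal sketch of a custom transformation; the removeDigits helper is hypothetical and defined only for this example:

removeDigits <- function(x) gsub("[0-9]+", " ", x) # hypothetical helper: replace digit runs with a space
vonnegut_books <- tm_map(vonnegut_books, content_transformer(removeDigits))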
As we saw earlier the input documents contain meta data. Every document has its own metadata set, but there is also a corpus-level metadata set. The corpus-level metadata can contain two different types of metadata: there are metadata that contain a single value pertaining to the whole corpus, and there are also metadata that contain a vector of different values pertaining to the individual documents in the corpus.
The metadata of a corpus can be accessed with the list selection operators we saw earlier, but we can also use the meta() function.
The following code snippet shows the usage of metadata.
# 1. Metadata of single documents
# the metadata set of the second document:
meta(vonnegut_books[[2]])
# creating a new metadata entry (or updating an existing one) for the second document:
# we create a new metadata entry named "opinion" with the value "Very good"
meta(vonnegut_books[[2]], "opinion") <- "Very good"
meta(vonnegut_books[[2]])
# 2. Corpus-level metadata
# 2.1. Corpus-level metadata with different values for each document
meta(vonnegut_books)
meta(vonnegut_books, "rating") <- c(10,8,9)
meta(vonnegut_books)
# 2.2 Corpus-level metadata with a single value
meta(vonnegut_books, type = "corpus")
meta(vonnegut_books, type = "corpus", "comment") <- "Favourite Vonnegut books"
meta(vonnegut_books, type = "corpus")

When dealing with text mining tasks, we often need to create a document-term matrix. In the tm package this is done via the DocumentTermMatrix function, and the matrix created by that function can be examined with the inspect function:
docTermMx <- DocumentTermMatrix(vonnegut_books)
docTermMx
inspect(docTermMx[,1000:1010])

There are a few functions in the tm package for executing various operations on document-term matrices, e.g.:
findFreqTerms(docTermMx, 50) # terms occurring at least 50 times
findAssocs(docTermMx, "time", 0.95) # terms whose correlation with "time" is at least 0.95
inspect(removeSparseTerms(docTermMx, 0.4)[,1000:1010]) # drop terms with sparsity above 0.4
docTermMxReduced <- removeSparseTerms(docTermMx, 0.4)

A dictionary is a (multi-)set of strings. It is often used to denote relevant terms in text mining. When we create the document-term matrix we can specify a dictionary to restrict the set of terms to be used in the matrix, i.e. only terms from the dictionary will appear in the matrix:
dictionary <- c("she", "you")
docTermMxWithDictionary <- DocumentTermMatrix(vonnegut_books, list(dictionary = dictionary))
inspect(docTermMxWithDictionary)

The class of our document-term matrix variable is DocumentTermMatrix, which is defined in the tm package. This class information is used by the tm package and makes it easy to perform certain operations on the matrices; however, it also makes other basic R operations much more difficult to perform. To overcome this issue, it is sometimes worth converting our DocumentTermMatrix variable into a data frame in which the rows represent the documents and the columns represent the terms. The following code snippet creates a data frame from the document-term matrix and displays a few properties of the data frame:
docTermDf <- as.data.frame(as.matrix(docTermMxReduced))
ncol(docTermDf) # number of columns
nrow(docTermDf) # number of rows
names(docTermDf) # names of columns
head(docTermDf) # names of columns together with the first few rows

Since we now have our document-term matrix available as a data frame, we can easily perform various analyses on it.
# find out which row contains the data of the book Slaughterhouse-Five
rownames(docTermDf) # it can be seen from the output that it is the second row
tmp <- as.numeric(docTermDf[2, ]) # selecting the second row and converting it to numbers
plot(density(tmp), xlab="Frequency")

It can be seen that a few terms occur very frequently, while most terms occur only a few times. Let's look at a few descriptive statistics of the distribution. By using the summary function we can get the mean, median, minimum, maximum and quartiles of a numeric vector.
summary(tmp)

The most frequent term occurred 2770 times, while the mean frequency of the collected terms is approximately 5.04. Let's also compare the frequency of the term "time" across the three books with a bar plot:
barplot(docTermDf[,"time"], col=rainbow(3))
legend("topright", rownames(docTermDf), fill=rainbow(3))Here we have our data in one csv file in which every row has one element: a tweet. We can import the data with the following code:
tweetsDf <- read.csv("../../data/tweets/tweets.csv",
                     stringsAsFactors=FALSE, header=TRUE,
                     quote="", colClasses=c("character"),
                     sep="\n") # sep="\n" makes each line a single field
tweetsDf$text <- iconv(tweetsDf$text, "WINDOWS-1252", "UTF-8") # convert the text to UTF-8 encoding
sourceTwitter <- DataframeSource(tweetsDf)
tweets <- VCorpus(sourceTwitter, readerControl = list(language = "eng"))

How many documents are in the corpus?
Convert the documents of the corpus to lower case.
Perform stemming on the documents.
Help: tm_map and stemDocument
Note: the applied stemming algorithm is the Porter stemming algorithm; for more information see http://snowball.tartarus.org/algorithms/porter/stemmer.html.
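A minimal solution sketch for the lower-casing and stemming tasks above, assuming the tweets corpus created earlier:

tweets <- tm_map(tweets, content_transformer(tolower)) # lower-case the documents
tweets <- tm_map(tweets, stemDocument)                 # apply the Porter stemmer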
Remove the stopwords from the documents!
Help: tm_map and removeWords and stopwords("english"). The third argument of the tm_map function is automatically passed on to the function specified in the second argument when it is invoked.
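As a sketch of the pass-through described above, the stopword list is handed to removeWords through tm_map's third argument:

tweets <- tm_map(tweets, removeWords, stopwords("english"))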
Remove extra whitespace from the documents!
Create a document-term matrix for the corpus! How many terms are in the matrix?
Create another document-term matrix for the corpus containing only the terms that occur at least 20 times in the corpus! How many terms are in the matrix?
Help: create a dictionary with the findFreqTerms() function and then use the dictionary to build the document-term matrix
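Following the hint, a sketch under the assumption that docTermMxTweets is the matrix created in the earlier task (the variable names here are illustrative only):

dictionary <- findFreqTerms(docTermMxTweets, 20) # terms occurring at least 20 times
docTermMxFreq <- DocumentTermMatrix(tweets, list(dictionary = dictionary))
docTermMxFreq # prints, among other things, the number of terms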
Create a data frame from the document-term matrix of the previous task.
Plot the distribution of the sum of term occurrences for each document, taking into account only the terms of the document-term matrix.
Help: use the rowSums and plot(density(...)) functions. Create a new column in the data frame containing the sum of the frequencies of the terms of the document-term matrix. A new column can be created by a simple value assignment like data_frame_variable$new_column_name <- vector_containing_the_values_of_the_new_column
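Following the hint, a sketch assuming tweetsTermDf is the data frame created in the previous task (the names are illustrative only):

tweetsTermDf$termSum <- rowSums(tweetsTermDf) # new column: per-document sum of term frequencies
plot(density(tweetsTermDf$termSum))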
R tutorial: http://cran.r-project.org/doc/manuals/r-release/R-intro.html
tm package: http://cran.r-project.org/web/packages/tm/index.html
Introduction to the tm package (this is also the basis of this class): http://cran.r-project.org/web/packages/tm/vignettes/tm.pdf
R tools for NLP: http://cran.r-project.org/web/views/NaturalLanguageProcessing.html
The collection of tweets was done via the following code snippet. It uses the Twitter API, so authentication has to be performed to execute the API calls. To perform such authentication, one has to register a Twitter account and a Twitter app.
library(twitteR)
api_key <- "OWN API KEY"
api_secret <- "OWN API SECRET"
access_token <- "OWN ACCESS TOKEN"
access_token_secret <- "OWN ACCESS TOKEN SECRET"
setup_twitter_oauth(api_key, api_secret, access_token, access_token_secret)
tweetsFromTwitter <- userTimeline("rbloggers", n=10000)
tweetsFromTwitter <- c(tweetsFromTwitter, userTimeline("rdatamining", n=10000))
tweetsFromTwitter <- c(tweetsFromTwitter, userTimeline("rogerfederer", n=10000))
tweetsFromTwitter <- c(tweetsFromTwitter, userTimeline("humansofny", n=10000))
tweetsFromTwitter <- c(tweetsFromTwitter, userTimeline("taylorswift13", n=10000))
tweetsFromTwitterDf <- do.call("rbind", lapply(tweetsFromTwitter, as.data.frame))
tweetsFromTwitterDf$text <- gsub("\n", " ", tweetsFromTwitterDf$text) # replace newlines inside tweets with spaces
write.table(
  iconv(tweetsFromTwitterDf$text, "WINDOWS-1252", "UTF-8"),
  "../../data/tweets/tweets.csv",
  row.names=FALSE,
  col.names="text",
  quote=FALSE)