Modeling & NLP Test

Preparing the data

We first read the text file, tokenize each line / document to extract the tokens. To simplify the treatment I ignore the token frequency.

Set ndoc to -1 to use the full data set.

ndoc = -1
raw <- scan(file = "input.txt", what = list(x = ""), sep = "\n", nlines = ndoc)$x
bow <- sapply(raw, function(x) {
    unique(unlist(strsplit(x, fixed = F, split = "\\W+")))
})
ndoc = length(bow)  ## need to update this in case it was set to -1
names(bow) <- paste("doc", 1:length(bow), sep = "")
rm(raw)

We extract the vocabulary as the list of unique tokens.

vocabulary <- sort(unique(unlist(bow)))

We index the documents. Each document in bow (stands for bag-of-words) is represented by a list of indices of the terms present in the document. The index refers to the position of the token in the vocabulary

bow <- sapply(bow, function(x) {
    factor(x, levels = vocabulary)
})
labels = rep(c("POS", "NEG"), each = ndoc/2)
nlabels = table(labels)
# Use this code instead to read the labels from file labels =
# scan('labels.txt', nlines=ndoc) labels[labels==1] = 'POS'
# labels[labels==0] = 'NEG'

Building the Classifier

We will use simple a simple Naive Bayes classifier. See [http://en.wikipedia.org/wiki/Naive_Bayes_classifier]. This is a particularly simple model, but there are good theoretical reasons to use it. See Domingos, Pedro; Pazzani, Michael (1997). “On the optimality of the simple Bayesian classifier under zero-one loss”. Machine Learning 29: 103-137, for a good explanation why. The Naive Bayes approach is also a good baseline.

We first compute the frequency of the tokens in each classes. We add Laplacian smoothing to avoid 0 probabilities.

freqs = tapply(bow, labels, function(x) {
    table(unlist(x))
})
freqs = sapply(freqs, "+", 1)/(1 + as.numeric(nlabels[names(freqs)]))

We compute the probability of one document to belong to the class. We work in the logspace to avoid numerical underflows.

class.prior = log(nlabels/sum(nlabels))
log.odds <- function(bow) {
    index <- rep(FALSE, length(vocabulary))
    index[as.numeric(bow)] <- TRUE
    ll = apply(freqs, 2, function(x) {
        sum(log(ifelse(index, x, 1 - x)))
    })
    ll = ll + class.prior[names(ll)]
    odd = ll["POS"] - ll["NEG"]
}
pos.log.odds <- sapply(bow, log.odds)

We check graphically that the results make sense:

boxplot(pos.log.odds ~ labels)

plot of chunk bp

We can recover the probability that a document belongs to the “POS” class as follows:

p = 1/(1/exp(pos.log.odds) + 1)
summary(p)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   0.424   0.499   1.000   1.000

These probabilities reflects the confidence a document falls into a class. The naive bayes assumptions

Testing

I am not demonstrating any testing here for lack of time. Instead I summarize how I would do it. The data set is small so we can consider using Jackknife resampling [http://en.wikipedia.org/wiki/Jackknife_resampling] or the Bootstrap [http://en.wikipedia.org/wiki/Bootstrap_(statistics)]. Another possibility is to cross-validate. The objective of these methods is to verify we are not overfitting the training data, which would lead to poor generalization on new documents.

A possible implementation of these methods in R can be found here: [http://www.r-bloggers.com/functional-programming-in-r/] under the Closures heading.

Document Representation: Potential Improvement

Other Classifier Methods