We first read the text file and tokenize each line (one document per line) to extract the tokens. To simplify the treatment, I ignore the token frequencies.
Set ndoc to -1 to use the full data set.
ndoc = -1
raw <- scan(file = "input.txt", what = list(x = ""), sep = "\n", nlines = ndoc)$x
bow <- sapply(raw, function(x) {
  unique(unlist(strsplit(x, split = "\\W+")))
})
ndoc = length(bow) ## need to update this in case it was set to -1
names(bow) <- paste("doc", 1:length(bow), sep = "")
rm(raw)
We extract the vocabulary as the list of unique tokens.
vocabulary <- sort(unique(unlist(bow)))
We index the documents. Each document in bow (which stands for bag-of-words) is represented by a list of indices of the terms present in the document; an index refers to the position of the token in the vocabulary.
bow <- sapply(bow, function(x) {
  factor(x, levels = vocabulary)
})
labels = rep(c("POS", "NEG"), each = ndoc/2)
nlabels = table(labels)
# Use this code instead to read the labels from file:
# labels = scan('labels.txt', nlines = ndoc)
# labels[labels == 1] = 'POS'
# labels[labels == 0] = 'NEG'
We will use a simple Naive Bayes classifier. See [http://en.wikipedia.org/wiki/Naive_Bayes_classifier]. This is a particularly simple model, but there are good theoretical reasons to use it: see Domingos, Pedro; Pazzani, Michael (1997). “On the optimality of the simple Bayesian classifier under zero-one loss”. Machine Learning 29: 103-137, for a good explanation of why. The Naive Bayes approach is also a good baseline.
We first compute the frequency of the tokens in each class. We add Laplace smoothing to avoid zero probabilities.
freqs = tapply(bow, labels, function(x) {
  table(unlist(x))
})
## Laplace smoothing: add 1 to each count, then divide each class column by
## (1 + number of documents in that class)
freqs = sapply(freqs, "+", 1)
freqs = sweep(freqs, 2, 1 + as.numeric(nlabels[colnames(freqs)]), "/")
We compute the probability that a document belongs to each class. We work in log space to avoid numerical underflow.
class.prior = log(nlabels/sum(nlabels))
log.odds <- function(doc) {
  ## binary (presence/absence) features over the vocabulary
  index <- rep(FALSE, length(vocabulary))
  index[as.numeric(doc)] <- TRUE
  ll = apply(freqs, 2, function(x) {
    sum(log(ifelse(index, x, 1 - x)))
  })
  ll = ll + class.prior[names(ll)]
  ll["POS"] - ll["NEG"]
}
pos.log.odds <- sapply(bow, log.odds)
We check graphically that the results make sense:
boxplot(pos.log.odds ~ labels)
We can recover the probability that a document belongs to the “POS” class as follows:
p = 1/(1/exp(pos.log.odds) + 1)
summary(p)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 0.424 0.499 1.000 1.000
These probabilities reflect the confidence with which a document is assigned to a class. Because of the Naive Bayes independence assumptions, however, they tend to be pushed towards 0 or 1 and should not be over-interpreted as calibrated probabilities; the classification itself remains reasonable (see the Domingos and Pazzani paper cited above).
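A quick way to check this overconfidence on the training data (a small sketch using the p computed above):
mean(p < 0.01 | p > 0.99)  ## fraction of documents with near-certain class probabilities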
I am not demonstrating any testing here for lack of time; instead, I summarize how I would do it. The data set is small, so we can consider Jackknife resampling [http://en.wikipedia.org/wiki/Jackknife_resampling] or the Bootstrap [http://en.wikipedia.org/wiki/Bootstrap_(statistics)]. Another possibility is cross-validation (sketched below). The objective of these methods is to verify that we are not overfitting the training data, which would lead to poor generalization on new documents.
A possible implementation of these methods in R can be found here: [http://www.r-bloggers.com/functional-programming-in-r/] under the Closures heading.
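Below is a minimal k-fold cross-validation sketch. The helper names (train.nb, nb.log.odds, cv.accuracy) are assumptions introduced here only for illustration; they simply re-run the estimation steps above on the training folds.
train.nb <- function(docs, labs) {
  n <- table(labs)
  f <- tapply(docs, labs, function(x) {
    table(unlist(x))
  })
  ## same Laplace-smoothed per-class frequencies as above
  f <- sapply(f, "+", 1)
  f <- sweep(f, 2, 1 + as.numeric(n[colnames(f)]), "/")
  list(freqs = f, prior = log(n/sum(n)))
}
nb.log.odds <- function(doc, model) {
  index <- rep(FALSE, length(vocabulary))
  index[as.numeric(doc)] <- TRUE
  ll <- apply(model$freqs, 2, function(x) sum(log(ifelse(index, x, 1 - x))))
  ll <- ll + model$prior[names(ll)]
  ll["POS"] - ll["NEG"]
}
cv.accuracy <- function(k = 10) {
  folds <- sample(rep(1:k, length.out = ndoc))
  mean(sapply(1:k, function(i) {
    model <- train.nb(bow[folds != i], labels[folds != i])
    odds <- sapply(bow[folds == i], nb.log.odds, model = model)
    mean(ifelse(odds > 0, "POS", "NEG") == labels[folds == i])
  }))
}
cv.accuracy()
Setting k = ndoc gives the leave-one-out (Jackknife) variant.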
I did not attempt to reduce the vocabulary, although many terms are probably not useful for classification. For example, the English definite article “the” can safely be removed. A list of stopwords can be found here: [http://www.textfixer.com/resources/common-english-words.txt]. A small sketch of this filtering follows.
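A minimal sketch, assuming the list from the link above has been saved locally as common-english-words.txt (a comma-separated list; adjust sep if your copy has one word per line). It should be applied before recomputing freqs.
stopwords <- scan("common-english-words.txt", what = "", sep = ",")
vocabulary <- setdiff(vocabulary, stopwords)
bow <- sapply(bow, function(doc) {
  tokens <- as.character(doc)
  factor(tokens[!tokens %in% stopwords], levels = vocabulary)  ## drop stopwords, re-index
})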
I did not use the frequency of the tokens in a document. This would be a straightforward extension of the Naive Bayes model; another possibility is to model each document as a multinomial distribution over the vocabulary. Another avenue is to weight tokens by tf-idf [http://en.wikipedia.org/wiki/Tf-idf]. While it is a good strategy for ranking, it is not clear at all that it would help in the classification context (see the Domingos and Pazzani paper cited above).
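As a rough illustration (not used in the classifier above): since token frequencies were discarded during tokenization, tf is implicitly binary here, so a tf-idf weight reduces to the idf term, which can already be computed from bow.
df <- table(unlist(bow))  ## document frequency: number of documents containing each term
idf <- log(ndoc/as.numeric(df))
names(idf) <- names(df)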
Instead of selecting the terms a priori, we can use a method like LSI [http://en.wikipedia.org/wiki/Latent_semantic_indexing] to project the documents into a lower-dimensional subspace where, hopefully, the noise is reduced and, as a consequence, the classification is improved. If this method is pursued, it might be wise to add the documents of another collection to improve the lower-dimensional representation.
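A minimal sketch of this idea using a plain SVD in base R; the binary term-document matrix and the choice k = 50 are assumptions for illustration, and a dense SVD is only practical for a modest vocabulary.
k <- 50
tdm <- sapply(bow, function(doc) {
  v <- rep(0, length(vocabulary))
  v[as.numeric(doc)] <- 1  ## binary occurrence matrix: terms in rows, documents in columns
  v
})
dec <- svd(tdm, nu = k, nv = k)
docs.lsi <- dec$v %*% diag(dec$d[1:k])  ## documents projected onto the k latent dimensions
A classifier (Naive Bayes as above, or for example logistic regression) could then be trained on docs.lsi together with labels.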
Finally, some significantly more sophisticated methods can be used to cluster tokens into topics and either replace or complement the document representation with these topics. An example of such a method is LDA [http://en.wikipedia.org/wiki/Latent_Dirichlet_allocation].