INTRODUCTION

This is an exercise in NLP. I have adopted a “literate statistics” approach, which makes all of my code accessible and my results reproducible. Readers uninterested in reproducing the results are invited to skip the code displayed in the several windows below. I start by installing the required R packages and by downloading and reading three of the text files provided for this exercise, namely the ones in English. The same exercise could be carried out, with minor changes to the code, on the files provided in other languages.

for (package in c('knitr', 'tm', 'RWeka', 'stringi', 'stringr', 'ggplot2', 'dplyr', 'wordcloud', 'NLP', 'openNLP', 'qdap')) {
    if (!require(package, character.only=T, quietly=T)) {
        install.packages(package)
        library(package, character.only=T, warn.conflicts=F, verbose=F, quietly=T)
    }
}
opts_chunk$set(echo=TRUE)
set.seed(33)
# define source and target for download
targetFile <- "Coursera-SwiftKey.zip"
sourceFile <- "http://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"

# download source to target (skip if already present); mode = "wb" keeps the zip intact on Windows
if (!file.exists(targetFile)) {
    download.file(sourceFile, targetFile, mode = "wb")
}

# unzip file
unzip(targetFile)

# open connections
twitterFile <- file("./final/en_US/en_US.twitter.txt", "rb")
newsFile <- file("./final/en_US/en_US.news.txt", "rb")
blogsFile <- file("./final/en_US/en_US.blogs.txt", "rb")

# read the files; also read a list of profanities for later removal
twitter_crudo <- readLines(twitterFile, encoding = "UTF-8", skipNul = TRUE)
news_crudo <- readLines(newsFile, encoding = "UTF-8", skipNul = TRUE)
blogs_crudo <- readLines(blogsFile, encoding = "UTF-8", skipNul = TRUE)

profanities <- readLines("http://www.cs.cmu.edu/~biglou/resources/bad-words.txt", encoding = "UTF-8")

The next chunk of code computes basic information about the three selected text files.

News_numWordsPerLine <- stri_count_words(news_crudo)
News_totWords <- sum(News_numWordsPerLine)
News_sizeMb <- file.info("./final/en_US/en_US.news.txt")$size/1024^2
News_totLines <- length(news_crudo)
News_meanWordsPerLine <- mean(News_numWordsPerLine)
News_maxCharsPerLine <- max(nchar(news_crudo))       # longest line, in characters

Blogs_numWordsPerLine <- stri_count_words(blogs_crudo)
Blogs_totWords <- sum(Blogs_numWordsPerLine)
Blogs_sizeMb <- file.info("./final/en_US/en_US.blogs.txt")$size/1024^2
Blogs_totLines <- length(blogs_crudo)
Blogs_meanWordsPerLine <- mean(Blogs_numWordsPerLine)
Blogs_maxCharsPerLine <- max(nchar(blogs_crudo))      # longest line, in characters

Twitter_numWordsPerLine <- stri_count_words(twitter_crudo)
Twitter_totWords <- sum(Twitter_numWordsPerLine)
Twitter_sizeMb <- file.info("./final/en_US/en_US.twitter.txt")$size/1024^2
Twitter_totLines <- length(twitter_crudo)
Twitter_meanWordsPerLine <- mean(Twitter_numWordsPerLine)
Twitter_maxCharsPerLine <- max(nchar(twitter_crudo))  # longest line, in characters

# close connections
close(twitterFile)
close(newsFile)
close(blogsFile)

SUMMARY TABLE

I display here summary information about the text files, namely their names, size in MB, # of text lines, total # of words, mean # of words per line, and # of characters in their longest line.

summaryTable <- data.frame(filename = c("blogs","news","twitter"), 
                  sizeMb = c(Blogs_sizeMb, News_sizeMb, Twitter_sizeMb),
                  totLines = c(Blogs_totLines,News_totLines,Twitter_totLines),
                  totWords = c(Blogs_totWords,News_totWords, Twitter_totWords),
                  meanWords = c(Blogs_meanWordsPerLine, News_meanWordsPerLine, 
                                Twitter_meanWordsPerLine),
                  maxChars = c(Blogs_maxCharsPerLine, News_maxCharsPerLine, 
                               Twitter_maxCharsPerLine))
summaryTable
##   filename   sizeMb totLines totWords meanWords maxChars
## 1    blogs 200.4242   899288 37581903  41.79073    40833
## 2     news 196.2775  1010242 34858293  34.50489    11384
## 3  twitter 159.3641  2360148 30162829  12.78006      140
# draw a 0.9% random sample of lines from each file
blogsSample <- sample(blogs_crudo, as.integer(Blogs_totLines*0.009))
newsSample <- sample(news_crudo, as.integer(News_totLines*0.009))
twitterSample <- sample(twitter_crudo, as.integer(Twitter_totLines*0.009))

# combine the three samples
threeSamples <- c(blogsSample,newsSample, twitterSample)

# gather information about the combined sample threeSamples
threeSamples_totWords <- sum(stri_count_words(threeSamples))
threeSamples_totLines <- length(threeSamples)

THE TRAINING SAMPLE

For the sake of speedy computation, I select small random samples from the three files, amounting to 0.9% of the total # of lines in each. I then combine these samples into a single training file, which includes 38426 lines and 922708 words (a quick check is sketched below).
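As a purely illustrative check, the two figures quoted above can be printed directly from the objects computed in the previous chunk (the word count will vary with the random seed):

# sanity check on the combined training sample (values depend on set.seed(33))
c(lines = threeSamples_totLines, words = threeSamples_totWords)
# fraction of the original lines actually sampled (should be close to 0.009)
threeSamples_totLines / (Blogs_totLines + News_totLines + Twitter_totLines)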

ESSENTIAL NLP COMPUTATIONS

The next two windows of code contain most of the crucial NLP work. Each section of code is complemented by a concise explanatory comment. All in all, the following lines of code:
1. create a text corpus out of the training text file;
2. preprocess the corpus, stripping it of unhelpful elements and details;
3. create a term-document matrix from it, with distinct terms as rows and documents as columns;
4. remove the sparse terms from this matrix;
5. tokenize the preprocessed corpus to generate separate uni-, bi-, and trigram matrices.

# helper function to preprocess a corpus
preProcessFunction <- function(myCorpus) {
      myCorpus <- tm_map(myCorpus, content_transformer(tolower))
      myCorpus <- tm_map(myCorpus, removePunctuation)
      myCorpus <- tm_map(myCorpus, removeWords, profanities)
      myCorpus <- tm_map(myCorpus, removeWords, stopwords("en"))
      myCorpus <- tm_map(myCorpus, removeNumbers)
      myCorpus <- tm_map(myCorpus, stemDocument)
      myCorpus <- tm_map(myCorpus, stripWhitespace)
      return(myCorpus)
}

# helper function to turn a term-document matrix into a frequency data frame
makeFreqDataFrame <- function(matrice) {
      frequenze <- sort(rowSums(as.matrix(matrice)), decreasing=TRUE)
      frame <- data.frame(word=names(frequenze), freq=frequenze)
      return(frame)
}

# helper functions to generate (tokenize) multi-grams
biGramTokenize <- function(x) NGramTokenizer(x, Weka_control(min=2, max=2))
triGramTokenize <- function(x) NGramTokenizer(x, Weka_control(min=3, max=3))
# quadGramTokenize <- function(x) NGramTokenizer(x, Weka_control(min=4, max=4))
# pentaGramTokenize <- function(x) NGramTokenizer(x, Weka_control(min=5, max=5))
# create the corpus
threeSamplesAsCorpus <- VCorpus(VectorSource(threeSamples))

# apply the pre-processing helper function (the result is still a corpus)
preprocCorpus <- preProcessFunction(threeSamplesAsCorpus)

# create the term-document matrix and remove its sparse terms
termDocMatrixB <- TermDocumentMatrix(preprocCorpus)
termDocMatrixC <- removeSparseTerms(termDocMatrixB, 0.99)

# compute the unigram frequency data frame
freqDataFrame1 <- makeFreqDataFrame(termDocMatrixC)

# set the default number of threads used by the parallel library to one,
# so that the RWeka tokenizers work inside TermDocumentMatrix;
# see brian.keng at http://stackoverflow.com/questions/17703553/bigrams-instead-of-single-words-in-termdocument-matrix-using-r-and-rweka
options(mc.cores=1)

# bigram term-document matrix and frequency data frame
termDocMatrixByTwoA <- TermDocumentMatrix(preprocCorpus, control=list(tokenize=biGramTokenize))
termDocMatrixByTwoB <- removeSparseTerms(termDocMatrixByTwoA, 0.999)
freqDataFrameByTwo <- makeFreqDataFrame(termDocMatrixByTwoB)

# trigram term-document matrix and frequency data frame
termDocMatrixByThreeA <- TermDocumentMatrix(preprocCorpus, control=list(tokenize=triGramTokenize))
termDocMatrixByThreeB <- removeSparseTerms(termDocMatrixByThreeA, 0.9999)
freqDataFrameByThree <- makeFreqDataFrame(termDocMatrixByThreeB)

# termDocMatrixByFourA <- TermDocumentMatrix(preprocCorpus, control=list(tokenize=quadGramTokenize))
# termDocMatrixByFourB <- removeSparseTerms(termDocMatrixByFourA, 0.9999)
# freqDataFrameByFour <- makeFreqDataFrame(termDocMatrixByFourB)

FREQUENCY DIAGRAMS

Next I display bar charts of the twelve most frequent unigrams, bigrams, and trigrams.

# bar chart of the 12 most frequent unigrams
ggplot(freqDataFrame1[1:12,], aes(x=reorder(word,freq), y=freq, fill=freq)) +
      geom_bar(stat="identity") +
      theme(axis.title.y = element_blank()) +
      coord_flip() +
      labs(y="Frequency", title="Most Common Unigrams")

# bar chart of the 12 most frequent bigrams
ggplot(freqDataFrameByTwo[1:12,], aes(x=reorder(word,freq), y=freq, fill=freq)) +
      geom_bar(stat="identity") +
      theme(axis.title.y = element_blank()) +
      coord_flip() +
      labs(y="Frequency", title="Most Common Bigrams")

# bar chart of the 12 most frequent trigrams
ggplot(freqDataFrameByThree[1:12,], aes(x=reorder(word,freq), y=freq, fill=freq)) +
      geom_bar(stat="identity") +
      theme(axis.title.y = element_blank()) +
      coord_flip() +
      labs(y="Frequency", title="Most Common Trigrams")

# bar chart of the 12 most frequent quadgrams (not computed in this run)
# ggplot(freqDataFrameByFour[1:12,], aes(x=reorder(word,freq), y=freq, fill=freq)) +
#       geom_bar(stat="identity") +
#       theme(axis.title.y = element_blank()) +
#       coord_flip() +
#       labs(y="Frequency", title="Most Common Quadgrams")

WORD CLOUDS

Next I display the word-cloud representation of the bigram and trigram frequencies.

# compute word cloud of bigrams
nuvolaDeiBigrams <- wordcloud(freqDataFrameByTwo[,1], freqDataFrameByTwo[,2],max.words=20, random.order=FALSE, 
          rot.per=0.2, use.r.layout=FALSE, colors=brewer.pal(8,"Accent"))
## Warning in wordcloud(freqDataFrameByTwo[, 1], freqDataFrameByTwo[, 2],
## max.words = 20, : thank follow could not be fit on page. It will not be
## plotted.
## Warning in wordcloud(freqDataFrameByTwo[, 1], freqDataFrameByTwo[, 2],
## max.words = 20, : happi birthday could not be fit on page. It will not be
## plotted.

# compute word cloud of trigrams
nuvolaDeiTrigrams <- wordcloud(freqDataFrameByThree[,1], freqDataFrameByThree[,2],max.words=20, random.order=FALSE, 
          rot.per=0.2, use.r.layout=FALSE, colors=brewer.pal(8,"Accent"))
## Warning in wordcloud(freqDataFrameByThree[, 1], freqDataFrameByThree[,
## 2], : happi mother day could not be fit on page. It will not be plotted.
## Warning in wordcloud(freqDataFrameByThree[, 1], freqDataFrameByThree[,
## 2], : want make sure could not be fit on page. It will not be plotted.
## Warning in wordcloud(freqDataFrameByThree[, 1], freqDataFrameByThree[,
## 2], : keep good work could not be fit on page. It will not be plotted.

FINDINGS

This initial exercise in NLP was fraught with incognitae, to be solved, hopefully, with my peers’ advice and the next few lectures. My findings do not pertain so much to the specific word and n-gram frequencies displayed in the above tables, bar charts, and word clouds; those frequencies are self-explanatory. They pertain rather to the problems ahead, so much so that each of the following “findings” belongs just as well in “Future Plans,” the next and last section of this report.
It’s also clear from my code that I opted to stem words down to their roots (a small illustration of what the stemmer does is sketched below). It seems intuitively plausible that such stemming brings relevant advantages, but, depending on one’s goals and tasks, it could very well backfire. Further guidance is needed here as well.
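As a small illustrative aside, not part of the pipeline above, one can feed a handful of sample words to tm’s stemDocument to see what the Porter stemmer actually returns; the stem “happi,” for instance, is what lies behind the “happi birthday” bigram reported in the word-cloud warnings above.

# illustrative only: what the Porter stemmer (used by stemDocument) does to a few words
stemDocument(c("happy", "happiness", "running", "stories"), language = "english")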
It’s likewise clear from my code that I opted to remove stop words from the training text. Was that a wise decision? To answer this question competently, I ought to find out, first of all, which specific words, besides “the, is, at, which, and on,” the ‘tm’ package treats as stop words; as I write this, I do not yet know with sufficient precision (a quick check is sketched below). Secondly, I ought to gauge the consequences of dropping such stop words from my language model. Their removal brings out more appealing, and less predictable, grams and multi-grams from the text under examination, but does it do justice to one’s tasks and goals? Is the presence or absence of stop words equally beneficial or detrimental to a prediction model, a translation model, and a speech-recognition model? Admittedly, all these NLP models rest on analogous probabilistic assumptions. Yet it is more than likely that on some occasions stop words have a big role to play, while on others they ought to be expeditiously disposed of. Further guidance and further investigation are both needed.
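The first question, at least, can be answered empirically: the list consumed by removeWords() in the preprocessing step can simply be inspected. A minimal look, for illustration:

# how many English stop words does tm use, and which ones?
length(stopwords("en"))
head(stopwords("en"), 20)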
Another significant finding is that NLP models entail long computation times. True, one can work, as I did, with training samples much smaller than the original files, but there is a limit to how small one can go and still obtain meaningful results. I need to understand how to establish the minimal size of a statistically adequate training sample. Moreover, I hope the next lessons will show us how to handle adequately large files without clogging our machines. A rough way to see how the pipeline scales with the sample size is sketched below.
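One could, for instance, rerun the corpus-to-matrix pipeline at a few sampling fractions and time it. The sketch below is illustrative only; the fractions are arbitrary and the helper reuses preProcessFunction() defined above.

# rough timing experiment: elapsed seconds to build a unigram term-document
# matrix from samples of increasing size (illustrative fractions only)
timeAtFraction <- function(frac) {
      s <- c(sample(blogs_crudo, as.integer(Blogs_totLines * frac)),
             sample(news_crudo, as.integer(News_totLines * frac)),
             sample(twitter_crudo, as.integer(Twitter_totLines * frac)))
      system.time(TermDocumentMatrix(preProcessFunction(VCorpus(VectorSource(s)))))["elapsed"]
}
sapply(c(0.001, 0.003, 0.009), timeAtFraction)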
Through repeated trials, I came to realize how sensitive the term-document matrix is to the sparsity threshold adopted when removing sparse terms (a quick sensitivity check is sketched below). Again, this is an issue of great practical relevance, for which extra instruction is needed.
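A quick way to make this sensitivity visible is to count how many terms survive removeSparseTerms() at a few different thresholds; the sketch below reuses the unigram matrix termDocMatrixB computed earlier, and the thresholds are arbitrary.

# number of terms retained at increasingly permissive sparsity thresholds
sapply(c(0.90, 0.95, 0.99, 0.999),
       function(s) nrow(removeSparseTerms(termDocMatrixB, s)))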
Finally, I realized that the risk of overfitting is much more problematic in NLP than in ordinary regression and classification. If you train your model on texts by Shakespeare, overfitting is inevitable the moment you try to apply your language model to Hemingway. Another caveat to ponder.
It goes without saying that most of the problems raised in this section could be tackled by comparing the accuracy achieved by alternative models. It’s just that, when the number of alternative combinations grows exponentially (and it does so fast with dilemmas of the sorts just mentioned), it is better to be taught the road(s) most taken than to try innumerable, computationally costly paths on one’s own.

FUTURE PLANS

As I understand it, our future assignment will consist of building a predictive text-mining application on the Shiny platform. So, what I mean to do is this: (1) build a basic n-gram model, based possibly on the “Stupid Backoff” algorithm (though for now I keep my options open); and (2) enable this model to handle unseen or out-of-vocabulary n-grams, which amounts to an “open vocabulary task.” (The conventional recipe for this sort of task is to create an unknown-word token and a fixed lexicon of words from the training set; model training then assigns probabilities not only to the words in the fixed lexicon but also to any word outside it, by first mapping that word to the unknown-word token.) A very rough sketch of the Stupid Backoff idea, built on the frequency tables computed above, is given below.
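For concreteness, here is a minimal, non-authoritative sketch of Stupid Backoff scoring built on the stemmed, stop-word-free, sparsity-pruned frequency data frames computed above; the back-off constant 0.4 follows Brants et al. (2007), and the example context is hypothetical. A real model would use unpruned counts and handle the unknown-word token explicitly.

# helper: look up the raw count of a gram in one of the frequency data frames
lookupFreq <- function(frame, gram) {
      f <- frame$freq[frame$word == gram]
      if (length(f) == 0) 0 else f[1]
}

# Stupid Backoff score of candidate word w3 given the two preceding words w1, w2
stupidBackoffScore <- function(w1, w2, w3, alpha = 0.4) {
      triCount <- lookupFreq(freqDataFrameByThree, paste(w1, w2, w3))
      biCount <- lookupFreq(freqDataFrameByTwo, paste(w1, w2))
      if (triCount > 0 && biCount > 0) return(triCount / biCount)
      # back off to the bigram (w2, w3)
      biCount2 <- lookupFreq(freqDataFrameByTwo, paste(w2, w3))
      uniCount2 <- lookupFreq(freqDataFrame1, w2)
      if (biCount2 > 0 && uniCount2 > 0) return(alpha * biCount2 / uniCount2)
      # back off to the unigram w3
      alpha^2 * lookupFreq(freqDataFrame1, w3) / sum(freqDataFrame1$freq)
}

# hypothetical usage, on stemmed tokens as they appear in the frequency tables
stupidBackoffScore("happi", "new", "year")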
Thanks for reading and for your feedback.