This is an exercise in NLP. I have adopted a "literate statistics" approach, which makes all of my code accessible and my results reproducible. The reader uninterested in reproducing the results is invited to disregard the code displayed in the windows below.
The assignment came with five questions to consider, namely:
1. Some words are more frequent than others - what are the distributions of word frequencies?
2. What are the frequencies of 2-grams and 3-grams in the dataset?
3. How many unique words do you need in a frequency sorted dictionary to cover 50% of all word instances in the language? 90%?
4. How do you evaluate how many of the words come from foreign languages?
5. Can you think of a way to increase the coverage – identifying words that may not be in the corpora or using a smaller number of words in the dictionary to cover the same number of phrases?
I will tackle these questions one by one.
I start out by installing the required R packages and by loading and reading three of the text files provided for this exercise, namely, those in English. The same exercise could, however, be executed, with minor changes to the code, on the files provided in other languages.
# install any missing packages, then load all of them
for (package in c('knitr', 'tm', 'RWeka', 'stringi', 'stringr', 'ggplot2', 'dplyr', 'wordcloud', 'NLP', 'openNLP', 'qdap', 'textcat')) {
if (!require(package, character.only=T, quietly=T)) {
install.packages(package)
library(package, character.only=T, warn.conflicts=F, verbose=F, quietly=T)
}
}
opts_chunk$set(echo=TRUE)
set.seed(33)
# define source and target for download
targetFile <- "Coursera-SwiftKey.zip"
sourceFile <- "http://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
# download source to target
download.file(sourceFile, targetFile)
# unzip file
unzip(targetFile)
# open connections
twitterFile <- file("./final/en_US/en_US.twitter.txt", "rb")
newsFile <- file("./final/en_US/en_US.news.txt", "rb")
blogsFile <- file("./final/en_US/en_US.blogs.txt", "rb")
# read files
twitter_crudo <- readLines(twitterFile, encoding = "UTF-8", skipNul = TRUE)
news_crudo <- readLines(newsFile, encoding = "UTF-8", skipNul = TRUE)
blogs_crudo <- readLines(blogsFile, encoding = "UTF-8", skipNul = TRUE)
# close connections
close(twitterFile)
close(newsFile)
close(blogsFile)
The next chunk of code computes basic information about the three selected text files.
News_numWordsPerLine <- stri_count_words(news_crudo)
News_totWords <- sum(News_numWordsPerLine)
News_sizeMb <- file.info("./final/en_US/en_US.news.txt")$size/1024^2
News_totLines <- length(news_crudo)
News_meanWordsPerLine <- mean(News_numWordsPerLine)
News_maxWordsPerLine <- max(News_numWordsPerLine)
News_characters <- sum(nchar(news_crudo))
News_maxCharactersPerLine <- max(nchar(news_crudo))
Blogs_numWordsPerLine <- stri_count_words(blogs_crudo)
Blogs_totWords <- sum(Blogs_numWordsPerLine)
Blogs_sizeMb <- file.info("./final/en_US/en_US.blogs.txt")$size/1024^2
Blogs_totLines <- length(blogs_crudo)
Blogs_meanWordsPerLine <- mean(Blogs_numWordsPerLine)
Blogs_maxWordsPerLine <- max(Blogs_numWordsPerLine)
Blogs_characters <- sum(nchar(blogs_crudo))
Blogs_maxCharactersPerLine <- max(nchar(blogs_crudo))
Twitter_numWordsPerLine <- stri_count_words(twitter_crudo)
Twitter_totWords <- sum(Twitter_numWordsPerLine)
Twitter_sizeMb <- file.info("./final/en_US/en_US.twitter.txt")$size/1024^2
Twitter_totLines <- length(twitter_crudo)
Twitter_meanWordsPerLine <- mean(Twitter_numWordsPerLine)
Twitter_maxWordsPerLine <- max(Twitter_numWordsPerLine)
Twitter_characters <- sum(nchar(twitter_crudo))
Twitter_maxCharactersPerLine <- max(nchar(twitter_crudo))
I display here summary information about the text files, namely, their names, number of words, number of characters, MB size, number of text lines, mean number of words per line, and maximum number of words per line.
summaryTable <- data.frame(filename = c("blogs","news","twitter","total"),
totWords = c(Blogs_totWords,News_totWords,Twitter_totWords,Blogs_totWords+
News_totWords+Twitter_totWords),
totCharacters = c(Blogs_characters, News_characters, Twitter_characters,
Blogs_characters+News_characters+Twitter_characters),
sizeMb = c(Blogs_sizeMb, News_sizeMb, Twitter_sizeMb,
Blogs_sizeMb+News_sizeMb+Twitter_sizeMb),
totLines = c(Blogs_totLines,News_totLines,Twitter_totLines,
Blogs_totLines+News_totLines+Twitter_totLines),
meanWordsPerLine = c(Blogs_meanWordsPerLine,
News_meanWordsPerLine, Twitter_meanWordsPerLine, (Blogs_totWords+News_totWords+Twitter_totWords)/(Blogs_totLines+News_totLines+
Twitter_totLines)),
maxWordsPerLine = c(Blogs_maxWordsPerLine, News_maxWordsPerLine, Twitter_maxWordsPerLine,
((Blogs_maxWordsPerLine*Blogs_sizeMb)+(News_maxWordsPerLine*News_sizeMb)+(Twitter_maxWordsPerLine*Twitter_sizeMb))/(Blogs_sizeMb+News_sizeMb+Twitter_sizeMb)))
summaryTable
## filename totWords totCharacters sizeMb totLines meanWordsPerLine
## 1 blogs 37546246 206824505 200.4242 899288 41.75108
## 2 news 34762395 203223159 196.2775 1010242 34.40997
## 3 twitter 30093410 162096241 159.3641 2360148 12.75065
## 4 total 102402051 572143905 556.0658 4269678 23.98355
## maxWordsPerLine
## 1 6726.000
## 2 1796.000
## 3 47.000
## 4 3071.683
As a personal memento, I include here the code used to answer some of the questions in quiz 1. Please disregard if uninterested.
#quiz 1.4
love_occurs <- sum(grepl(pattern = "love", x = twitter_crudo)) # of sentences where there is
# one or more occurrences of "love"
hate_occurs <- sum(grepl(pattern = "hate", x = twitter_crudo)) # same for word "hate"
answer_4 <- love_occurs/hate_occurs
#quiz1.5
answer_5 <- twitter_crudo[grep(pattern = "biostat", x = twitter_crudo)]
#quiz 1.6
answer_6 <- sum(grepl(pattern="A computer once beat me at chess, but it was no match for me at kickboxing",x=twitter_crudo))
Since our assignment invites us to purge our text files of profanities, here is my source of profane words.
profanities <- readLines("http://www.cs.cmu.edu/~biglou/resources/bad-words.txt", encoding = "UTF-8")
Next I assemble a training set from a 1% random sample of each file and display the same basic information about it as shown above for the three source files.
blogsSample <- sample(blogs_crudo, as.integer(Blogs_totLines*0.01))
newsSample <- sample(news_crudo, as.integer(News_totLines*0.01))
twitterSample <- sample(twitter_crudo, as.integer(Twitter_totLines*0.01))
#combine the 3 samples
threeSamples <- c(blogsSample,newsSample, twitterSample)
write.table(threeSamples, "threeSamples.txt", sep="\t")
# collect data for summary statistics of the data set obtained by concatenating the three samples
threeSamples_numWordsPerLine <- stri_count_words(threeSamples)
threeSamples_totWords <- sum(threeSamples_numWordsPerLine)
threeSamples_sizeMb <- file.info("threeSamples.txt")$size/1024^2
threeSamples_totLines <- length(threeSamples)
threeSamples_meanWordsPerLine <- mean(threeSamples_numWordsPerLine)
threeSamples_maxWordsPerLine <- max(threeSamples_numWordsPerLine)
threeSamples_characters <- sum(nchar(threeSamples))
threeSamples_maxCharactersPerLine <- max(nchar(threeSamples))
summaryTable2 <- data.frame(filename = "threeSamples",
totWords = threeSamples_totWords,
totCharacters = threeSamples_characters,
sizeMb = threeSamples_sizeMb,
totLines = threeSamples_totLines,
meanWordsPerLine = threeSamples_meanWordsPerLine,
maxWordsPerLine = threeSamples_maxWordsPerLine)
summaryTable2
## filename totWords totCharacters sizeMb totLines meanWordsPerLine
## 1 threeSamples 1021430 5713476 5.919318 42695 23.92388
## maxWordsPerLine
## 1 638
The next two windows of code contain most of the crucial NLP coding. Each section of code is complemented by a concise explanatory comment. All in all, the following lines of code:
1. create a text corpus out of the training text file,
2. preprocess the corpus, ridding it of unhelpful elements and details;
2.1. separately from the above preprocessing, rid the corpus of stopwords, which allows for a more fruitful exploratory analysis (this step is kept separate from the overall corpus preprocessing because I will not apply it at the word-prediction stage of the next assignment, since I deem stopwords relevant to good predictive results);
3. create a term document matrix out of the corpus, in which each distinct line of text is a column (or document) and each distinct word is a row (or term);
4. remove the sparse terms from this matrix;
5. tokenize the corpus to generate separate uni-, bi-, and tri-grams.
# 6 sorts of elements to eliminate from text file
hashtags <- "#[[:alnum:]]+"
special <- c("®","™", "¥", "£", "¢", "€", "#", "â€" , "ð" , "Ÿ˜","Š","í", "½","ð","$")
urls <- "(f|ht)tp(s?)://(.*)[.][a-z]+"
email <- "[[:alnum:].-]+@[[:alnum:].-]+"
date <- "[0-9]{2}/[0-9]{2}/[0-9]{4}"
control_characters <- '[[:cntrl:]]'
# helper function to remove above 6 elements
removeFunction <- function(textFile) {
  textFile <- gsub(control_characters, "", textFile)
  textFile <- gsub(urls, "", textFile)
  textFile <- gsub(email, "", textFile)
  textFile <- gsub(paste0(special, collapse = '|'), "", textFile)
  textFile <- gsub(hashtags, "", textFile)
  textFile <- gsub(date, "", textFile)
  return (textFile)
}
# helper function to preprocess corpus
preProcessFunction <- function(myCorpus) {
  myCorpus <- tm_map(myCorpus, content_transformer(tolower))
  myCorpus <- tm_map(myCorpus, removePunctuation)
  myCorpus <- tm_map(myCorpus, removeWords, profanities)
  #myCorpus <- tm_map(myCorpus, removeWords, stopwords("en"))
  myCorpus <- tm_map(myCorpus, removeNumbers)
  #myCorpus <- tm_map(myCorpus, stemDocument)
  myCorpus <- tm_map(myCorpus, stripWhitespace)
  return (myCorpus)
}
# helper function to rid the corpus of stopwords
preProcessFunction2 <- function(myCorpus2) {
  myCorpus2 <- tm_map(myCorpus2, removeWords, stopwords("en"))
  return (myCorpus2)
}
# helper function to build a word-frequency data frame from a term document matrix
data_frame <- function(matrice){
  frequenze <- sort(rowSums(as.matrix(matrice)), decreasing=TRUE)
  frame <- data.frame(word=names(frequenze), freq=frequenze)
  return(frame)
}
# helper functions to generate (tokenize) multi-grams
biGramTokenize <- function(x) NGramTokenizer(x, Weka_control(min=2, max=2))
triGramTokenize <- function(x) NGramTokenizer(x, Weka_control(min=3, max=3))
# apply removeFunction()
threeSamples <- removeFunction(threeSamples)
write.table(threeSamples, "threeSamples.txt", sep="\t")
#create corpus
threeSamplesAsCorpus <- VCorpus(VectorSource(threeSamples))
preProcessedCorpus <- preProcessFunction(threeSamplesAsCorpus)
preProcessedCorpus2 <- preProcessFunction2(preProcessedCorpus)
# exploratory analysis is to be executed on preProcessedCorpus2
# predictive work will be executed on preProcessedCorpus (inclusive of stopwords)
termDocMatrix2 <- TermDocumentMatrix(preProcessedCorpus2)
# remove sparse elements
termDocMatrix2e <- removeSparseTerms(termDocMatrix2, 0.997)
freqDataFrame2 <- data_frame(termDocMatrix2e)
## Sets the default number of threads to use in parallel library
# see brian.keng at http://stackoverflow.com/questions/17703553/bigrams-instead-of-single-words-in-termdocument-matrix-using-r-and-rweka
options(mc.cores=1)
termDocMatrixByTwo <- TermDocumentMatrix(preProcessedCorpus2, control=list(tokenize=biGramTokenize))
termDocMatrixByTwoE <- removeSparseTerms(termDocMatrixByTwo, 0.99973)
freqDataFrameByTwoE <- data_frame(termDocMatrixByTwoE)
termDocMatrixByThree <- TermDocumentMatrix(preProcessedCorpus2, control=list(tokenize=triGramTokenize))
termDocMatrixByThreeB <- removeSparseTerms(termDocMatrixByThree, 0.99997)
freqDataFrameByThreeB <- data_frame(termDocMatrixByThreeB)
An important lesson to be learned from the above computations is how sensitive the results are to the sparsity factor passed to the removeSparseTerms() function: minimal variations in its value have substantial consequences for the size of the term document matrix.
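One can check this sensitivity directly by comparing the dimensions of the term document matrix at a few nearby thresholds. This is only an illustrative check on the objects built above; the exact dimensions depend on the random sample drawn earlier.
# the number of retained terms (rows) changes sharply with small threshold changes
dim(removeSparseTerms(termDocMatrix2, 0.99))
dim(removeSparseTerms(termDocMatrix2, 0.997))
dim(removeSparseTerms(termDocMatrix2, 0.999))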
The frequency data frame of my unigrams contains 632 unigrams.
The frequency data frame of my bigrams contains 929 bigrams.
The frequency data frame of my trigrams contains 3460 trigrams.
Next I display the top 20 terms of each one of these frequency data frames.
freqDataFrame2[1:20,]
## word freq
## will will 3181
## said said 3080
## just just 2916
## one one 2886
## like like 2690
## can can 2501
## get get 2221
## time time 2031
## new new 1976
## now now 1786
## good good 1767
## dont dont 1692
## day day 1652
## people people 1568
## love love 1539
## know know 1500
## see see 1416
## back back 1355
## think think 1316
## make make 1306
freqDataFrameByTwoE[1:20,]
## word freq
## right now right now 237
## new york new york 204
## cant wait cant wait 199
## last year last year 194
## dont know dont know 187
## last night last night 171
## high school high school 145
## feel like feel like 137
## years ago years ago 136
## im going im going 123
## last week last week 119
## im sure im sure 116
## first time first time 114
## dont think dont think 111
## looking forward looking forward 107
## happy birthday happy birthday 106
## can get can get 97
## looks like looks like 96
## make sure make sure 96
## st louis st louis 93
freqDataFrameByThreeB[1:20,]
## word freq
## cant wait see cant wait see 43
## happy mothers day happy mothers day 36
## happy new year happy new year 24
## new york city new york city 24
## im pretty sure im pretty sure 22
## let us know let us know 21
## president barack obama president barack obama 20
## two years ago two years ago 17
## gov chris christie gov chris christie 15
## st louis county st louis county 14
## dont get wrong dont get wrong 13
## world war ii world war ii 12
## dont even know dont even know 11
## dont feel like dont feel like 11
## happy valentines day happy valentines day 11
## high school students high school students 11
## im looking forward im looking forward 11
## looking forward seeing looking forward seeing 11
## rock n roll rock n roll 11
## cant wait get cant wait get 10
Next I plot bar charts of the twelve most frequent terms of each kind.
#plot the unigrams
ggplot(freqDataFrame2[1:12,], aes(x=reorder(word,freq), y=freq, fill=freq)) +
geom_bar(stat="identity") +
theme(axis.title.y = element_blank()) +
coord_flip() +
labs(y="Frequency", title="Most Common Unigrams")
# plot the bigrams
ggplot(freqDataFrameByTwoE[1:12,], aes(x=reorder(word,freq), y=freq, fill=freq)) +
geom_bar(stat="identity") +
theme(axis.title.y = element_blank()) +
coord_flip() +
labs(y="Frequency", title="Most Common Bigrams")
# plot the trigrams
ggplot(freqDataFrameByThreeB[1:12,], aes(x=reorder(word,freq), y=freq, fill=freq)) +
geom_bar(stat="identity") +
theme(axis.title.y = element_blank()) +
coord_flip() +
labs(y="Frequency", title="Most Common Trigrams")
Next I display two word-cloud representations: the first of my twenty most frequent bigrams, the second of my twenty most frequent trigrams.
# compute word cloud of bigrams
nuvolaDeiBigrams <- wordcloud(freqDataFrameByTwoE[,1], freqDataFrameByTwoE[,2],max.words=20, random.order=FALSE,
rot.per=0.2, use.r.layout=FALSE, colors=brewer.pal(8,"Accent"))
# compute word cloud of trigrams
nuvolaDeiTrigrams <- wordcloud(freqDataFrameByThreeB[,1], freqDataFrameByThreeB[,2],max.words=20, random.order=FALSE,
rot.per=0.2, use.r.layout=FALSE, colors=brewer.pal(8,"Accent"))
Thus far I have tackled the first two of the five questions under consideration. Next comes the third question: "How many unique words do you need in a frequency sorted dictionary to cover 50% of all word instances in the language? 90%?" Let me write a suitable function and extend my answer beyond unique words to bi-grams and tri-grams.
wordsForTargetCoverage <- function(freqDataFrame, targetCoverage) {
  nwords <- 0  # running count of word instances covered so far
  aimedAtCoverage <- targetCoverage*sum(freqDataFrame$freq)
  for (i in 1:nrow(freqDataFrame)) {
    nwords <- nwords + freqDataFrame$freq[i]
    if (nwords >= aimedAtCoverage) {
      return (i)
    }
  }
}
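The coverage figures reported below were obtained with calls of the following form (a usage sketch; the exact values depend on the 1% sample drawn above and on the chosen sparsity thresholds).
wordsForTargetCoverage(freqDataFrame2, 0.5)        # unique words needed for 50% coverage
wordsForTargetCoverage(freqDataFrame2, 0.9)        # unique words needed for 90% coverage
wordsForTargetCoverage(freqDataFrameByTwoE, 0.5)   # bi-grams needed for 50% coverage
wordsForTargetCoverage(freqDataFrameByThreeB, 0.9) # tri-grams needed for 90% coverage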
In order to cover 50% of all word instances in the language (precisely, in my training text file, which, as we saw above, includes 1021430 words), I need 129 words.
In order to cover 90% of all word instances, I need 482 words.
In order to cover 50% of all bi-gram instances in the language (precisely, in my training text file, which includes, after tokenization and sparsity decrease, 929 bi-grams), I need 236 bi-grams.
In order to cover 90% of all bi-gram instances, I need 751 bi-grams.
In order to cover 50% of all tri-gram instances in the language (precisely, in my training text file, which includes, after tokenization and sparsity decrease, 3460 tri-grams), I need 1356 tri-grams.
In order to cover 90% of all tri-gram instances, I need 3040 tri-grams.
Question 4: “How do you evaluate how many of the words come from foreign languages?”
With the function textcat() from the package "textcat" one ought to be able to loop through the words of a text file and remove those that are not identified as English. One could then compare the word counts of the purged and original text files to estimate the proportion of foreign words. This, however, requires further probing of the package's functionality. Moreover, since English borrows innumerable words from foreign languages, the risk of this procedure is that it would deprive the original text file of meaningful sentences.
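As a rough and untested sketch of this idea (it reuses the sample assembled above; textcat() is designed for longer stretches of text, so per-word guesses are noisy and the resulting ratio would be a crude approximation at best):
# classify a small random subset of words and estimate the share not recognized as English
wordsInSample <- unlist(strsplit(tolower(threeSamples), "\\s+"))
someWords <- sample(wordsInSample, min(1000, length(wordsInSample)))
guessedLanguages <- textcat(someWords)
mean(guessedLanguages != "english", na.rm = TRUE)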
Question 5: “Can you think of a way to increase the coverage – identifying words that may not be in the corpora or using a smaller number of words in the dictionary to cover the same number of phrases?”
Beyond the approaches suggested in the question itself, two further ways to increase coverage are context clustering and word clustering. Both pertain to the task of attributing specific meanings, or senses, to a word, so they are both related to word-sense disambiguation. The former aims at clustering the contexts in which the target word occurs, the latter at clustering the words related to the target word. It seems to me that a proper use of these two methods could increase coverage without increasing the number of words. But it's all a bit above my head for now.
Another way to increase coverage is through stemming. In the above preprocessing function, I deliberately commented out the command < tm_map(myCorpus, stemDocument) >, which would have reduced the words in my training file to their stems. Stemming has the great advantage that, for instance, it turns the noun "stop" and the verb "stopped" into the same word: "stop." So, yes, it increases coverage by decreasing the number of words needed to cover a certain desired share of the language. But stemming suffers from a serious disadvantage for word prediction, which is the main goal of this capstone project: if you want to know what comes next after "I stopped by the," my application is unlikely to give you a good result if the verb "to stop" never appears in my data set in a past tense form.
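As a small illustration of this trade-off (the sketch assumes the SnowballC package, on which tm's stemDocument() relies, is installed):
library(SnowballC)
# the base form, plural, past tense and gerund all collapse to the single stem "stop"
wordStem(c("stop", "stops", "stopped", "stopping"), language = "english")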
Lastly, the second part of this capstone project entails implementing a next-word prediction algorithm and displaying the result in a Shiny app. For my algorithm, I plan to rely on the Markov chain principle, which enables one to disregard the overall semantic and contextual history of the queried phrase and to rely instead on the most recent history conveyed by bi-grams and tri-grams. Secondly, I plan to rely on the backoff principle, which enables one to fall back to shorter n-grams when the longer n-grams (my longest n-grams will be tri-grams) do not provide predictive results. Thirdly, in the light of the Kneser-Ney smoothing paradigm, designed to counter data sparsity via discounting (i.e. by attributing slightly lower probability to n-grams present in the corpus and some small probability to n-grams absent from it), I plan to increase the range of my available n-grams by adopting "wild cards." A wild card stands for "any word" (in my case, any word from the corpus). More specifically, in my quest for the next word, I will first check whether there are any tri-grams whose first two words correspond to the last two words of the queried phrase. If yes, I will select, as my predicted word, the last word of the tri-gram with the highest probability. If not, I will check whether there are any tri-grams whose first word is a wild card and whose second word corresponds to the last word of the queried phrase. If yes, the third word of the most likely of these tri-grams (which, to all intents and purposes, are the bi-grams from my corpus) will provide my predicted word. If not, I will switch to examining the tri-grams whose middle word is a wild card. And so on, consistently with the backoff principle.
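To make this plan concrete, here is a minimal sketch of the intended backoff lookup. The helper predictNextWord() is hypothetical (it is not part of the code above); it simply reuses the frequency-sorted tri-gram and bi-gram tables built earlier, and it leaves out wild cards and smoothing.
predictNextWord <- function(phrase,
                            triGrams = freqDataFrameByThreeB,
                            biGrams = freqDataFrameByTwoE) {
  tokens <- unlist(strsplit(tolower(phrase), "\\s+"))
  lastTwo <- paste(tail(tokens, 2), collapse = " ")
  lastOne <- tail(tokens, 1)
  # 1) tri-grams whose first two words match the end of the phrase
  hits <- triGrams[grepl(paste0("^", lastTwo, " "), triGrams$word), ]
  if (nrow(hits) > 0) return(word(as.character(hits$word[1]), 3))  # word() is from stringr
  # 2) back off to bi-grams whose first word matches the last word of the phrase
  hits <- biGrams[grepl(paste0("^", lastOne, " "), biGrams$word), ]
  if (nrow(hits) > 0) return(word(as.character(hits$word[1]), 2))
  # 3) no match: a fuller implementation would fall back to the most frequent unigrams
  NA_character_
}
predictNextWord("I live in new")  # e.g. backs off to the bi-gram "new york" and returns "york"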