Worldwide mobile internet usage is projected to continue its rapid growth over the next few years. According to a Statista Fact Sheet, the percentage of mobile phone users accessing the internet is expected to rise from 48.8% in 2014 to 63.4% in 2019. This increased ownership has resulted in more people spending more time on mobile devices for email, social networking, banking, and other activities. Because typing on these devices is awkward and tedious, smart keyboard applications based on predictive text analytics have emerged to make typing easier.
The overall goal of this project was to develop a prototype predictive web application that suggests the next word in a message based on what the user has typed so far. For example, a user may type I love Italian and the application might suggest: food, shoes, or opera. This goal was broken down into four objectives:
In addition to these objectives, this document describes how the data was partitioned, how it was cleaned prior to n-gram table construction, and what exploratory data analysis (EDA) was done. This was the most time-consuming objective, which was no surprise given what has been reported in the literature.
The data was initially split into an 80% training set and a 20% test set. All cleaning, EDA, and parameter optimization work was performed on the training set. All code used to perform any analysis step is listed in the Appendix and referred to throughout this document.
Once the data was cleaned, unigram, bigram, and trigram frequency tables were constructed. These tables are at the heart of the data required by the model to make its predictions. Some EDA was performed on these tables to gain insights which would later serve the model development phase.
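To make the construction of these tables concrete, the sketch below shows one way unigram, bigram, and trigram frequency tables can be built from cleaned sentences using quanteda. This is an illustration only: the actual table-building code is covered in Part 2, and the ngramFreq helper shown here is hypothetical.

# Illustrative sketch: building n-gram frequency tables with quanteda
# (ngramFreq is a hypothetical helper, not part of the project code).
library(quanteda)

sents <- c("the cat sat on the mat EOS", "the cat ran EOS")  # toy cleaned input

ngramFreq <- function(x, n) {
    ngrams <- tokens_ngrams(tokens(x), n = n, concatenator = "_")
    sort(colSums(dfm(ngrams)), decreasing = TRUE)
}

unigrams <- ngramFreq(sents, 1)
bigrams  <- ngramFreq(sents, 2)
trigrams <- ngramFreq(sents, 3)
head(bigrams)  # e.g. the_cat occurs twice in this toy input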
A Katz Back-Off (KBO) Trigram Language Model (LM) was selected as the algorithm to make these predictions. A detailed description of how this algorithm works can be found here. The KBO Trigram was chosen for three main reasons:
As described here, the KBO Trigram has two parameters: the bigram discount rate (\(\gamma_2\)) and the trigram discount rate (\(\gamma_3\)). The default values for both of these parameters were set to 0.5 in the web app, but values which improved accuracy were obtained using cross-validation. Better-performing values for these parameters, and the process by which they were obtained, will be available here when this work has been completed:
This objective was addressed concurrently with each of the prior objectives by making sure that each function was properly documented, that all code was made available, and that links to all output files were provided. All pre-processing code (Objective #1) is provided in the Appendix section of this R Markdown document.
The results of meeting this objective can be found by visiting:
The data was originally downloaded from this link and stored locally. If this link is no longer available, the data can be obtained from my dropbox. The zip file was downloaded to a directory called data in a local project and unzipped there. The unzipped data contained four subdirectories: de_DE (German), en_US (US English), fi_FI (Finnish), and ru_RU (Russian). This project focuses on the English corpora residing in the en_US folder. This folder contains three files named en_US.blogs.txt, en_US.news.txt, and en_US.twitter.txt.
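For completeness, a minimal sketch of this download-and-unzip step is shown below. The URL and zip file name are placeholders standing in for the links above, not values taken from the project.

# Sketch only: download and unpack the corpus into a local "data" directory.
# data.url and the zip file name are placeholders for illustration.
data.url <- "<corpus zip URL>"              # see the download link above
dir.create("data", showWarnings = FALSE)
zip.path <- file.path("data", "corpus.zip")
download.file(data.url, zip.path, mode = "wb")
unzip(zip.path, exdir = "data")
list.files("data", recursive = TRUE)        # expect de_DE, en_US, fi_FI, ru_RU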
With the data residing locally, the writeTrainTestFiles function was used to read each of the three data files, partition each of them into 80% training and 20% hold-out test sets, and then write these two partitioned files locally. The training set was used to train the model and determine values for the two model parameters: the bigram discount rate and the trigram discount rate. Descriptions of what these parameters are can be found here.
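For context, the sketch below shows the standard absolute-discounting form in which such discount rates enter a Katz back-off model; the linked description should be treated as the authoritative reference for the exact formulation used in this project.

\[
P_{KBO}(w_i \mid w_{i-2}, w_{i-1}) =
\begin{cases}
\dfrac{c(w_{i-2}, w_{i-1}, w_i) - \gamma_3}{c(w_{i-2}, w_{i-1})} & \text{if } c(w_{i-2}, w_{i-1}, w_i) > 0 \\[2ex]
\alpha(w_{i-2}, w_{i-1}) \, P_{KBO}(w_i \mid w_{i-1}) & \text{otherwise}
\end{cases}
\]

Here \(c(\cdot)\) denotes a count from the n-gram tables and \(\alpha(w_{i-2}, w_{i-1})\) is the back-off weight that redistributes the probability mass removed by \(\gamma_3\); the bigram level is handled analogously, with \(\gamma_2\) discounting observed bigrams and backing off to unigram frequencies.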
After the data was read and partitioned into training and test sets, the training files were cleaned using the following steps:
The above steps were executed off-line (not in this R Markdown file) because they were time intensive. To speed future processing, the three cleaned en_US files were rewritten back to the local file system so they could be read in again later.
The parseSentsToFile function was used to parse the data into a single sentence per line and then write the resulting files out for later processing. Two files were generated for each of the original data files at this step. These files were:
Files ending in 1sents.txt are the files initially parsed into sentences, before fixing the issue of improper sentence breaks across the tokens St. SomeSaintsName. Files ending in 2sents.txt are the files that were run through the annealSaintErrors function to fix these errors.
Because the annealSaintErrors function didn’t fix all the St. SomeSaintsName parse errors, each 2sents.txt file was opened on Windows 10 in NotePad++ v6.9.2 (aka NP++) and the following steps were performed:
Find what: (St[.])(\r\n)([A-Z]+)
Replace with: \1 \3

If future versions of NP++ do not behave the same as described for v6.9.2, this version can be downloaded here.
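An equivalent repair can also be sketched in R; the snippet below only illustrates the same join logic as the NP++ find/replace above and is not the process that was actually used (fixSaintBreaks is a hypothetical helper).

# Sketch only: join a line ending in "St." with the following line when it
# begins with a capitalized word (mirrors the NP++ find/replace above).
fixSaintBreaks <- function(lines) {
    txt <- paste(lines, collapse = "\n")
    txt <- gsub("(St[.])\n([A-Z])", "\\1 \\2", txt, perl = TRUE)
    strsplit(txt, "\n", fixed = TRUE)[[1]]
}

fixSaintBreaks(c("We drove to St.", "Louis on Friday.", "It rained."))
# [1] "We drove to St. Louis on Friday." "It rained."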
Additional filtering was accomplished using a variety of functions which are described below. To make the filtering process more manageable, the runFilterAndWrite function was created to take a function as a parameter, run that function against an input file, and then write the results to a file. This reduced duplicate code by consolidating all the code that does file IO into a single function.
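As an illustration, a call for the ASCII-conversion step described in the next section might look like the following; the file suffixes are taken from the file names mentioned in this document, but the exact invocations used for each step are an assumption.

# Possible pipeline invocation (suffixes per the file names in the text;
# the exact calls used for each step are an assumption):
runFilterAndWrite(convertToAscii,
                  inFilePostfix  = ".2sents.txt",
                  outFilePostfix = ".3ascii.txt")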
After completion of sentence parsing, the resulting files were UTF-8 encoded but still contained some unusual characters. To clean out these characters, each file was run through the convertToAscii function by passing it to runFilterAndWrite. The convertToAscii function passed the contents through an ASCII encoder, and the results were written out with a .3ascii.txt suffix. Links to these files are provided below:
This step removed all lines that contained unusual characters, removing 29 lines from the blogs file, 14 lines from the news file, and only 5 lines from the twitter file. However, these files still contained a lot of tags of the form <U+hhhh> where h is a hex digit from 0 to F. These were unicode tags that were generated during the data partitioning step when the partitioned files were written back out. As a result, the next challenge was figuring out how to deal with these tags. The following four options were considered:
Options 1. and 2. were ruled out because there were so many of these tokens: over 600k in the blogs file, over 200k in the news file, and over 100k in the twitter file. Keeping them would have distorted the n-gram tables, while removing the affected lines would have reduced the n-gram counts. Either outcome would have reduced the quality of the n-gram tables used by the model and degraded its accuracy.
In evaluating the remaining two options, frequency tables were constructed using the writeUnicodeTagFreqTables function. The resulting tables were:
A vast majority of these tags were <U+FFFD>, which is a special code used to replace an unknown or unrepresentable character. Upon manual inspection, the most frequent use of this character that we would want to preserve was as a single quote in a contraction such as isn’t or doesn’t, or as a possessive form such as Tom’s or Mary’s. The convertUnicodeTags function was created and passed to runFilterAndWrite in order to replace unicode tags with either a single quote or a space (which gets cleaned up in a later step). The outputs from this function resulted in the following 3 files:
The removeUrls function was passed to runFilterAndWrite to remove URLs. Several of the most common forms were designed to be captured, and the frequency of those that were not captured was deemed acceptably low. The outputs from this function resulted in the following 3 files:
The preEosClean function was passed to runFilterAndWrite to do the filtering prior to EOS marker insertion. This step in the pipeline removed things like unneeded dashes and single quotes. The outputs from this function resulted in the following 3 files:
The addEosMarkers function was passed to runFilterAndWrite to do the EOS token insertion. The outputs from this function resulted in the following 3 files:
The postEosClean function was passed to runFilterAndWrite to do the filtering after EOS marker insertion. The outputs from this function resulted in the following 3 files:
The next steps in the data analysis pipeline are described in Predicting Next Word Using Katz Back-Off: Part 2 - N-grams and Exploratory Data Analysis.
# install packages if needed
list.of.packages <- c('dplyr', 'readr', 'stringr', 'quanteda')
new.packages <- list.of.packages[!(list.of.packages %in% installed.packages()[,"Package"])]
if(length(new.packages) > 0) install.packages(new.packages)
# load libraries
# libs <- c('dplyr', 'readr', 'quanteda')
lapply(list.of.packages, require, character.only=TRUE) # load libs
options(stringsAsFactors = FALSE) # strings are what we are operating on...
# set parameters
ddir <- "../data/en_US/" # assumes exec from dir at same level as data
fnames <- c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt")
fnames.train <- c("en_US.blogs.train.txt", "en_US.news.train.txt",
"en_US.twitter.train.txt")
## Reads the text corpus data file and returns a character array where every
## element is a line from the file.
## fileId = string, text fragment of file name to be read e.g. 'blogs', 'news',
## or 'twit'
## dataDir = path to data file to be read
## fnames = file names to be read which have fileId fragments
getFileLines <- function(fileId, dataDir=ddir, fileNames=fnames) {
    # use the fileNames argument (not the global fnames) so callers can pass
    # the training-set file names
    index <- grep(fileId, fileNames)
    if(length(index) == 0) {
        cat('getFileLines could not understand what file to read:', fileId)
        return(NULL)
    }
    fileLines <- read_lines(sprintf("%s%s", dataDir, fileNames[index]))
    return(fileLines)
}
## Breaks the en_US.<fileType>.txt into training and test sets and writes out
## these files.
## fileType - string, one of 3 values: 'blogs', 'news', or 'twitter'
## train.fraction - float between 0 and 1, fractional amount of data to be used
## in the training set
## dataDir - relative path to the data directory
writeTrainTestFiles <- function(fileType, train.fraction=0.8,
dataDir=ddir) {
set.seed(71198)
prefix <- "en_US."
in.postfix <- ".txt"
train.postfix <- ".train.txt"
test.postfix <- ".test.txt"
infile <- sprintf("%s%s%s%s", dataDir, prefix, fileType, in.postfix)
dat <- getFileLines(fileType)
line.count <- length(dat)
train.size <- as.integer(train.fraction * line.count)
test.size <- line.count - train.size
train.indices <- sample(1:line.count, train.size, replace=FALSE)
train.indices <- train.indices[order(train.indices)]
test.indices <- setdiff(1:line.count, train.indices)
train.set <- dat[train.indices]
ofile <- sprintf('%s%s%s%s', dataDir, prefix, fileType, train.postfix)
writeLines(train.set, ofile)
test.set <- dat[test.indices]
ofile <- sprintf('%s%s%s%s', dataDir, prefix, fileType, test.postfix)
writeLines(test.set, ofile)
# return(list(train=train.indices, test=test.indices))
}
## Returns a character vector where every element is a sentence of text.
##
## NOTE1: This function will improperly parse "St. Something" into 2 sentences.
## It makes other mistakes (e.g. Ph.D.) which one could spend a crazy amount of
## time fixing, but these other errors are ignored in the interest of time.
##
## To fix the "Saint" issue, the char vector returned by this function
## needs to be passing to the annealSaintErrors function to fix most
## (> 90% based on a manual analysis of the 1st 150k lines of the news
## file) of these errors.
##
## NOTE2: This function took over 22 hrs to run on my quad-core Xeon with
## 16Gb RAM on the twitter 80% training set.
##
## charVect - character vector where every element may contain 1 or more
## sentences of text
## check.status - the number of lines to process before writing a status
## message to the console
## Preconditions: This function requires the quanteda package.
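## NOTE3: tokenize() comes from the older quanteda API; in more recent quanteda
## releases the equivalent call is tokens(charVect, what="sentence").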
breakOutSentences <- function(charVect, check.status=10000) {
sentenceTokens <- tokenize(charVect, what="sentence")
sentNormCharVect <- vector(mode = "character")
counter <- 0
for(i in 1:length(sentenceTokens)) {
counter <- counter + 1
sent.tokenized.line <- sentenceTokens[[i]]
sentNormCharVect <- append(sentNormCharVect, sent.tokenized.line)
if(counter == check.status) {
completed <- (100*i) / length(sentenceTokens)
cat(i, "breakOutSentences: lines parsed to sentences ",
completed, "% completed", as.character(Sys.time()), "\n")
counter <- 0
}
}
return(sentNormCharVect)
}
## Repairs (anneals) sentences that were initially parsed improperly across
## the pattern "St. SomeSaintsName". NOTE: This function took many hours
## to complete on the training data sets.
annealSaintErrors <- function(charVect, status.check=10000) {
annealedSents <- vector(mode='character')
next.sent <- ""
i <- 1
counter <- 0
while(i < length(charVect)) {
counter <- counter + 1
curr.sent <- charVect[i]
next.sent <- charVect[i+1]
hasTerminalSt <- length(grep('(St[.])$', curr.sent)) > 0
if(hasTerminalSt) {
# sentence ends with St.: concat w/ following sentence
annealedSents <- append(annealedSents,
paste(curr.sent, next.sent))
i <- i + 1
} else {
annealedSents <- append(annealedSents, curr.sent)
}
i <- i + 1
if(counter == status.check) {
completed <- (100*i) / length(charVect)
cat(i, "annealSaintErrors: lines annealed ",
completed, "% completed", as.character(Sys.time()), "\n")
counter <- 0
}
}
annealedSents <- append(annealedSents, next.sent) # add last sentence
return(annealedSents)
}
## Returns the file name of the training set data given fileId which can be
## one of the 3 values: 'blogs', 'news', or 'twitter'. Returns an empty string
## (char vector), if fileId is not one of the 3 expected string values.
getInputDataFileName <- function(fileId) {
isBlogs <- length(grep(fileId, 'blogs')) > 0
isNews <- length(grep(fileId, 'news')) > 0
isTwitter <- length(grep(fileId, 'twitter')) > 0
if(isBlogs) return(fnames.train[1])
if(isNews) return(fnames.train[2])
if(isTwitter) return(fnames.train[3])
return("")
}
## Reads inFileName, parses each line into sentences, fixes most of the "Saint"
## parsing errors, and writes the results to files named:
## [original file name].1sents.txt after initial sentence parsing and
## [original file name].2sents.txt after fixing improper sentence breaks across
## the "St. SomeSaintName" tokens.
parseSentsToFile <- function(inFileType,
outDataDir=ddir,
outFilePostfix1=".1sents.txt",
outFilePostfix2=".2sents.txt") {
inFileName <- getInputDataFileName(inFileType)
outFileName1 <- str_replace(inFileName, '.txt', outFilePostfix1)
outFileName2 <- str_replace(inFileName, '.txt', outFilePostfix2)
outFilePath1 <- sprintf("%s%s", outDataDir, outFileName1)
outFilePath2 <- sprintf("%s%s", outDataDir, outFileName2)
cat("start parseSentsToFile:", as.character(Sys.time()), "\n")
cat("processing file:", inFileName, "\n")
cat("output will be written to:", outFilePath1, "\n")
flines <- getFileLines(fileId=inFileType, dataDir=ddir,
fileNames=fnames.train)
flines <- breakOutSentences(flines)
cat("parseSentsToFile breakOutSentences completed.", "\n")
writeLines(flines, con = outFilePath1)
cat("output written to:", outFilePath1, "\n")
cat("parseSentsToFile annealSaintErrors started...:", as.character(Sys.time()), "\n")
flines <- annealSaintErrors(flines)
writeLines(flines, con = outFilePath2)
cat("St. annealed file written to:", outFilePath2, "\n")
cat("finish parseSentsToFile:", as.character(Sys.time()), "\n")
}
## Removes all the non-ASCII characters from charVect and then returns a
## character vector that contains only ASCII characters. This function is
## intended to be passed to the runFilterAndWrite function.
convertToAscii <- function(charVect) {
    cat("convertToAscii: start UTF-8 to ASCII conversion...\n")
    charVectAscii <- iconv(charVect, from="UTF-8", to="ASCII")
    # drop the lines iconv could not convert (returned as NA); if there are
    # none, keep the vector intact rather than indexing with an empty vector
    na.indices <- which(is.na(charVectAscii))
    if(length(na.indices) > 0) charVect <- charVect[-na.indices]
    cat("convertToAscii: finished converting UTF-8 to ASCII.\n")
    return(charVect)
}
## Builds and writes out frequency tables on the unicode tags in the 3ascii.txt
## files
writeUnicodeTagFreqTables <-
function(index, dataDir="C:/data/dev/PredictNextKBO/data/en_US/") {
infiles <- c(sprintf('%s%s', dataDir, 'en_US.blogs.train.3ascii.txt'),
sprintf('%s%s', dataDir, 'en_US.news.train.3ascii.txt'),
sprintf('%s%s', dataDir, 'en_US.twitter.train.3ascii.txt'))
names(infiles) <- c('blogs', 'news', 'twitter')
    unicodePattern <- "<U[+][A-F0-9]{4}>"
    data <- read_lines(infiles[index])
    ucodes <- unlist(str_extract_all(data, unicodePattern))
    ucodesTable <- sort(table(ucodes), decreasing = TRUE)
    # as.integer drops the table class so the csv gets simple tag/freq columns
    write.csv(data.frame(tag=names(ucodesTable), freq=as.integer(ucodesTable)),
              sprintf('%s%s', names(infiles[index]), '.utags.csv'), row.names=FALSE)
}
## Replaces the unicode tag delimiting contractions and possessive forms
## with an ASCII single quote character in the character vector charVect,
## replaces all other unicode tags with spaces, and then returns the updated
## character vector. This function is intended to be passed to the
## runFilterAndWrite function.
convertUnicodeTags <- function(charVect) {
cat("convertUnicodeTags: start replacing unicode tags...\n")
    singleQuotePattern <- "([A-Za-z]{1})(<U[+][A-Fa-f0-9]{4}>)(s|d|ve|t|ll|re)"
    unicodePattern <- "<U[+][A-Fa-f0-9]{4}>"
    imFixPattern <- "([Ii])(<U[+][A-Fa-f0-9]{4}>)([mM])"
    charVectContractions <- str_replace_all(charVect, singleQuotePattern,
                                            "\\1'\\3")
charVectImFix <- str_replace_all(charVectContractions, imFixPattern,
"\\1'\\3")
# Replace remaining unicode tags with spaces because extra spaces
# will get cleaned up in a later pre-processing step.
charVectNoTags <- str_replace_all(charVectImFix, unicodePattern, ' ')
cat("convertUnicodeTags: FINISHED replacing unicode tags.\n")
return(charVectNoTags)
}
## Removes most URL's that start with either http, https, or www. from the
## character vector charVect and returns the resulting character vector.
## This function is intended to be passed to the runFilterAndWrite function.
removeUrls <- function(charVect) {
cat("removeUrls: start removing urls...\n")
# Build regex to remove URLs. No shorthand character classes in R,
# so need to create by hand
wordChars <- "A-Za-z0-9_\\-"
# urlRegex <- "(http|https)://[\w\-_]+(\.[\w\-_]+)+[\w\-.,@?^=%&:/~\\+#]*"
urlRegex1 <- sprintf("%s%s%s", "(http|https)(://)[", wordChars, "]+")
urlRegex2 <- sprintf("%s%s%s", "(\\.[", wordChars, "]+)+")
urlRegex2 <- sprintf("%s%s%s%s", urlRegex2, "[", wordChars, ".,@?^=%&:/~\\+#]*")
urlRegex <- sprintf("%s%s", urlRegex1, urlRegex2)
charVect <- gsub(urlRegex, "", charVect, perl=TRUE)
# clean up www.<something> instances that don't start with http(s)
urlRegexWww <- sprintf("%s%s%s%s", "( www\\.)[", wordChars, "]+", urlRegex2)
charVect <- gsub(urlRegexWww, "", charVect, perl=TRUE)
cat("removeUrls: FINISHED removing urls.\n")
return(charVect)
}
## Consolidates the tasks of reading data in and writing data out as part of
## filtering or cleaning the data.
## FUN - function to run against the input data
## dataDir - directory where the input is read and the output is written
## inFilePostfix - suffix of input data files that are read in and passed to FUN
## outFilePostfix - suffix of output data files that are written after FUN
##                  has processed the input.
## filePrefixes - prefixes of the files to be read in and written out
runFilterAndWrite <- function(FUN, dataDir=ddir, inFilePostfix, outFilePostfix,
filePrefixes=c('en_US.blogs.train',
'en_US.news.train',
'en_US.twitter.train')) {
infiles <- c(sprintf('%s%s%s', dataDir, filePrefixes[1], inFilePostfix),
sprintf('%s%s%s', dataDir, filePrefixes[2], inFilePostfix),
sprintf('%s%s%s', dataDir, filePrefixes[3], inFilePostfix))
names(infiles) <- c('blogs', 'news', 'twitter')
outfiles <- c(sprintf('%s%s%s', dataDir, filePrefixes[1], outFilePostfix),
sprintf('%s%s%s', dataDir, filePrefixes[2], outFilePostfix),
sprintf('%s%s%s', dataDir, filePrefixes[3], outFilePostfix))
names(outfiles) <- names(infiles)
cat("runFilterAndWrite: start running filter...\n")
for(i in names(infiles)) {
charVect <- read_lines(infiles[i])
charVectFiltered <- FUN(charVect)
writeLines(charVectFiltered, outfiles[i])
}
cat("convertUnicodeTags: FINISHED replacing unicode tags.\n")
}
## Removes dashes in a variety of forms.
## charVect - the character vector to have dashes cleaned
## dash.patterns - character vector that determines the type of dash removal to
## be done. There are 3 valid values: 'suspended', 'leading', or 'trailing'
cleanDashes <- function(charVect,
dash.patterns=c('suspended',
'leading',
'trailing')) {
if(length(grep("suspended", dash.patterns)) > 0) {
# remove suspended dash
charVect <- gsub("[ ]+[\\-]+[ ]+", " ", charVect, perl=TRUE)
}
if(length(grep("leading", dash.patterns)) > 0) {
# remove leading dash
charVect <- gsub("[ ]+[\\-]+", " ", charVect, perl=TRUE)
}
if(length(grep("trailing", dash.patterns)) > 0) {
# remove trailing dash
charVect <- gsub("[\\-]+[ ]+", " ", charVect, perl=TRUE)
}
return(charVect)
}
## The 's (prevalent in the twitter file) usually had one of two meanings:
## either 'his' or 'is'. Because making that distinction was not easy to codify,
## this function simply removes the leading single quote.
tokenizeIsHis <- function(charVect) {
    # keep the surrounding spaces so adjacent words are not fused together
    charVect <- gsub(" 's ", " s ", charVect, perl=TRUE)
    return(charVect)
}
## Handles various configurations where single quotes are used.
## charVect - the character vector to have single quotes cleaned
## quote.patterns - character vector that determines the type of single quote
## processing to be done. There are 3 valid values: 'suspended', 'leading',
## or 'trailing'
cleanSingleQuotes <- function(charVect,
quote.patterns=c('suspended',
'leading',
'trailing')) {
    all.patterns <- (length(grep("suspended", quote.patterns)) > 0) &&
                    (length(grep("leading", quote.patterns)) > 0) &&
                    (length(grep("trailing", quote.patterns)) > 0)
if(all.patterns) {
# deal with case of <something>' '<something else> which occurs in
# twitter files
charVect <- gsub("([a-zNUM]+)([']+[ ]+[']+)([a-zNUM]+)", "\\1 \\3", charVect, perl=TRUE)
}
if(length(grep("suspended", quote.patterns)) > 0) {
# remove 1 or more suspended single quotes
charVect <- gsub("[ ]+[']+[ ]+", " ", charVect, perl=TRUE)
}
if(length(grep("leading", quote.patterns)) > 0) {
# replace leading single quote space with space
charVect <- gsub("[ ]+[']+", " ", charVect, perl=TRUE)
}
if(length(grep("trailing", quote.patterns)) > 0) {
# replace trailing single quote space with space
charVect <- gsub("[']+[ ]+", " ", charVect, perl=TRUE)
}
return(charVect)
}
## Removes non-essential characters but keeps spaces and basic punctuation
## characters such as ? . ! ' - in a somewhat intelligent manner.
preEosClean <- function(charVect) {
cat("preEosClean: start pre-EOS marker cleaning at",
as.character(Sys.time()),"...\n")
# Remove anything that's not an alpha, digit (will replace digits with NUM
# later), or basic punctuation char. Removes "$%,)(][ characters
charVect <- gsub("[^A-Za-z0-9?.!?' \\-]", " ", charVect, perl=TRUE)
charVect <- gsub("^[ ]{1,}", "", charVect, perl=TRUE) # remove leading spaces
charVect <- gsub("[ ]{1,}$", "", charVect, perl=TRUE) # remove trailing spaces
# remove non-alpha char's that start sentences
charVect <- gsub("^[^A-Za-z]+", "", charVect)
# make lines that don't end in . ! or ? empty so they'll be removed later
charVect <- gsub("^.*[^.!?]$", "", charVect)
    # replace a trailing non-word-chars-plus-period sequence with just a period
    charVect <- gsub("([^A-Za-z0-9]+[.])$", ".", charVect)
# remove lines that don't have any alpha characters
charVect <- gsub("^[^A-Za-z]+$", "", charVect, perl=TRUE)
charVect <- cleanDashes(charVect)
charVect <- tokenizeIsHis(charVect) # 's to s
charVect <- cleanSingleQuotes(charVect)
# remove periods that start lines
charVect <- gsub("^[.]+", "", charVect, ignore.case=TRUE, perl=TRUE)
# remove embedded periods
charVect <- gsub("([a-z]+)([.]+)([a-z]+)", "\\1 \\3", charVect,
ignore.case=TRUE, perl=TRUE)
# replace space-period-space with just space
charVect <- gsub(" [.] ", " ", charVect, ignore.case=TRUE, perl=TRUE)
    # remove periods assoc'd w/ morning and evening time abbrev's
    # (word boundary and escaped dot prevent mangling words like "obama met")
    charVect <- gsub("\\ba m[.]", "am", charVect, ignore.case=TRUE, perl=TRUE)
    charVect <- gsub("\\bp m[.]", "pm", charVect, ignore.case=TRUE, perl=TRUE)
# remove empty lines
charVect <- charVect[which(charVect != "")]
# replace 2 or more spaces with a single space
charVect <- gsub("[ ]{2,}", " ", charVect, perl=TRUE)
# normalize text to lower case
charVect <- tolower(charVect)
# replace sequences of digits by NUM token: after lower case to keep
# this special token UPPER CASE in the processed file
charVect <- gsub("[0-9]+", "NUM", charVect)
# replace two or more periods in a row with a space
charVect <- gsub("[.]{2,}", " ", charVect, perl=TRUE)
# replace space followed by 1 or more minus signs followed by space
# with just a space
charVect <- gsub("[ ]+[-]+[ ]", " ", charVect, perl=TRUE)
# replace phrases enclosed in single quotes with just the phrase
charVect <- gsub("(?:[^a-zM]+)'([a-zNUM ]+)'(?:[^a-zN])",
" \\1 ", charVect, perl=TRUE)
# replace two or more spaces in a row (again) with a space to clean up any
# extra spaces generated by the prior gsub calls
charVect <- gsub("[ ]{2,}", " ", charVect, perl=TRUE)
# repeat to handle some embeddedness
charVect <- cleanDashes(charVect)
charVect <- cleanSingleQuotes(charVect)
# remove remaining non-alpha fragments
charVect <- gsub("[-']{2,}", " ", charVect, perl=TRUE)
cat("preEosClean: FINISHED pre-EOS marker cleaning at",
as.character(Sys.time()),".\n")
return(charVect)
}
## Adds an EOS marker at the end of each line of text (replacing the trailing
## . ! or ?) and removes any remaining . ! or ? characters.
## charVect - character vector containing the file lines to be processed
addEosMarkers <- function(charVect) {
cat("addEosMarkers: start adding EOS markers at",
as.character(Sys.time()),"...\n")
# add end of sentence markers to lines that ended in non-alpha chars
charVect <- gsub("([^a-z]+)$", " EOS", charVect, ignore.case=TRUE, perl=TRUE)
# remove all remaining periods, question marks, exclamation points:
charVect <- gsub("[.?!]", " ", charVect, perl=TRUE)
charVect <- gsub("[ ]{2,}", " ", charVect) # remove extra spaces
cat("addEosMarkers: FINISH adding EOS markers at",
as.character(Sys.time()),".\n")
return(charVect)
}
## Fixes various issues with common acronyms such as tv, us (United States), uk
## (United Kingdom), dc (District of Columbia), ie, eg
## Precondition - This function should be called AFTER all text has been
## converted to lower case (u s to US conversion)
## Note: Examples like line 1673063 in en_US.blogs.train.8posteos.txt show how
## even carefully crafted regexs are going to convert some segments
## incorrectly. Other misclassifications also exist.
postEosClean <- function(charVect) {
# tv
charVect <- gsub("([a-zNUM]+)( t v )([a-zNUMEOS']+)", "\\1 tv \\3",
charVect, perl=TRUE)
    # us - A vast majority of instances of ' u s ' should be interpreted as US.
    # A vast majority of instances of 'us' should be the word us (not an acronym)
charVect <- gsub("( u s -)([a-zEOSNUM']+) ", " US-\\2 ", charVect, perl=TRUE)
charVect <- gsub("([a-zNUM]+)( u s )([a-zNUMEOS']+)", "\\1 US \\3",
charVect, perl=TRUE)
charVect <- gsub("the us", "the US", charVect, perl=TRUE)
# uk
charVect <- gsub("( u k -)([a-zEOSNUM]+) ", " uk-\\2 ", charVect, perl=TRUE)
charVect <- gsub("([a-zNUM]+)( u k )([a-zNUMEOS']+)", "\\1 uk \\3",
charVect, perl=TRUE)
# la
charVect <- gsub("( l a -)([a-zEOSNUM]+) ", " la-\\2 ", charVect, perl=TRUE)
charVect <- gsub("([a-zNUM]+)( l a )([a-zNUMEOS']+)", "\\1 la \\3",
charVect, perl=TRUE)
# dc
charVect <- gsub("( d c -)([a-zEOSNUM]+) ", " DC-\\2 ", charVect, perl=TRUE)
charVect <- gsub("([a-zNUM]+)( d c )([a-zNUMEOS']+)", "\\1 DC \\3",
charVect, perl=TRUE)
# <something> -based
charVect <- gsub(" ([a-z]{2,}) -([a-zEOSNUM]+) ", " \\1-\\2 ", charVect, perl=TRUE)
# ie, eg
charVect <- gsub("([a-zNUM]+)( i e )([a-zNUMEOS]+)", "\\1 ie \\3",
charVect, perl=TRUE)
charVect <- gsub("([a-zNUM]+)( e g )([a-zNUMEOS]+)", "\\1 eg \\3",
charVect, perl=TRUE)
    # cd - things like 'vitamin c d' will not get processed correctly; dealing
    # with every possible contingency in the corpus is unrealistic, so some
    # mis-processing is going to have to be tolerated
charVect <- gsub("([a-zNUM]+)( c d )([a-zNUMEOS']+)", "\\1 cd \\3",
charVect, perl=TRUE)
    # concatenate adjacent single letters together - assume they should be joined
charVect <- gsub(" ([a-z]) ([a-z]) ", " \\1\\2 ", charVect, perl=TRUE)
# push together <something> 's
charVect <- gsub("([a-zA-Z']) ('s) ", " \\1\\2 ", charVect, perl=TRUE)
    # -<something> clean up
# charVect <- str_replace_all(charVect, " -([a-zNUMEOS'-]+) ", " \\1 ")
charVect <- str_replace_all(charVect, " -([a-zNUMEOS'-]+)", " \\1")
# '<something> clean up
charVect <- gsub(" '([a-zA-z]+)", " \\1", charVect, perl=TRUE)
# clean up misc fragments generated from prior op's
charVect <- cleanDashes(charVect, 'suspended')
charVect <- cleanSingleQuotes(charVect, 'suspended')
charVect <- cleanDashes(charVect, 'suspended')
return(charVect)
}