Worldwide mobile internet usage is projected to continue its rapid growth over the next few years. According to a Statista Fact Sheet, the percentage of mobile phone users accessing the internet is expected to rise from 48.8% in 2014 to 63.4% in 2019. This increased ownership has resulted in more people spending more time on mobile devices for email, social networking, banking, and other activities. Because typing on these devices is awkward and tedious, smart keyboard applications based on predictive text analytics have emerged to make typing easier.
The overall goal of this project was to develop a prototype predictive web application that suggests the next word in a message based on what the user has typed so far. For example, a user may type I love Italian and the application might suggest: food, shoes, or opera. This goal was broken down into four objectives:
In addition to these objectives, this document describes how the data was partitioned, how it was cleaned prior to n-gram table construction, and what exploratory data analysis (EDA) was done. This was the most time-consuming objective, which was no surprise given what has been reported in the literature.
The data was initially split into an 80% training set and a 20% test set. All cleaning, EDA, and parameter optimization work was performed on the training set. All code used to perform any analysis step is listed in the Appendix and referred to throughout this document.
Once the data was cleaned, unigram, bigram, and trigram frequency tables were constructed. These tables are at the heart of the data required by the model to make its predictions. Some EDA was performed on these tables to gain insights which would later serve the model development phase.
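To make the construction of these tables concrete, the sketch below shows one way unigram, bigram, and trigram frequency tables can be built from cleaned sentences using quanteda. This is an illustration only: the actual table-building code is covered in Part 2, and the ngramFreq helper shown here is hypothetical.

# Illustrative sketch: building n-gram frequency tables with quanteda
# (ngramFreq is a hypothetical helper, not part of the project code).
library(quanteda)

sents <- c("the cat sat on the mat EOS", "the cat ran EOS")  # toy cleaned input

ngramFreq <- function(x, n) {
    ngrams <- tokens_ngrams(tokens(x), n = n, concatenator = "_")
    sort(colSums(dfm(ngrams)), decreasing = TRUE)
}

unigrams <- ngramFreq(sents, 1)
bigrams  <- ngramFreq(sents, 2)
trigrams <- ngramFreq(sents, 3)
head(bigrams)  # e.g. the_cat occurs twice in this toy input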
A Katz Back-Off (KBO) Trigram Language Model (LM) was selected as the algorithm to make these predictions. A detailed description of how this algorithm works can be found here. The KBO Trigram was chosen for three main reasons:
As described here, the KBO Trigram has two parameters: the bigram discount rate (\(\gamma_2\)) and the trigram discount rate (\(\gamma_3\)). The default values for both of these parameters were set to 0.5 in the web app, but values which improved accuracy were obtained using cross-validation. Better-performing values for these parameters, and the process by which they were obtained, will be available here when this work has been completed:
This objective was addressed concurrently with each of the prior objectives by making sure that each function was properly documented, that all code was made available, and that links to all output files were provided. All pre-processing code (Objective #1) is provided in the Appendix section of this R Markdown document.
The results of meeting this objective can be found by visiting:
The data was originally downloaded from this link and stored locally. If this link is no longer available, the data can be obtained from my dropbox. The zip file was downloaded to a directory called data in a local project and unzipped there. The unzipped data contained four subdirectories: de_DE (German), en_US (US English), fi_FI (Finnish), and ru_RU (Russian). This project focuses on the English corpora residing in the en_US folder. This folder contains three files named en_US.blogs.txt, en_US.news.txt, and en_US.twitter.txt.
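For completeness, a minimal sketch of this download-and-unzip step is shown below. The URL and zip file name are placeholders standing in for the links above, not values taken from the project.

# Sketch only: download and unpack the corpus into a local "data" directory.
# data.url and the zip file name are placeholders for illustration.
data.url <- "<corpus zip URL>"              # see the download link above
dir.create("data", showWarnings = FALSE)
zip.path <- file.path("data", "corpus.zip")
download.file(data.url, zip.path, mode = "wb")
unzip(zip.path, exdir = "data")
list.files("data", recursive = TRUE)        # expect de_DE, en_US, fi_FI, ru_RU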
With the data residing locally, the writeTrainTestFiles function was used to read each of the three data files, partition each of them into 80% training and 20% hold-out test sets, and then write these two partitioned files locally. The training set was used to train the model and determine values for the two model parameters: the bigram discount rate and the trigram discount rate. Descriptions of what these parameters are can be found here.
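For context, the sketch below shows the standard absolute-discounting form in which such discount rates enter a Katz back-off model; the linked description should be treated as the authoritative reference for the exact formulation used in this project.

\[
P_{KBO}(w_i \mid w_{i-2}, w_{i-1}) =
\begin{cases}
\dfrac{c(w_{i-2}, w_{i-1}, w_i) - \gamma_3}{c(w_{i-2}, w_{i-1})} & \text{if } c(w_{i-2}, w_{i-1}, w_i) > 0 \\[2ex]
\alpha(w_{i-2}, w_{i-1}) \, P_{KBO}(w_i \mid w_{i-1}) & \text{otherwise}
\end{cases}
\]

Here \(c(\cdot)\) denotes a count from the n-gram tables and \(\alpha(w_{i-2}, w_{i-1})\) is the back-off weight that redistributes the probability mass removed by \(\gamma_3\); the bigram level is handled analogously, with \(\gamma_2\) discounting observed bigrams and backing off to unigram frequencies.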
After the data was read and partitioned into training and test sets, the training files were cleaned using the following steps:
The above steps were executed off-line (not in this R Markdown file) because they were time intensive. To speed future processing, the three cleaned en_US files were rewritten back to the local file system so they could be read in again later.
The parseSentsToFile function was used to parse the data into a single sentence per line and then write the resulting files out for later processing. Two files were generated for each of the original data files at this step. These files were:
Files ending in 1sents.txt are the files initially parsed into sentences, before fixing the issue of improper sentence breaks across the tokens St. SomeSaintsName. Files ending in 2sents.txt are the files that were run through the annealSaintErrors function to fix these errors.
Because the annealSaintErrors function didn’t fix all the St. SomeSaintsName parse errors, each 2sents.txt file was opened on Windows 10 in NotePad++ v6.9.2 (aka NP++) and the following steps were performed:
Find what: (St[.])(\r\n)([A-Z]+)
Replace with: \1 \3

If future versions of NP++ do not behave the same as described for v6.9.2, this version can be downloaded here.
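An equivalent repair can also be sketched in R; the snippet below only illustrates the same join logic as the NP++ find/replace above and is not the process that was actually used (fixSaintBreaks is a hypothetical helper).

# Sketch only: join a line ending in "St." with the following line when it
# begins with a capitalized word (mirrors the NP++ find/replace above).
fixSaintBreaks <- function(lines) {
    txt <- paste(lines, collapse = "\n")
    txt <- gsub("(St[.])\n([A-Z])", "\\1 \\2", txt, perl = TRUE)
    strsplit(txt, "\n", fixed = TRUE)[[1]]
}

fixSaintBreaks(c("We drove to St.", "Louis on Friday.", "It rained."))
# [1] "We drove to St. Louis on Friday." "It rained."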
Additional filtering was accomplished using a variety of functions which are described below. To make the filtering process more manageable, the runFilterAndWrite function was created to take a function as a parameter, run that function against an input file, and then write the results to a file. This reduced duplicate code by consolidating all the code that does file IO into a single function.
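As an illustration, a call for the ASCII-conversion step described in the next section might look like the following; the file suffixes are taken from the file names mentioned in this document, but the exact invocations used for each step are an assumption.

# Possible pipeline invocation (suffixes per the file names in the text;
# the exact calls used for each step are an assumption):
runFilterAndWrite(convertToAscii,
                  inFilePostfix  = ".2sents.txt",
                  outFilePostfix = ".3ascii.txt")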
After completion of sentence parsing, the resulting files were UTF-8 encoded but still contained some unusual characters. To clean out these characters, each file was run through the convertToAscii function by passing it to runFilterAndWrite. The convertToAscii function passed the contents through an ASCII encoder, and the results were written out with a .3ascii.txt suffix. Links to these files are provided below:
This step removed all lines that contained unusual characters, removing 29 lines from the blogs file, 14 lines from the news file, and only 5 lines from the twitter file. However, these files still contained a lot of tags of the form <U+hhhh> where h is a hex digit from 0 to F. These were unicode tags that were generated during the data partitioning step when the partitioned files were written back out. As a result, the next challenge was figuring out how to deal with these tags. The following four options were considered:
Options 1. and 2. were ruled out because there were so many of these tokens: over 600k in the blogs file, over 200k in the news file, and over 100k in the twitter file. Keeping them would have distorted the n-gram tables, while removing the affected lines would have reduced the n-gram counts. Either outcome would have reduced the quality of the n-gram tables used by the model and degraded its accuracy.
In evaluating the remaining two options, frequency tables were constructed using the writeUnicodeTagFreqTables function. The resulting tables were:
A vast majority of these tags were <U+FFFD>, which is a special code used to replace an unknown or unrepresentable character. Upon manual inspection, the most frequent use of this character that we would want to preserve was as a single quote in a contraction such as isn’t or doesn’t, or as a possessive form such as Tom’s or Mary’s. The convertUnicodeTags function was created and passed to runFilterAndWrite in order to replace unicode tags with either a single quote or a space (which gets cleaned up in a later step). The outputs from this function resulted in the following 3 files:
The removeUrls function was passed to runFilterAndWrite to remove URLs. Several of the most common forms were designed to be captured, and the frequency of those that were not captured was deemed acceptably low. The outputs from this function resulted in the following 3 files:
The preEosClean function was passed to runFilterAndWrite to do the filtering prior to EOS marker insertion. This step in the pipeline removed things like unneeded dashes and single quotes. The outputs from this function resulted in the following 3 files:
The addEosMarkers function was passed to runFilterAndWrite to do the EOS token insertion. The outputs from this function resulted in the following 3 files:
The postEosClean function was passed to runFilterAndWrite to do the filtering after EOS marker insertion. The outputs from this function resulted in the following 3 files:
The next steps in the data analysis pipeline are described in Predicting Next Word Using Katz Back-Off: Part 2 - N-grams and Exploratory Data Analysis.
# install packages if needed
list.of.packages <- c('dplyr', 'readr', 'stringr', 'quanteda')
new.packages <- list.of.packages[!(list.of.packages %in% installed.packages()[,"Package"])]
if(length(new.packages) > 0) install.packages(new.packages)
# load libraries
# libs <- c('dplyr', 'readr', 'quanteda')
lapply(list.of.packages, require, character.only=TRUE) # load libs
options(stringsAsFactors = FALSE) # strings are what we are operating on...
# set parameters
ddir <- "../data/en_US/" # assumes exec from dir at same level as data
fnames <- c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt")
fnames.train <- c("en_US.blogs.train.txt", "en_US.news.train.txt",
"en_US.twitter.train.txt")
## Reads the text corpus data file and returns a character array where every
## element is a line from the file.
## fileId = string, text fragment of file name to be read e.g. 'blogs', 'news',
## or 'twit'
## dataDir = path to data file to be read
## fnames = file names to be read which have fileId fragments
getFileLines <- function(fileId, dataDir=ddir, fileNames=fnames) {
    # use the fileNames argument (not the global fnames) so callers can pass
    # the training-set file names
    index <- grep(fileId, fileNames)
    if(length(index) == 0) {
        cat('getFileLines could not understand what file to read:', fileId)
        return(NULL)
    }
    fileLines <- read_lines(sprintf("%s%s", dataDir, fileNames[index]))
    return(fileLines)
}
## Breaks the en_US.<fileType>.txt into training and test sets and writes out
## these files.
## fileType - string, one of 3 values: 'blogs', 'news', or 'twitter'
## train.fraction - float between 0 and 1, fractional amount of data to be used
## in the training set
## dataDir - relative path to the data directory
writeTrainTestFiles <- function(fileType, train.fraction=0.8,
dataDir=ddir) {
set.seed(71198)
prefix <- "en_US."
in.postfix <- ".txt"
train.postfix <- ".train.txt"
test.postfix <- ".test.txt"
infile <- sprintf("%s%s%s%s", dataDir, prefix, fileType, in.postfix)
dat <- getFileLines(fileType)
line.count <- length(dat)
train.size <- as.integer(train.fraction * line.count)
test.size <- line.count - train.size
train.indices <- sample(1:line.count, train.size, replace=FALSE)
train.indices <- train.indices[order(train.indices)]
test.indices <- setdiff(1:line.count, train.indices)
train.set <- dat[train.indices]
ofile <- sprintf('%s%s%s%s', dataDir, prefix, fileType, train.postfix)
writeLines(train.set, ofile)
test.set <- dat[test.indices]
ofile <- sprintf('%s%s%s%s', dataDir, prefix, fileType, test.postfix)
writeLines(test.set, ofile)
# return(list(train=train.indices, test=test.indices))
}
## Returns a character vector where every element is a sentence of text.
##
## NOTE1: This function will improperly parse "St. Something" into 2 sentences.
## It makes other mistakes (e.g. Ph.D.) which one could spend a crazy amount of
## time fixing, but these other errors are ignored in the interest of time.
##
## To fix the "Saint" issue, the char vector returned by this function
## needs to be passing to the annealSaintErrors function to fix most
## (> 90% based on a manual analysis of the 1st 150k lines of the news
## file) of these errors.
##
## NOTE2: This function took over 22 hrs to run on my quad-core Xeon with
## 16Gb RAM on the twitter 80% training set.
##
## charVect - character vector where every element may contain 1 or more
## sentences of text
## check.status - the number of lines to process before writing a status
## message to the console
## Preconditions: This function requires the quanteda package.
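## NOTE3: tokenize() comes from the older quanteda API; in more recent quanteda
## releases the equivalent call is tokens(charVect, what="sentence").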
breakOutSentences <- function(charVect, check.status=10000) {
sentenceTokens <- tokenize(charVect, what="sentence")
sentNormCharVect <- vector(mode = "character")
counter <- 0
for(i in 1:length(sentenceTokens)) {
counter <- counter + 1
sent.tokenized.line <- sentenceTokens[[i]]
sentNormCharVect <- append(sentNormCharVect, sent.tokenized.line)
if(counter == check.status) {
completed <- (100*i) / length(sentenceTokens)
cat(i, "breakOutSentences: lines parsed to sentences ",
completed, "% completed", as.character(Sys.time()), "\n")
counter <- 0
}
}
return(sentNormCharVect)
}
## Repairs (anneals) sentences that were initially parsed improperly across
## the pattern "St. SomeSaintsName". NOTE: This function took many hours
## to complete on the training data sets.
annealSaintErrors <- function(charVect, status.check=10000) {
annealedSents <- vector(mode='character')
next.sent <- ""
i <- 1
counter <- 0
while(i < length(charVect)) {
counter <- counter + 1
curr.sent <- charVect[i]
next.sent <- charVect[i+1]
hasTerminalSt <- length(grep('(St[.])$', curr.sent)) > 0
if(hasTerminalSt) {
# sentence ends with St.: concat w/ following sentence
annealedSents <- append(annealedSents,
paste(curr.sent, next.sent))
i <- i + 1
} else {
annealedSents <- append(annealedSents, curr.sent)
}
i <- i + 1
if(counter == status.check) {
completed <- (100*i) / length(charVect)
cat(i, "annealSaintErrors: lines annealed ",
completed, "% completed", as.character(Sys.time()), "\n")
counter <- 0
}
}
annealedSents <- append(annealedSents, next.sent) # add last sentence
return(annealedSents)
}
## Returns the file name of the training set data given fileId which can be
## one of the 3 values: 'blogs', 'news', or 'twitter'. Returns an empty string
## (char vector), if fileId is not one of the 3 expected string values.
getInputDataFileName <- function(fileId) {
isBlogs <- length(grep(fileId, 'blogs')) > 0
isNews <- length(grep(fileId, 'news')) > 0
isTwitter <- length(grep(fileId, 'twitter')) > 0
if(isBlogs) return(fnames.train[1])
if(isNews) return(fnames.train[2])
if(isTwitter) return(fnames.train[3])
return("")
}
## Reads inFileName, parses each line into sentences, fixes most of the "Saint"
## parsing errors, and writes the results to files named:
## [original file name].1sents.txt after initial sentence parsing and
## [original file name].2sents.txt after fixing improper sentence breaks across
## the "St. SomeSaintName" tokens.
parseSentsToFile <- function(inFileType,
outDataDir=ddir,
outFilePostfix1=".1sents.txt",
outFilePostfix2=".2sents.txt") {
inFileName <- getInputDataFileName(inFileType)
outFileName1 <- str_replace(inFileName, '.txt', outFilePostfix1)
outFileName2 <- str_replace(inFileName, '.txt', outFilePostfix2)
outFilePath1 <- sprintf("%s%s", outDataDir, outFileName1)
outFilePath2 <- sprintf("%s%s", outDataDir, outFileName2)
cat("start parseSentsToFile:", as.character(Sys.time()), "\n")
cat("processing file:", inFileName, "\n")
cat("output will be written to:", outFilePath1, "\n")
flines <- getFileLines(fileId=inFileType, dataDir=ddir,
fileNames=fnames.train)
flines <- breakOutSentences(flines)
cat("parseSentsToFile breakOutSentences completed.", "\n")
writeLines(flines, con = outFilePath1)
cat("output written to:", outFilePath1, "\n")
cat("parseSentsToFile annealSaintErrors started...:", as.character(Sys.time()), "\n")
flines <- annealSaintErrors(flines)
writeLines(flines, con = outFilePath2)
cat("St. annealed file written to:", outFilePath2, "\n")
cat("finish parseSentsToFile:", as.character(Sys.time()), "\n")
}
## Removes all the non-ASCII characters from charVect and then returns a
## character vector that contains only ASCII characters. This function is
## intended to be passed to the runFilterAndWrite function.
convertToAscii <- function(charVect) {
    cat("convertToAscii: start UTF-8 to ASCII conversion...\n")
    charVectAscii <- iconv(charVect, from="UTF-8", to="ASCII")
    # drop the lines iconv could not convert (returned as NA); if there are
    # none, keep the vector intact rather than indexing with an empty vector
    na.indices <- which(is.na(charVectAscii))
    if(length(na.indices) > 0) charVect <- charVect[-na.indices]
    cat("convertToAscii: finished converting UTF-8 to ASCII.\n")
    return(charVect)
}
## Builds and writes out frequency tables on the unicode tags in the 3ascii.txt
## files
writeUnicodeTagFreqTables <-
function(index, dataDir="C:/data/dev/PredictNextKBO/data/en_US/") {
infiles <- c(sprintf('%s%s', dataDir, 'en_US.blogs.train.3ascii.txt'),
sprintf('%s%s', dataDir, 'en_US.news.train.3ascii.txt'),
sprintf('%s%s', dataDir, 'en_US.twitter.train.3ascii.txt'))
names(infiles) <- c('blogs', 'news', 'twitter')
    unicodePattern <- "<U[+][A-F0-9]{4}>"
    data <- read_lines(infiles[index])
    ucodes <- unlist(str_extract_all(data, unicodePattern))
    ucodesTable <- sort(table(ucodes), decreasing = TRUE)
    # as.integer drops the table class so the csv gets simple tag/freq columns
    write.csv(data.frame(tag=names(ucodesTable), freq=as.integer(ucodesTable)),
              sprintf('%s%s', names(infiles[index]), '.utags.csv'), row.names=FALSE)
}
## Replaces the unicode tag delimiting contractions and possessive forms
## with an ASCII single quote character in the character vector charVect,
## replaces all other unicode tags with spaces, and then returns the updated
## character vector. This function is intended to be passed to the
## runFilterAndWrite function.
convertUnicodeTags <- function(charVect) {
cat("convertUnicodeTags: start replacing unicode tags...\n")
    singleQuotePattern <- "([A-Za-z]{1})(<U[+][A-Fa-f0-9]{4}>)(s|d|ve|t|ll|re)"
    unicodePattern <- "<U[+][A-Fa-f0-9]{4}>"
    imFixPattern <- "([Ii])(<U[+][A-Fa-f0-9]{4}>)([mM])"
    charVectContractions <- str_replace_all(charVect, singleQuotePattern,
                                            "\\1'\\3")
charVectImFix <- str_replace_all(charVectContractions, imFixPattern,
"\\1'\\3")
# Replace remaining unicode tags with spaces because extra spaces
# will get cleaned up in a later pre-processing step.
charVectNoTags <- str_replace_all(charVectImFix, unicodePattern, ' ')
cat("convertUnicodeTags: FINISHED replacing unicode tags.\n")
return(charVectNoTags)
}
## Removes most URL's that start with either http, https, or www. from the
## character vector charVect and returns the resulting character vector.
## This function is intended to be passed to the runFilterAndWrite function.
removeUrls <- function(charVect) {
cat("removeUrls: start removing urls...\n")
# Build regex to remove URLs. No shorthand character classes in R,
# so need to create by hand
wordChars <- "A-Za-z0-9_\\-"
# urlRegex <- "(http|https)://[\w\-_]+(\.[\w\-_]+)+[\w\-.,@?^=%&:/~\\+#]*"
urlRegex1 <- sprintf("%s%s%s", "(http|https)(://)[", wordChars, "]+")
urlRegex2 <- sprintf("%s%s%s", "(\\.[", wordChars, "]+)+")
urlRegex2 <- sprintf("%s%s%s%s", urlRegex2, "[", wordChars, ".,@?^=%&:/~\\+#]*")
urlRegex <- sprintf("%s%s", urlRegex1, urlRegex2)
charVect <- gsub(urlRegex, "", charVect, perl=TRUE)
# clean up www.<something> instances that don't start with http(s)
urlRegexWww <- sprintf("%s%s%s%s", "( www\\.)[", wordChars, "]+", urlRegex2)
charVect <- gsub(urlRegexWww, "", charVect, perl=TRUE)
cat("removeUrls: FINISHED removing urls.\n")
return(charVect)
}
## Consolidates the tasks of reading data in and writing data out as part of
## filtering or cleaning the data.
## FUN - function to run against the input data
## dataDir - directory where the input is read and the output is written
## inFilePostfix - suffix of input data files that are read in and passed to FUN
## outFilePostfix - suffix of output data files that are written after FUN
##                  has processed the input.
## filePrefixes - prefixes of the files to be read in and written out
runFilterAndWrite <- function(FUN, dataDir=ddir, inFilePostfix, outFilePostfix,
filePrefixes=c('en_US.blogs.train',
'en_US.news.train',
'en_US.twitter.train')) {
infiles <- c(sprintf('%s%s%s', dataDir, filePrefixes[1], inFilePostfix),
sprintf('%s%s%s', dataDir, filePrefixes[2], inFilePostfix),
sprintf('%s%s%s', dataDir, filePrefixes[3], inFilePostfix))
names(infiles) <- c('blogs', 'news', 'twitter')
outfiles <- c(sprintf('%s%s%s', dataDir, filePrefixes[1], outFilePostfix),
sprintf('%s%s%s', dataDir, filePrefixes[2], outFilePostfix),
sprintf('%s%s%s', dataDir, filePrefixes[3], outFilePostfix))
names(outfiles) <- names(infiles)
cat("runFilterAndWrite: start running filter...\n")
for(i in names(infiles)) {
charVect <- read_lines(infiles[i])
charVectFiltered <- FUN(charVect)
writeLines(charVectFiltered, outfiles[i])
}
cat("convertUnicodeTags: FINISHED replacing unicode tags.\n")
}
## Removes dashes in a variety of forms.
## charVect - the character vector to have dashes cleaned
## dash.patterns - character vector that determines the type of dash removal to
## be done. There are 3 valid values: 'suspended', 'leading', or 'trailing'
cleanDashes <- function(charVect,
dash.patterns=c('suspended',
'leading',
'trailing')) {
if(length(grep("suspended", dash.patterns)) > 0) {
# remove suspended dash
charVect <- gsub("[ ]+[\\-]+[ ]+", " ", charVect, perl=TRUE)
}
if(length(grep("leading", dash.patterns)) > 0) {
# remove leading dash
charVect <- gsub("[ ]+[\\-]+", " ", charVect, perl=TRUE)
}
if(length(grep("trailing", dash.patterns)) > 0) {
# remove trailing dash
charVect <- gsub("[\\-]+[ ]+", " ", charVect, perl=TRUE)
}
return(charVect)
}
## The 's (prevalent in the twitter file) usually had one of two meanings:
## either 'his' or 'is'. Because making that distinction was not easy to codify,
## this function simply removes the leading single quote.
tokenizeIsHis <- function(charVect) {
    # keep the surrounding spaces so adjacent words are not fused together
    charVect <- gsub(" 's ", " s ", charVect, perl=TRUE)
    return(charVect)
}
## Handles various configurations where single quotes are used.
## charVect - the character vector to have single quotes cleaned
## quote.patterns - character vector that determines the type of single quote
## processing to be done. There are 3 valid values: 'suspended', 'leading',
## or 'trailing'
cleanSingleQuotes <- function(charVect,
quote.patterns=c('suspended',
'leading',
'trailing')) {
    all.patterns <- (length(grep("suspended", quote.patterns)) > 0) &&
                    (length(grep("leading", quote.patterns)) > 0) &&
                    (length(grep("trailing", quote.patterns)) > 0)
if(all.patterns) {
# deal with case of <something>' '<something else> which occurs in
# twitter files
charVect <- gsub("([a-zNUM]+)([']+[ ]+[']+)([a-zNUM]+)", "\\1 \\3", charVect, perl=TRUE)
}
if(length(grep("suspended", quote.patterns)) > 0) {
# remove 1 or more suspended single quotes
charVect <- gsub("[ ]+[']+[ ]+", " ", charVect, perl=TRUE)
}
if(length(grep("leading", quote.patterns)) > 0) {
# replace leading single quote space with space
charVect <- gsub("[ ]+[']+", " ", charVect, perl=TRUE)
}
if(length(grep("trailing", quote.patterns)) > 0) {
# replace trailing single quote space with space
charVect <- gsub("[']+[ ]+", " ", charVect, perl=TRUE)
}
return(charVect)
}
## Removes non-essential characters but keeps spaces and basic punctuation
## characters such as ? . ! ' - in a somewhat intelligent manner.
preEosClean <- function(charVect) {
cat("preEosClean: start pre-EOS marker cleaning at",
as.character(Sys.time()),"...\n")
# Remove anything that's not an alpha, digit (will replace digits with NUM
# later), or basic punctuation char. Removes "$%,)(][ characters
charVect <- gsub("[^A-Za-z0-9?.!?' \\-]", " ", charVect, perl=TRUE)
charVect <- gsub("^[ ]{1,}", "", charVect, perl=TRUE) # remove leading spaces
charVect <- gsub("[ ]{1,}$", "", charVect, perl=TRUE) # remove trailing spaces
# remove non-alpha char's that start sentences
charVect <- gsub("^[^A-Za-z]+", "", charVect)
# make lines that don't end in . ! or ? empty so they'll be removed later
charVect <- gsub("^.*[^.!?]$", "", charVect)
    # replace a trailing non-word-chars-plus-period sequence with just a period
    charVect <- gsub("([^A-Za-z0-9]+[.])$", ".", charVect)
# remove lines that don't have any alpha characters
charVect <- gsub("^[^A-Za-z]+$", "", charVect, perl=TRUE)
charVect <- cleanDashes(charVect)
charVect <- tokenizeIsHis(charVect) # 's to s
charVect <- cleanSingleQuotes(charVect)
# remove periods that start lines
charVect <- gsub("^[.]+", "", charVect, ignore.case=TRUE, perl=TRUE)
# remove embedded periods
charVect <- gsub("([a-z]+)([.]+)([a-z]+)", "\\1 \\3", charVect,
ignore.case=TRUE, perl=TRUE)
# replace space-period-space with just space
charVect <- gsub(" [.] ", " ", charVect, ignore.case=TRUE, perl=TRUE)
    # remove periods assoc'd w/ morning and evening time abbrev's
    # (word boundary and escaped dot prevent mangling words like "obama met")
    charVect <- gsub("\\ba m[.]", "am", charVect, ignore.case=TRUE, perl=TRUE)
    charVect <- gsub("\\bp m[.]", "pm", charVect, ignore.case=TRUE, perl=TRUE)
# remove empty lines
charVect <- charVect[which(charVect != "")]
# replace 2 or more spaces with a single space
charVect <- gsub("[ ]{2,}", " ", charVect, perl=TRUE)
# normalize text to lower case
charVect <- tolower(charVect)
# replace sequences of digits by NUM token: after lower case to keep
# this special token UPPER CASE in the processed file
charVect <- gsub("[0-9]+", "NUM", charVect)
# replace two or more periods in a row with a space
charVect <- gsub("[.]{2,}", " ", charVect, perl=TRUE)
# replace space followed by 1 or more minus signs followed by space
# with just a space
charVect <- gsub("[ ]+[-]+[ ]", " ", charVect, perl=TRUE)
# replace phrases enclosed in single quotes with just the phrase
charVect <- gsub("(?:[^a-zM]+)'([a-zNUM ]+)'(?:[^a-zN])",
" \\1 ", charVect, perl=TRUE)
# replace two or more spaces in a row (again) with a space to clean up any
# extra spaces generated by the prior gsub calls
charVect <- gsub("[ ]{2,}", " ", charVect, perl=TRUE)
# repeat to handle some embeddedness
charVect <- cleanDashes(charVect)
charVect <- cleanSingleQuotes(charVect)
# remove remaining non-alpha fragments
charVect <- gsub("[-']{2,}", " ", charVect, perl=TRUE)
cat("preEosClean: FINISHED pre-EOS marker cleaning at",
as.character(Sys.time()),".\n")
return(charVect)
}
## Adds an EOS marker at the end of each line of text (replacing the trailing
## . ! or ?) and removes any remaining . ! or ? characters.
## charVect - character vector containing the file lines to be processed
addEosMarkers <- function(charVect) {
cat("addEosMarkers: start adding EOS markers at",
as.character(Sys.time()),"...\n")
# add end of sentence markers to lines that ended in non-alpha chars
charVect <- gsub("([^a-z]+)$", " EOS", charVect, ignore.case=TRUE, perl=TRUE)
# remove all remaining periods, question marks, exclamation points:
charVect <- gsub("[.?!]", " ", charVect, perl=TRUE)
charVect <- gsub("[ ]{2,}", " ", charVect) # remove extra spaces
cat("addEosMarkers: FINISH adding EOS markers at",
as.character(Sys.time()),".\n")
return(charVect)
}
## Fixes various issues with common acronyms such as tv, us (United States), uk
## (United Kingdom), dc (District of Columbia), ie, eg
## Precondition - This function should be called AFTER all text has been
## converted to lower case (u s to US conversion)
## Note: Examples like line 1673063 in en_US.blogs.train.8posteos.txt show how
## even carefully crafted regexs are going to convert some segments
## incorrectly. Other misclassifications also exist.
postEosClean <- function(charVect) {
# tv
charVect <- gsub("([a-zNUM]+)( t v )([a-zNUMEOS']+)", "\\1 tv \\3",
charVect, perl=TRUE)
    # us - A vast majority of instances of ' u s ' should be interpreted as US.
    # A vast majority of instances of 'us' should be the word us (not an acronym)
charVect <- gsub("( u s -)([a-zEOSNUM']+) ", " US-\\2 ", charVect, perl=TRUE)
charVect <- gsub("([a-zNUM]+)( u s )([a-zNUMEOS']+)", "\\1 US \\3",
charVect, perl=TRUE)
charVect <- gsub("the us", "the US", charVect, perl=TRUE)
# uk
charVect <- gsub("( u k -)([a-zEOSNUM]+) ", " uk-\\2 ", charVect, perl=TRUE)
charVect <- gsub("([a-zNUM]+)( u k )([a-zNUMEOS']+)", "\\1 uk \\3",
charVect, perl=TRUE)
# la
charVect <- gsub("( l a -)([a-zEOSNUM]+) ", " la-\\2 ", charVect, perl=TRUE)
charVect <- gsub("([a-zNUM]+)( l a )([a-zNUMEOS']+)", "\\1 la \\3",
charVect, perl=TRUE)
# dc
charVect <- gsub("( d c -)([a-zEOSNUM]+) ", " DC-\\2 ", charVect, perl=TRUE)
charVect <- gsub("([a-zNUM]+)( d c )([a-zNUMEOS']+)", "\\1 DC \\3",
charVect, perl=TRUE)
# <something> -based
charVect <- gsub(" ([a-z]{2,}) -([a-zEOSNUM]+) ", " \\1-\\2 ", charVect, perl=TRUE)
# ie, eg
charVect <- gsub("([a-zNUM]+)( i e )([a-zNUMEOS]+)", "\\1 ie \\3",
charVect, perl=TRUE)
charVect <- gsub("([a-zNUM]+)( e g )([a-zNUMEOS]+)", "\\1 eg \\3",
charVect, perl=TRUE)
    # cd - things like 'vitamin c d' will not get processed correctly; dealing
    # with every possible contingency in the corpus is unrealistic, so some
    # mis-processing is going to have to be tolerated
charVect <- gsub("([a-zNUM]+)( c d )([a-zNUMEOS']+)", "\\1 cd \\3",
charVect, perl=TRUE)
    # concatenate adjacent single letters together - assume they should be joined
charVect <- gsub(" ([a-z]) ([a-z]) ", " \\1\\2 ", charVect, perl=TRUE)
# push together <something> 's
charVect <- gsub("([a-zA-Z']) ('s) ", " \\1\\2 ", charVect, perl=TRUE)
    # -<something> clean up
# charVect <- str_replace_all(charVect, " -([a-zNUMEOS'-]+) ", " \\1 ")
charVect <- str_replace_all(charVect, " -([a-zNUMEOS'-]+)", " \\1")
# '<something> clean up
charVect <- gsub(" '([a-zA-z]+)", " \\1", charVect, perl=TRUE)
# clean up misc fragments generated from prior op's
charVect <- cleanDashes(charVect, 'suspended')
charVect <- cleanSingleQuotes(charVect, 'suspended')
charVect <- cleanDashes(charVect, 'suspended')
return(charVect)
}