library(tm)
library(RWeka)
The source data used for this project exists in the following URL:
https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip
The downloaded file has been unzipped in the working directory, and the following function is used to read the corpora from the given folder/file structure into a vector of strings. It is written to be general in terms of localization, character set, data sources, and number of sampled lines:
loading.data <- function(localization, charset="UTF-8", src=c("news", "blogs", "twitter"), samples=-1){
corpora <- c()
# loop for each dataset/source
for(source in src){
# build up the path of the file to be loaded (e.g. "./final/en_US/en_US.blogs.txt")
source.file <- paste0("./final/", localization, "/", localization, ".", source, ".txt")
# read the text file lines into vector of strings
source.lines <- readLines(source.file, skipNul=TRUE, encoding=charset, warn=FALSE, n=samples)
# combine all requested datasets/sources into one corpus
corpora <- c(corpora, source.lines)
}
return(corpora)
}
Each dataset has been loaded separately to get its own basic summaries in the next step:
news.data <- loading.data("en_US", src="news")
blogs.data <- loading.data("en_US", src="blogs")
twitter.data <- loading.data("en_US", src="twitter")
Wikipedia may be a good external data set to augment the model.
We don’t want our model to predict any numbers or special characters; we will focus only on words to serve the auto-complete functionality we are going to build. To do this, we took a simple and fast approach: exclude all characters other than letters, keeping the apostrophe, dot, and dash characters only when they appear within words (e.g. don’t, e-mail, ph.d.).
tokenization <- function(x){
# transliterate to latin1 and replace unconvertible characters with a space
x <- iconv(x, from="UTF-8", to="latin1", sub=" ")
# convert the whole string to lower-case
x <- tolower(x)
# remove all digits and special characters but letters, space, apostrophe, dot, and dash characters
# to keep counting words like: don't, u.s.a, and e-mail
x <- gsub("[^a-z'. -]", " ", x)
# remove apostrophe, dot, and dash characters if they are at the beginning or end of the line
x <- gsub("^['.-]", "", x)
x <- gsub("['.-]$", "", x)
# remove apostrophe, dot, and dash characters if they are at the beginning or end of the word
x <- gsub(" ['.-]", " ", x)
x <- gsub("['.-] ", " ", x)
# strip extra spaces
x <- gsub(" {2,}", " ", x)
return(x)
}
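As a quick sanity check, the tokenizer can be applied to a short made-up string (the example input below is purely illustrative):
# digits and the exclamation mark are dropped, while the apostrophe in "don't"
# and the dash in "e-mail" are kept because they appear inside words
tokenization("Don't e-mail me 123 times!")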
It is not proper to let our model predict any bad words, so we will exclude them from the corpora before building the model. Several resources on the web provide lists of bad words; we used the simple list published at http://www.bannedwordlist.com/.
To perform this task we utilise the R Text Mining (tm) package, first creating a corpus object from our corpora data, which is loaded as a vector of strings:
x <- VCorpus(VectorSource(x))
Then we pass that corpus object to the following filtering function:
profanity.filtering <- function(x){
# assuming the bad words list has been downloaded and saved in the project working directory
bad.words <- readLines("./bad_words.txt")
x <- tm_map(x, removeWords, bad.words)
# removing a bad word may leave two consecutive spaces (the ones before and after it)
x <- tm_map(x, stripWhitespace)
return(x)
}
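Putting the steps together, a minimal usage sketch (using the Twitter dataset purely as an example) would be:
# tokenize the raw lines, wrap them in a tm corpus, then filter out profanity
x <- tokenization(twitter.data)
x <- VCorpus(VectorSource(x))
x <- profanity.filtering(x)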
Please note that it is the bad words list that should be excluded from the corpora, NOT the stop words list, which contains the most common words in the language (in other words, stop words may appear at the top of our prediction list).
# to get the number of lines in a loaded corpus x
length(x)
# to get the length of the longest line in a given corpus x
max(nchar(x))
# to get the (approximate) total number of words in a given corpus x
sum(sapply(gregexpr("\\s", x), length) + 1)
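A minimal sketch that collects these summaries for the three datasets loaded above (the helper name corpora.summary is only illustrative, and the word count is an approximation based on whitespace):
corpora.summary <- function(x){
  clean <- tokenization(x)  # word counts are taken after cleanup (see the note below the table)
  c(lines   = length(x),
    words   = sum(sapply(gregexpr("\\s", clean), length) + 1),
    longest = max(nchar(x)))  # the longest line is measured on the raw text
}
sapply(list(news=news.data, blogs=blogs.data, twitter=twitter.data), corpora.summary)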
| Language | Dataset | # Lines | # Words | Longest Line |
|---|---|---|---|---|
| en_US | News | 77259 | 2609413 | 5760 |
| en_US | Blogs | 899288 | 37246919 | 40833 |
| en_US | Twitter | 2360148 | 30649855 | 140 |
Please note that the total number of words for each dataset has been calculated after cleaning up the corpora, while the longest line has been calculated from the raw data as written in the corpus file.
To build models we don’t need to load in and use all of the data, so we created the following function to sample it, with the flexibility to define several subsets via a vector of splitting ratios and an associated name for each subset. This function also saves each subset to a separate file as a checkpoint, to avoid re-running all the previous steps each time.
sampling.data <- function(x, pr=c(0.7, 0.2, 0.1), txt=c("training", "testing", "validation")){
# randomly assign each line to one of the subsets according to the given splitting ratios
index <- sample(1:length(pr), size=length(x), prob=pr, replace=TRUE)
for(i in 1:length(pr)){
# save each subset in a separate file as a checkpoint
sub.sample <- x[index==i]
writeLines(sub.sample, paste0("./", txt[i], ".txt"))
}
# return the first subset (the training set by default)
return(x[index==1])
}
Using this function we split our corpus into training (70%), testing (20%), and validation (10%) subsets.
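A minimal usage sketch, applied here to the combined raw datasets for illustration (the seed value is an assumption, added only to make the split reproducible):
set.seed(1234)  # assumed seed value
training.data <- sampling.data(c(news.data, blogs.data, twitter.data))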
In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sequence of text or speech. The items can be phonemes, syllables, letters, words or base pairs according to the application. The n-grams typically are collected from a text or speech corpus source.
# tokenizer that extracts single words (unigrams)
UniGramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min=1, max=1))
# build the term-document matrix, aggregate the term counts, and convert them to probabilities
uni.gram <- TermDocumentMatrix(x, control=list(tokenize=UniGramTokenizer))
uni.freq <- findFreqTerms(uni.gram)
uni.freq <- sort(rowSums(as.matrix(uni.gram[uni.freq,])), decreasing=TRUE)
uni.prob <- uni.freq/sum(uni.freq)
barplot(uni.prob[1:20], las=3, main="Top Unigrams by Probability", ylab="Probability")
# the same steps repeated for word pairs (bigrams)
BiGramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min=2, max=2))
bi.gram <- TermDocumentMatrix(x, control=list(tokenize=BiGramTokenizer))
bi.freq <- findFreqTerms(bi.gram)
bi.freq <- sort(rowSums(as.matrix(bi.gram[bi.freq,])), decreasing=TRUE)
bi.prob <- bi.freq/sum(bi.freq)
barplot(bi.prob[1:20], las=3, main="Top Bigrams by Probability", ylab="Probability")
We will use the created n-gram frequency tables to calculate the probability of the next word occurring given the previous words. We can use a dictionary to reduce the size required to store the model by referring to each word by a number, and we may also exclude rare n-grams that have no chance of appearing in the suggestions.
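As a concrete illustration of the idea, the conditional probability of a next word can be estimated from the unigram and bigram counts computed above as count(w1 w2) / count(w1); a minimal sketch follows (the function name next.word.prob is hypothetical):
# hypothetical helper: estimate P(next | previous) from the counts computed above
next.word.prob <- function(previous, next.word){
  bigram <- paste(previous, next.word)
  if(!(previous %in% names(uni.freq)) || !(bigram %in% names(bi.freq))) return(0)
  unname(bi.freq[bigram] / uni.freq[previous])
}
next.word.prob("new", "york")  # example query; both words must occur in the corpus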
This prediction algorithm will be implemented in a simple Shiny app with an input field where the user can type text; the application will interactively list the top 5 suggested words to auto-complete the current word, and the predictions will be refined by filtering the suggestions according to the letters of the word typed so far.
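A minimal sketch of such an app is shown below; predict.next.words is a hypothetical placeholder for the prediction function described above, and the UI element names are assumptions:
library(shiny)

# hypothetical placeholder: returns the most frequent unigrams and ignores the
# typed text, until the real n-gram model is plugged in
predict.next.words <- function(text, n=5) head(names(uni.freq), n)

ui <- fluidPage(
  textInput("user_text", "Type your text:"),
  verbatimTextOutput("suggestions")
)

server <- function(input, output){
  output$suggestions <- renderText({
    paste(predict.next.words(input$user_text, 5), collapse=", ")
  })
}

shinyApp(ui=ui, server=server)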
To avoid resource limitations when calculating n-gram frequencies, you may benefit from this available resource: Google Research Blog, All Our N-gram are Belong to You.