The goal of the Capstone project is to develop a predictive text model from a large, unstructured corpus of the English language. The data come from a corpus called HC Corpora, which was collected from publicly available sources (blogs, news articles and tweets) by a web crawler.
This milestone report focuses on exploratory analysis of the data set.
The final product will be a Shiny web application that, based on user input, presents five next-word candidates. The details for this will be covered in a subsequent report.
#######################################################################################
# Read in the data
#######################################################################################
blogs <- readLines("en_US.blogs.txt", warn=FALSE, encoding="UTF-8", skipNul=TRUE)
news <- readLines("en_US.news.txt", warn=FALSE, encoding="UTF-8", skipNul=TRUE)
twitter <- readLines("en_US.twitter.txt", warn=FALSE, encoding="UTF-8", skipNul=TRUE)
combine <- c(blogs, news, twitter)
#######################################################################################
# Explore the data
#######################################################################################
# Function to compute the number of words
wordCount <- function(x) sum(stringr::str_count(x,"\\S+"))
# Function to compute the number of unique words
uniqueCount <- function(x){
sum(stringr::str_count(toString(unique(unlist(strsplit(x,"\\s+|[[:punct:]]")))),"\\S+"))
}
# Function to compute the median and maximum number of characters
medNChar <- function(x) median(nchar(x))
maxNChar <- function(x) max(nchar(x))
# Compute lines, words, unique words, unique-to-total ratio, and median/max characters
corpora <- list(blogs, news, twitter, combine)
LineCount <- sapply(corpora, length)
WordCount <- sapply(corpora, wordCount)
UniqueCount <- sapply(corpora, uniqueCount)
UniqueRatio <- UniqueCount/WordCount
MedNChar <- sapply(corpora, medNChar)
MaxNChar <- sapply(corpora, maxNChar)
##             Lines       Words  Uniques  Unique/Total  MedNChar  MaxNChar
## Blogs     899,288  37,334,131  450,711         1.21%       156    40,833
## News       77,259   2,643,969   95,150         3.60%       186     5,760
## Twitter 2,360,148  30,373,583  462,731         1.52%        64       140
## Combine 3,336,695  70,351,683  774,838         1.10%        73    40,833
The bulk of the 70+ million words in the combined English corpus comes from blogs, followed closely by tweets, with a much smaller share from news articles.
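For reference, these shares can be computed directly from the WordCount vector built in the exploration step above (a quick check, not part of the original script):
# Percentage of total words contributed by each source
shares <- round(100 * WordCount[1:3] / WordCount[4], 1)
names(shares) <- c("blogs", "news", "twitter")
shares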
We split the data into a training set (60%), a validation set (20%) and a test set (20%).
library(caTools)  # provides sample.split
combine <- readRDS("combine.rds")
# Randomize the combine data
set.seed(123)
combine <- sample(combine, length(combine))
set.seed(123)
split <- sample.split(combine, SplitRatio=0.6)
train <- subset(combine, split==TRUE)
rest <- subset(combine, split==FALSE)
split2 <- sample.split(rest, SplitRatio=0.5)
validate <- subset(rest, split2==TRUE)
test <- subset(rest, split2==FALSE)
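As a quick sanity check (not part of the original split script), the relative sizes of the three sets can be compared against the 60/20/20 target:
# Proportion of lines falling into each of the three sets
round(c(train=length(train), validate=length(validate), test=length(test)) / length(combine), 2)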
We create the training corpus using VCorpus (volatile corpus) rather than Corpus (simple corpus), because with the latter the document term matrix returned 1-grams when I created 2-grams using the tm and RWeka packages. This behaviour is also reported in "Creating N-Grams with tm & RWeka - works with VCorpus but not Corpus".
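For reference, this is roughly the tm/RWeka setup the paragraph above refers to; per that observation it returns proper 2-grams with VCorpus but only 1-grams with Corpus (a sketch; the name BigramTokenizer is illustrative):
library(tm)
library(RWeka)
# Weka-based bigram tokenizer passed to DocumentTermMatrix
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min=2, max=2))
dtm.2g.weka <- DocumentTermMatrix(VCorpus(VectorSource(train)),
                                  control=list(tokenize=BigramTokenizer))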
The steps taken to clean the corpus, such as removing punctuation and converting to lower case, are indicated in the comment lines below. I remove profanity using the list of banned words maintained at http://www.bannedwordlist.com/.
I did not remove stopwords or stem the documents, since we want to predict the next word from the user's input string and doing so would produce n-grams that lose contextual information.
# Make corpus
library(tm)  # VCorpus, tm_map, removeNumbers, removeWords, stripWhitespace
corpus.train <- VCorpus(VectorSource(train))
# Transliterate accented Latin characters to plain ASCII
convertUnicode <- function (x) stringi::stri_trans_general(x, "latin-ascii")
corpus.train <- tm_map(corpus.train, content_transformer(convertUnicode))
# Separate words separated by "-" or "/"
toSpace <- content_transformer(function(x, pattern) gsub(pattern," ", x, perl=TRUE))
corpus.train <- tm_map(corpus.train, toSpace, "-")
corpus.train <- tm_map(corpus.train, toSpace, "/")
# Remove all punctuation except apostrophes
removeSpecial <- function(x) gsub(".*?($|'|[^[:punct:]]).*?", "\\1", x, perl=TRUE)
corpus.train <- tm_map(corpus.train, content_transformer(removeSpecial))
# Remove emojis and other non-graphical characters
corpus.train <- tm_map(corpus.train, toSpace, "[^[:graph:]']")
# Convert to lower case
corpus.train <- tm_map(corpus.train, content_transformer(tolower))
# Remove numbers
corpus.train <- tm_map(corpus.train, removeNumbers)
# Remove words on the banned word list from http://www.bannedwordlist.com/
# (removeWords expects a plain character vector, so no VectorSource wrapper)
swearwords <- readLines("swearWords.txt", warn=FALSE, encoding="UTF-8", skipNul=TRUE)
corpus.train <- tm_map(corpus.train, removeWords, swearwords)
# Remove errant "'s" introduced by the above steps
corpus.train <- tm_map(corpus.train, toSpace, " 's")
# Remove errant " ' " introduced by the above steps
corpus.train <- tm_map(corpus.train, toSpace, " ' ")
# Strip whitespace
corpus.train <- tm_map(corpus.train, stripWhitespace)
saveRDS(corpus.train, "corpus.train.rds")
Initially I used the RWeka package to create the n-grams. The resulting tokens looked odd because apostrophes were removed; for example, "don't" became "don" and "t" and was counted as two words. I found that the Weka-based n-gram tokenizer strips punctuation during tokenization, so I switched to the NLP package for tokenization.
I build each set of n-grams (1-grams, 2-grams, 3-grams, 4-grams and 5-grams) in a separate script because of memory issues. After creating the document term matrix, I keep only terms that occur more than 100 times to avoid memory-allocation failures. For the pentagram, I could only keep terms occurring in more than 100 documents, again because lower thresholds ran into memory limits.
I would have preferred a much lower threshold, keeping as many terms as possible to help predict unseen word combinations, but am restricted by computational power.
My bigram.R script is shown below; it ranks the terms by their frequency of occurrence.
# Create bigram
library(tm)    # DocumentTermMatrix, words()
library(slam)  # col_sums, colapply_simple_triplet_matrix
corpus.train <- readRDS("corpus.train.rds")
# Bigram tokenizer; only terms that appear more than 100 times are kept
gram.2 <- function(x) unlist(lapply(NLP::ngrams(words(x),2), paste, collapse=" "), use.names=FALSE)
dtm.2g <- DocumentTermMatrix(corpus.train, control=list(tokenize=gram.2))
dtm.2g
colTot2 <- col_sums(dtm.2g)
dtm.2gs <- dtm.2g[,which(colTot2>100)]
dtm.2gs
freq.2g <- colapply_simple_triplet_matrix(dtm.2gs,FUN=sum)
freq.2g <- sort(freq.2g, decreasing=T)
df.2g <- data.frame(word=names(freq.2g), frequency=freq.2g, row.names=NULL, stringsAsFactors=FALSE)
saveRDS(df.2g, "df.2g.rds")
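The scripts for the other orders follow the same pattern; only the tokenizer changes. A generic version could look like this (gram.n is my own illustrative helper, not one of the actual scripts, and it assumes corpus.train from above):
library(tm)
# Build an NLP-based tokenizer for any n-gram order; gram.2 above is the n = 2 case
gram.n <- function(n) {
  function(x) unlist(lapply(NLP::ngrams(words(x), n), paste, collapse=" "), use.names=FALSE)
}
dtm.3g <- DocumentTermMatrix(corpus.train, control=list(tokenize=gram.n(3)))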
The reduced document term matrices for the n-grams are shown below:
dtm.1gs
## <<DocumentTermMatrix (documents: 2002017, terms: 18499)>>
## Non-/sparse entries: 37353158/36997959325
## Sparsity : 100%
## Maximal term length: 17
## Weighting : term frequency (tf)
dtm.2gs
## <<DocumentTermMatrix (documents: 2002017, terms: 45728)>>
## Non-/sparse entries: 24006787/91524226589
## Sparsity : 100%
## Maximal term length: 22
## Weighting : term frequency (tf)
dtm.3gs
## <<DocumentTermMatrix (documents: 2002017, terms: 21552)>>
## Non-/sparse entries: 5987081/43141483303
## Sparsity : 100%
## Maximal term length: 30
## Weighting : term frequency (tf)
dtm.4gs
## <<DocumentTermMatrix (documents: 2002017, terms: 3564)>>
## Non-/sparse entries: 743271/7134445317
## Sparsity : 100%
## Maximal term length: 38
## Weighting : term frequency (tf)
dtm.5gs
## <<DocumentTermMatrix (documents: 2002017, terms: 427)>>
## Non-/sparse entries: 76677/854784582
## Sparsity : 100%
## Maximal term length: 47
## Weighting : term frequency (tf)
Next, we plot the top 20 terms for each n-gram with ggplot.
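As an illustration, the top-20 plot for the bigrams can be produced along these lines (a sketch, assuming the df.2g table saved by bigram.R, which is already sorted by frequency):
library(ggplot2)
# Plot the 20 most frequent bigrams
top20 <- head(df.2g, 20)
ggplot(top20, aes(x=reorder(word, frequency), y=frequency)) +
  geom_col() +
  coord_flip() +
  labs(x="Bigram", y="Frequency", title="Top 20 bigrams in the training corpus")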
The next step is to create a predictive model that reads user input and predicts the top five next words using the stupid backoff algorithm. The idea is to first match the last four words of the input sentence against the first four words of the pentagrams. If fewer than five matches are found, match the last three words of the input against the first three words of the quadgrams. If the cumulative number of matches is still below five, repeat with the trigrams and then the bigrams. Finally, if the cumulative count is still below five, fill the remaining slots with the top unigrams.
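A minimal sketch of this backoff lookup is given below. It assumes frequency tables df.1g through df.5g analogous to df.2g (columns word and frequency, sorted by frequency); matchGram and predictNext are illustrative names, not part of the final model.
# Sketch of the backoff lookup over the saved n-gram frequency tables.
# Each df.Ng is assumed to have columns: word (the full n-gram) and frequency.
matchGram <- function(df, prefix) {
  # n-grams starting with the prefix, ranked by frequency
  hits <- df[grep(paste0("^", prefix, " "), df$word), ]
  # keep only the final (predicted) word of each matching n-gram
  sub(".* ", "", hits$word[order(hits$frequency, decreasing=TRUE)])
}
predictNext <- function(input, n=5) {
  tokens <- unlist(strsplit(tolower(input), "\\s+"))
  tokens <- tokens[nzchar(tokens)]
  candidates <- character(0)
  # back off from a 4-word prefix (pentagrams) down to a 1-word prefix (bigrams)
  for (k in 4:1) {
    if (length(tokens) < k) next
    prefix <- paste(tail(tokens, k), collapse=" ")
    df <- list(df.2g, df.3g, df.4g, df.5g)[[k]]
    candidates <- unique(c(candidates, matchGram(df, prefix)))
    if (length(candidates) >= n) return(head(candidates, n))
  }
  # still short of n candidates: fill with the most frequent unigrams
  head(unique(c(candidates, df.1g$word)), n)
}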
We will also look at the Kneser-Ney smoothing algorithm, which takes lower-order n-grams into consideration instead of only matching from the highest-order n-gram downwards. We will then test the algorithms on the test set created earlier.
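For reference, the interpolated Kneser-Ney estimate for a bigram takes the standard form (not derived from the scripts above):
$$P_{KN}(w_i \mid w_{i-1}) = \frac{\max\big(c(w_{i-1} w_i) - d,\, 0\big)}{c(w_{i-1})} + \lambda(w_{i-1})\, P_{cont}(w_i), \qquad P_{cont}(w_i) = \frac{\lvert\{w' : c(w' w_i) > 0\}\rvert}{\lvert\{(w', w'') : c(w' w'') > 0\}\rvert}$$
where $d$ is a fixed discount and $\lambda(w_{i-1})$ is chosen so the probabilities sum to one; the continuation probability $P_{cont}$ is what brings the lower-order information into the estimate.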
One challenge I foresee is how to return the top candidate words quickly, given the large data size and the computation involved.
The word-prediction algorithm will finally be implemented as a Shiny app. As the user types and presses the space bar, five next-word candidates will be displayed along with their scores.
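A minimal sketch of what that Shiny app might look like is below, assuming the predictNext helper sketched earlier; scores are omitted here because that sketch only returns ranked candidates.
library(shiny)
ui <- fluidPage(
  textInput("phrase", "Type a phrase:"),
  tableOutput("suggestions")
)
server <- function(input, output) {
  output$suggestions <- renderTable({
    req(nzchar(input$phrase))
    # Five next-word candidates for the current input
    cand <- predictNext(input$phrase, n=5)
    data.frame(rank=seq_along(cand), candidate=cand)
  })
}
shinyApp(ui, server)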