This project develops N-gram predictive text models. Specifically, we will build a trigram Katz backoff model, which predicts the next word from the preceding two words. The project is involved, so we document its development in a series of successive RMarkdown files.
This document is the first of the series and covers data preparation. The data preparation tasks are:
- clean up the corpus
- divide the corpus into two parts: one for training the models and one for testing them
- create unigram, bigram, and trigram data sets from the training corpus

The outcomes of data preparation are:
- the training corpus
- the test corpus
- three .rds files containing, respectively, the unigrams, bigrams, and trigrams
The second RMarkdown file, “exploreData”, will explore the three n-gram data sets. As we will see, the bigram and trigram sets both contain a huge number of items and their files are large, so that document also discusses ways to reduce the bigram and trigram data sets.
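To show where the data prepared here is headed, the sketch below illustrates, in rough form, how the unigram, bigram, and trigram tables built later in this document could drive a back-off style lookup. It is only a minimal sketch: the helper name predict_next is ours, and the Katz discounting and backoff weights are deliberately omitted and left to a later document.

```r
library(dplyr)

# Minimal back-off lookup sketch (no Katz discounting): given two context words,
# prefer a matching trigram, then a matching bigram, then the most frequent word.
# Assumes the unigs/bigrs/trigrs tables created later (columns: ngram, freq).
predict_next <- function(w1, w2, unigs, bigrs, trigrs) {
  tri_hits <- trigrs %>%
    filter(startsWith(ngram, paste(w1, w2, ""))) %>%    # trigrams beginning "w1 w2 "
    arrange(desc(freq))
  if (nrow(tri_hits) > 0)
    return(sub("^\\S+ \\S+ ", "", tri_hits$ngram[1]))   # return the third word

  bi_hits <- bigrs %>%
    filter(startsWith(ngram, paste(w2, ""))) %>%         # back off to bigrams "w2 ..."
    arrange(desc(freq))
  if (nrow(bi_hits) > 0)
    return(sub("^\\S+ ", "", bi_hits$ngram[1]))          # return the second word

  unigs$ngram[which.max(unigs$freq)]                     # final fallback: top unigram
}
```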
The initial corpus was downloaded from the site. It consists of three sets of documents in three different languages. In this project we use the English set, which includes three plain text files:
- en_US.blogs.txt
- en_US.news.txt
- en_US.twitter.txt

They are placed in the folder “./data/”.
The following code provides some basic statistics of the corpus: file sizes, lines of text, and word counts. In summary, the three files together are about 550 MB in size, have over 4 million lines of text, and contain about 100 million words.
An exploratory analysis and inspection of the texts found that they contain non-English tokens, including foreign words and misspellings. We use an English word list to remove incorrect or non-English words from the N-gram data sets that are created initially. The English word list was downloaded from the web; it contains about 77,000 entries and is saved as “en_US.txt” in the folder “./dictionary/”.
## blogs
size_blog <- file.size("./data/en_US.blogs.txt")/2^20 # 200MB
blogs <- readLines("./data/en_US.blogs.txt")
lines_blog <- length(blogs) # 899,288 lines
words_blog <- unlist(strsplit(blogs, " "))
word_count_blog <- length(words_blog) #37,334,131
## news
size_news <- file.size("./data/en_US.news.txt")/2^20 # 196MB
news <- readLines("./data/en_US.news.txt")
lines_news <- length(news) # 1,010,242 lines
words_news <- unlist(strsplit(news, " "))
word_count_news <- length(words_news) #34,372,530
## twitter
size_twitter <- file.size("./data/en_US.twitter.txt")/2^20 # 159MB
twitter <- readLines("./data/en_US.twitter.txt")
lines_twitter <- length(twitter) # 2,360,148 lines
words_twitter <- unlist(strsplit(twitter, " "))
word_count_twitter <- length(words_twitter) #30,373,543
## English word list
en_US <- readLines("./dictionary/en_US.txt")
entries <- length(en_US) # 77,722 words
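For reference, the per-file figures above can be combined into the corpus-wide totals quoted in the summary (a small addition that only reuses the variables computed in the chunk above):

```r
# combine the per-file statistics into corpus-wide totals
total_size  <- size_blog + size_news + size_twitter                     # ~555 MB
total_lines <- lines_blog + lines_news + lines_twitter                  # ~4.27 million lines
total_words <- word_count_blog + word_count_news + word_count_twitter   # ~102 million words
```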
Before creating the training and test corpora, we first clean the original corpus and create a clean corpus of three documents:
- blog.txt
- news.txt
- twitter.txt

The new corpus will be placed in the folder “./cleanCorpus/”.
In the following code chunks, we primarily employ functions from the tm package to clean up the corpus. Because of the size of the original corpus and the number of transformations applied, the code takes a long time to run.
library(tm)
library(dplyr)
# create a Corpus object from the three text files in ./data/
corpus <- VCorpus(DirSource("./data/"),
                  readerControl = list(language = "en_US"))
# meta(corpus[[1]])
clean <- corpus %>%
  tm_map(stripWhitespace) %>%                  # collapse repeated whitespace
  tm_map(content_transformer(function(x) gsub("[^[:alnum:][:space:]'`]", " ", x))) %>%  # keep only letters, digits, spaces, apostrophes and backticks
  #tm_map(removePunctuation) %>%
  tm_map(removeNumbers) %>%                    # drop digits
  tm_map(content_transformer(tolower))         # lower-case everything
rm(corpus); gc()
Remove profanity words

The project requires that profane words be removed. There are various lists of profanity words; we chose the data set profanity_arr_bad provided by the lexicon package, which contains a character vector of 343 profanity words.
library(lexicon)
data("profanity_arr_bad")
rm_profanity <- clean %>%
  tm_map(removeWords, profanity_arr_bad)   # drop profane words from every document
# write the cleaned, profanity-free corpus to ./cleanCorpus/ (skip if already written)
if(!file.exists("./cleanCorpus/blog.txt"))
  writeCorpus(rm_profanity,
              path = "./cleanCorpus/",
              filenames = c("blog.txt", "news.txt", "twitter.txt"))
rm(clean, rm_profanity); gc()
The corpus is partitioned into a training corpus (80% of the lines of the cleaned corpus) and a test corpus (the remaining 20%).
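The chunks below write into the folders “./training/” and “./test/” and assume they already exist; the small guard below is our addition and simply creates them if they are missing.

```r
# create the output folders for the training/test split if they do not exist yet
for (d in c("./training", "./test")) {
  if (!dir.exists(d)) dir.create(d)
}
```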
set.seed(123)
## blogs
blogs <- readLines("./cleanCorpus/blog.txt")
index <- sample(1:length(blogs),
                round(length(blogs)*.8))
train.blogs <- blogs[index]
test.blogs <- blogs[-index]
if(!file.exists("./training/train.blogs.txt")){
  writeLines(train.blogs, "./training/train.blogs.txt")
  writeLines(test.blogs, "./test/test.blogs.txt")
}
rm(blogs, train.blogs, test.blogs); gc()
## news
news <- readLines("./cleanCorpus/news.txt")
index <- sample(1:length(news),
                round(length(news)*.8))
train.news <- news[index]
test.news <- news[-index]
if(!file.exists("./training/train.news.txt")){
  writeLines(train.news, "./training/train.news.txt")
  writeLines(test.news, "./test/test.news.txt")
}
rm(news, train.news, test.news); gc()
## twitter
twitter <- readLines("./cleanCorpus/twitter.txt")
index <- sample(1:length(twitter),
                round(length(twitter)*.8))
train.twitter <- twitter[index]
test.twitter <- twitter[-index]
if(!file.exists("./training/train.twitter.txt")){
  writeLines(train.twitter, "./training/train.twitter.txt")
  writeLines(test.twitter, "./test/test.twitter.txt")
}
rm(twitter, train.twitter, test.twitter); gc()
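As an optional sanity check (our addition, not part of the original pipeline), the training and test files of each source should together contain exactly as many lines as the cleaned file they were split from; for example, for the blogs:

```r
# optional check: the 80/20 split should preserve the total number of lines
n_all   <- length(readLines("./cleanCorpus/blog.txt"))
n_train <- length(readLines("./training/train.blogs.txt"))
n_test  <- length(readLines("./test/test.blogs.txt"))
stopifnot(n_train + n_test == n_all)
```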
We will create unigrams, bigrams, and trigrams from the training corpus. We first merge the three documents of the corpus into a tibble with a single variable, in which each row contains one line of text from the corpus.
We employ the tidytext package to tokenize the corpus. Because of the size of the corpus and its large vocabulary (i.e., the large number of unigrams), creating the bigram and trigram data takes a long time. The three initial n-gram data sets are further cleaned up by removing items that contain “words” not in the English word list mentioned earlier.
The resulting unigrams, bigrams, and trigrams are saved in three files in the directory “./ngrams/”:
- unigs.rds: the unigrams
- bigrs.rds: the bigrams
- trigrs.rds: the trigrams
The following code chunks create and save these ngrams.
The code reports the sizes of the n-gram sets:
- unigrams: the initial set (unigram) contains 516,911 unigrams (words); after filtering out those not in the English word list (en_US), we get a unigram set (unigs) of 63,602 words (compared with about 77,000 words in the word list).
- bigrams: the initial set (bigram) contains 12,829,552 items; even after removing items containing words not in the English word list, 8,420,988 items remain in bigrs. The file “bigrs.rds” is 42 MB.
- trigrams: the initial set (trigram) contains 42,327,380 items; after removing items containing words not in the word list, 31,780,053 trigrams remain in trigrs. The file “trigrs.rds” is 183 MB.
In summary, we have produced three n-gram data files: unigs.rds, bigrs.rds, and trigrs.rds. Both “bigrs.rds” and “trigrs.rds” contain large numbers of items and are large files.
library(tidytext)
library(dplyr)
library(ggplot2)
library(tidyr)
en_US <- readLines("./dictionary/en_US.txt") # list of about 77,000 English words
blogs <- readLines("./training/train.blogs.txt")
news <- readLines("./training/train.news.txt")
twitter <- readLines("./training/train.twitter.txt")
tf_text <- tibble(text = c(blogs, news, twitter))
head(tf_text)
## unigram
unigram <- tf_text %>%
  unnest_tokens(ngram, text) %>%
  count(ngram, sort=TRUE)
length(unigram$ngram) # 516,911 unigrams
unigs <- unigram %>%
  filter(ngram %in% en_US) %>%
  rename(freq=n)
length(unigs$ngram) # 63,602
if(!file.exists("./ngrams/unigs.rds"))
  saveRDS(unigs, "./ngrams/unigs.rds")
## bigram
bigram <- tf_text %>%
  unnest_tokens(ngram, text, token="ngrams", n=2) %>%
  count(ngram, sort=TRUE)
length(bigram$ngram) # 12,829,552
## remove bigrams that contain words not in the English word list en_US
bigrs <- bigram %>%
  separate(ngram, c("w2", "w1"), sep = " ") %>%
  filter((w2 %in% en_US) & (w1 %in% en_US)) %>%
  unite(ngram, c("w2","w1"), sep = " ") %>%
  rename(freq=n)
length(bigrs$ngram) # 8,420,988
if(!file.exists("./ngrams/bigrs.rds"))
  saveRDS(bigrs, "./ngrams/bigrs.rds")
## trigram
## note: it takes about 40 minutes to get the trigrams; the main issue may be memory
trigram <- tf_text %>%
  unnest_tokens(ngram, text, token="ngrams", n=3) %>%
  count(ngram, sort=TRUE)
length(trigram$ngram) # 42,327,380
trigrs <- trigram %>%
  separate(ngram, c("w3", "w2", "w1"), sep = " ") %>%
  filter((w3 %in% en_US) & (w2 %in% en_US) & (w1 %in% en_US)) %>%
  unite(ngram, c("w3","w2","w1"), sep = " ") %>%
  rename(freq=n)
length(trigrs$ngram) # 31,780,053
if(!file.exists("./ngrams/trigrs.rds"))
  saveRDS(trigrs, "./ngrams/trigrs.rds")
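To confirm that the saved files match the figures reported above, the tables can be read back and inspected. This verification snippet is our addition and is not required by the pipeline.

```r
# read the saved n-gram tables back in and check their sizes
unigs  <- readRDS("./ngrams/unigs.rds")
bigrs  <- readRDS("./ngrams/bigrs.rds")
trigrs <- readRDS("./ngrams/trigrs.rds")
c(nrow(unigs), nrow(bigrs), nrow(trigrs))      # 63,602 / 8,420,988 / 31,780,053
file.size("./ngrams/bigrs.rds")  / 2^20        # ~42 MB
file.size("./ngrams/trigrs.rds") / 2^20        # ~183 MB
```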