This project develops N-gram predictive text models. Specifically, we will build a trigram Katz backoff model, which predicts the next word from the preceding two words. The project is involved, so we document its development in a series of successive RMarkdown files.
This document is the first of the series and covers data preparation. The data preparation tasks are:
- clean up the corpus
- divide the corpus into two parts: one for training the models and one for testing them
- create unigram, bigram, and trigram data sets from the training corpus

The outcomes of data preparation are:
- the training corpus
- the test corpus
- three .rds files containing, respectively, the unigrams, bigrams, and trigrams
The second RMarkdown file, “exploreData”, will explore the three n-gram data sets. As we will see, the bigram and trigram sets both contain a huge number of items and their files are large, so that document also discusses ways to reduce the bigram and trigram data sets.
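To show where the data prepared here is headed, the sketch below illustrates, in rough form, how the unigram, bigram, and trigram tables built later in this document could drive a back-off style lookup. It is only a minimal sketch: the helper name predict_next is ours, and the Katz discounting and backoff weights are deliberately omitted and left to a later document.

```r
library(dplyr)

# Minimal back-off lookup sketch (no Katz discounting): given two context words,
# prefer a matching trigram, then a matching bigram, then the most frequent word.
# Assumes the unigs/bigrs/trigrs tables created later (columns: ngram, freq).
predict_next <- function(w1, w2, unigs, bigrs, trigrs) {
  tri_hits <- trigrs %>%
    filter(startsWith(ngram, paste(w1, w2, ""))) %>%    # trigrams beginning "w1 w2 "
    arrange(desc(freq))
  if (nrow(tri_hits) > 0)
    return(sub("^\\S+ \\S+ ", "", tri_hits$ngram[1]))   # return the third word

  bi_hits <- bigrs %>%
    filter(startsWith(ngram, paste(w2, ""))) %>%         # back off to bigrams "w2 ..."
    arrange(desc(freq))
  if (nrow(bi_hits) > 0)
    return(sub("^\\S+ ", "", bi_hits$ngram[1]))          # return the second word

  unigs$ngram[which.max(unigs$freq)]                     # final fallback: top unigram
}
```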
The initial corpus was downloaded from the site. It consists of three sets of documents in three different languages. In this project we use the English set, which includes three plain text files:
- en_US.blogs.txt
- en_US.news.txt
- en_US.twitter.txt

They are placed in the folder “./data/”.
The following code provides some basic statistics of the corpus: file sizes, lines of text, and word counts. In summary, the three files together are about 550 MB in size, have over 4 million lines of text, and contain about 100 million words.
An exploratory analysis and inspection of the texts found that they contain non-English tokens, including foreign words and misspellings. We use an English word list to remove incorrect or non-English words from the N-gram data sets that are created initially. The English word list was downloaded from the web; it contains about 77,000 entries and is saved as “en_US.txt” in the folder “./dictionary/”.
## blogs
size_blog <- file.size("./data/en_US.blogs.txt")/2^20 # 200MB
blogs <- readLines("./data/en_US.blogs.txt")
lines_blog <- length(blogs) # 899,288 lines
words_blog <- unlist(strsplit(blogs, " "))
word_count_blog <- length(words_blog) #37,334,131
## news
size_news <- file.size("./data/en_US.news.txt")/2^20 # 196MB
news <- readLines("./data/en_US.news.txt")
lines_news <- length(news) # 1,010,242 lines
words_news <- unlist(strsplit(news, " "))
word_count_news <- length(words_news) #34,372,530
## twitter
size_twitter <- file.size("./data/en_US.twitter.txt")/2^20 # 159MB
twitter <- readLines("./data/en_US.twitter.txt")
lines_twitter <- length(twitter) # 2,360,148 lines
words_twitter <- unlist(strsplit(twitter, " "))
word_count_twitter <- length(words_twitter) #30,373,543
## English word list
en_US <- readLines("./dictionary/en_US.txt")
entries <- length(en_US) # 77,722 words
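For reference, the per-file figures above can be combined into the corpus-wide totals quoted in the summary (a small addition that only reuses the variables computed in the chunk above):

```r
# combine the per-file statistics into corpus-wide totals
total_size  <- size_blog + size_news + size_twitter                     # ~555 MB
total_lines <- lines_blog + lines_news + lines_twitter                  # ~4.27 million lines
total_words <- word_count_blog + word_count_news + word_count_twitter   # ~102 million words
```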
Before creating the training and test corpora, we first clean the original corpus and create a clean corpus of three documents:
- blog.txt
- news.txt
- twitter.txt

The new corpus will be placed in the folder “./cleanCorpus/”.
In the following code chunks, we primarily employ functions from the tm package to clean up the corpus. Because of the size of the original corpus and the number of transformations applied, the code takes a long time to run.
library(tm)
library(dplyr)
# create a Corpus object from the three text files in ./data/
corpus <- VCorpus(DirSource("./data/"),
                  readerControl = list(language = "en_US"))
# meta(corpus[[1]])
clean <- corpus %>%
  tm_map(stripWhitespace) %>%                  # collapse repeated whitespace
  tm_map(content_transformer(function(x) gsub("[^[:alnum:][:space:]'`]", " ", x))) %>%  # keep only letters, digits, spaces, apostrophes and backticks
  #tm_map(removePunctuation) %>%
  tm_map(removeNumbers) %>%                    # drop digits
  tm_map(content_transformer(tolower))         # lower-case everything
rm(corpus); gc()
Remove profanity words

The project requires that profane words be removed. There are various lists of profanity words; we chose the data set profanity_arr_bad provided by the lexicon package, which contains a character vector of 343 profanity words.
library(lexicon)
data("profanity_arr_bad")
rm_profanity <- clean %>%
  tm_map(removeWords, profanity_arr_bad)   # drop profane words from every document
# write the cleaned, profanity-free corpus to ./cleanCorpus/ (skip if already written)
if(!file.exists("./cleanCorpus/blog.txt"))
  writeCorpus(rm_profanity,
              path = "./cleanCorpus/",
              filenames = c("blog.txt", "news.txt", "twitter.txt"))
rm(clean, rm_profanity); gc()
The corpus is partitioned into a training corpus (80% of the lines of the cleaned corpus) and a test corpus (the remaining 20%).
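The chunks below write into the folders “./training/” and “./test/” and assume they already exist; the small guard below is our addition and simply creates them if they are missing.

```r
# create the output folders for the training/test split if they do not exist yet
for (d in c("./training", "./test")) {
  if (!dir.exists(d)) dir.create(d)
}
```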
set.seed(123)
## blogs
blogs <- readLines("./cleanCorpus/blog.txt")
index <- sample(1:length(blogs),
                round(length(blogs)*.8))
train.blogs <- blogs[index]
test.blogs <- blogs[-index]
if(!file.exists("./training/train.blogs.txt")){
  writeLines(train.blogs, "./training/train.blogs.txt")
  writeLines(test.blogs, "./test/test.blogs.txt")
}
rm(blogs, train.blogs, test.blogs); gc()
## news
news <- readLines("./cleanCorpus/news.txt")
index <- sample(1:length(news),
                round(length(news)*.8))
train.news <- news[index]
test.news <- news[-index]
if(!file.exists("./training/train.news.txt")){
  writeLines(train.news, "./training/train.news.txt")
  writeLines(test.news, "./test/test.news.txt")
}
rm(news, train.news, test.news); gc()
## twitter
twitter <- readLines("./cleanCorpus/twitter.txt")
index <- sample(1:length(twitter),
                round(length(twitter)*.8))
train.twitter <- twitter[index]
test.twitter <- twitter[-index]
if(!file.exists("./training/train.twitter.txt")){
  writeLines(train.twitter, "./training/train.twitter.txt")
  writeLines(test.twitter, "./test/test.twitter.txt")
}
rm(twitter, train.twitter, test.twitter); gc()
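As an optional sanity check (our addition, not part of the original pipeline), the training and test files of each source should together contain exactly as many lines as the cleaned file they were split from; for example, for the blogs:

```r
# optional check: the 80/20 split should preserve the total number of lines
n_all   <- length(readLines("./cleanCorpus/blog.txt"))
n_train <- length(readLines("./training/train.blogs.txt"))
n_test  <- length(readLines("./test/test.blogs.txt"))
stopifnot(n_train + n_test == n_all)
```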
We will create unigrams, bigrams, and trigrams from the training corpus. We first merge the three documents of the corpus into a tibble with a single variable, in which each row contains one line of text from the corpus.
We employ the tidytext package to tokenize the corpus. Because of the size of the corpus and its large vocabulary (i.e., the large number of unigrams), creating the bigram and trigram data takes a long time. The three initial n-gram data sets are further cleaned up by removing items that contain “words” not in the English word list mentioned earlier.
The resulting unigrams, bigrams, and trigrams are saved in three files in the directory “./ngrams/”:
- unigs.rds: the unigrams
- bigrs.rds: the bigrams
- trigrs.rds: the trigrams
The following code chunks create and save these ngrams.
The code reports the sizes of the n-gram sets:
- unigrams: the initial set (unigram) contains 516,911 unigrams (words); after filtering out those not in the English word list (en_US), we get a unigram set (unigs) of 63,602 words (compared with about 77,000 words in the word list).
- bigrams: the initial set (bigram) contains 12,829,552 items; even after removing items containing words not in the English word list, 8,420,988 items remain in bigrs. The file “bigrs.rds” is 42 MB.
- trigrams: the initial set (trigram) contains 42,327,380 items; after removing items containing words not in the word list, 31,780,053 trigrams remain in trigrs. The file “trigrs.rds” is 183 MB.
In summary, we have produced three n-gram data files: unigs.rds, bigrs.rds, and trigrs.rds. Both “bigrs.rds” and “trigrs.rds” contain large numbers of items and are large files.
library(tidytext)
library(dplyr)
library(ggplot2)
library(tidyr)
en_US <- readLines("./dictionary/en_US.txt") # list of about 77,000 English words
blogs <- readLines("./training/train.blogs.txt")
news <- readLines("./training/train.news.txt")
twitter <- readLines("./training/train.twitter.txt")
tf_text <- tibble(text = c(blogs, news, twitter))
head(tf_text)
## unigram
unigram <- tf_text %>%
  unnest_tokens(ngram, text) %>%
  count(ngram, sort=TRUE)
length(unigram$ngram) # 516,911 unigrams
unigs <- unigram %>%
  filter(ngram %in% en_US) %>%
  rename(freq=n)
length(unigs$ngram) # 63,602
if(!file.exists("./ngrams/unigs.rds"))
  saveRDS(unigs, "./ngrams/unigs.rds")
## bigram
bigram <- tf_text %>%
  unnest_tokens(ngram, text, token="ngrams", n=2) %>%
  count(ngram, sort=TRUE)
length(bigram$ngram) # 12,829,552
## remove bigrams that contain words not in the English word list en_US
bigrs <- bigram %>%
  separate(ngram, c("w2", "w1"), sep = " ") %>%
  filter((w2 %in% en_US) & (w1 %in% en_US)) %>%
  unite(ngram, c("w2","w1"), sep = " ") %>%
  rename(freq=n)
length(bigrs$ngram) # 8,420,988
if(!file.exists("./ngrams/bigrs.rds"))
  saveRDS(bigrs, "./ngrams/bigrs.rds")
## trigram
## note: it takes about 40 minutes to get the trigrams; the main issue may be memory
trigram <- tf_text %>%
  unnest_tokens(ngram, text, token="ngrams", n=3) %>%
  count(ngram, sort=TRUE)
length(trigram$ngram) # 42,327,380
trigrs <- trigram %>%
  separate(ngram, c("w3", "w2", "w1"), sep = " ") %>%
  filter((w3 %in% en_US) & (w2 %in% en_US) & (w1 %in% en_US)) %>%
  unite(ngram, c("w3","w2","w1"), sep = " ") %>%
  rename(freq=n)
length(trigrs$ngram) # 31,780,053
if(!file.exists("./ngrams/trigrs.rds"))
  saveRDS(trigrs, "./ngrams/trigrs.rds")
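To confirm that the saved files match the figures reported above, the tables can be read back and inspected. This verification snippet is our addition and is not required by the pipeline.

```r
# read the saved n-gram tables back in and check their sizes
unigs  <- readRDS("./ngrams/unigs.rds")
bigrs  <- readRDS("./ngrams/bigrs.rds")
trigrs <- readRDS("./ngrams/trigrs.rds")
c(nrow(unigs), nrow(bigrs), nrow(trigrs))      # 63,602 / 8,420,988 / 31,780,053
file.size("./ngrams/bigrs.rds")  / 2^20        # ~42 MB
file.size("./ngrams/trigrs.rds") / 2^20        # ~183 MB
```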