The Data

In this project, the en_US dataset is used as the corpus for the predictive models. The table below gives some basic statistics of the files in the en_US folder:

File Name           File Size (Bytes)   Number of Lines   Max Line Length   Number of Words
en_US.blogs.txt     210160014           899288            40833             37334114
en_US.news.txt      205811889           1010242           11384             34365936
en_US.twitter.txt   167105338           2360148           173               30359852
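
These statistics can be reproduced with a short script along the following lines (a sketch only: it assumes the three files sit in a local corpus folder and uses the stringi package for word counting, so the exact figures may differ slightly depending on encoding and how words are delimited).

# Sketch: basic statistics for each file (assumes files are in "corpus/";
# word counts use stringi and may differ slightly from the table above)
library(stringi)
files <- c("corpus/en_US.blogs.txt", "corpus/en_US.news.txt", "corpus/en_US.twitter.txt")
stats <- t(sapply(files, function(f) {
  lines <- readLines(f, encoding = "UTF-8", skipNul = TRUE)
  c(FileSizeBytes = file.size(f),
    NumberOfLines = length(lines),
    MaxLineLength = max(nchar(lines)),
    NumberOfWords = sum(stri_count_words(lines)))
}))
stats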

Data Exploration

As shown in the table above, each of the files contains more than 30 million words. To speed up the exploration, 10,000 randomly sampled lines drawn from the combined en_US.blogs.txt, en_US.news.txt, and en_US.twitter.txt files were used.

# Load the packages used in this report
library(tm)
library(ggplot2)

# Sample 10000 lines from the corpus with a fixed random seed
set.seed(90910)
blogs   <- readLines("corpus/en_US.blogs.txt")
news    <- readLines("corpus/en_US.news.txt")
twitter <- readLines("corpus/en_US.twitter.txt")
corpus  <- c(blogs, news, twitter)
sample.index <- sample(length(corpus), 10000)
sample <- corpus[sample.index]
# Number of words in the sample (one possible way to count; the original
# counting code was not shown in the report)
sum(sapply(strsplit(sample, "\\s+"), length))
## [1] 229277

The sampled text (229277 words) was preprocessed with tolower, removePunctuation, removeNumbers, and stripWhitespace, and the default English stopwords from the tm package were removed.

# Tokenizers for bigrams and trigrams (RWeka)
BigramTokenizer <- function(x){
  RWeka::NGramTokenizer(x, RWeka::Weka_control(min = 2, max = 2))
}

TrigramTokenizer <- function(x){
  RWeka::NGramTokenizer(x, RWeka::Weka_control(min = 3, max = 3))
}

# Compute a sorted term-frequency vector with the preprocessing described above
computeTF <- function(doc, tokenizer="words"){
  ctrl <- list(tokenize=tokenizer,   # termFreq expects the option name 'tokenize'
               tolower=TRUE,
               removePunctuation=TRUE,
               removeNumbers=TRUE,
               stripWhitespace=TRUE,
               stopwords=stopwords("english")
  )
  tf <- termFreq(PlainTextDocument(doc), control=ctrl)
  sort(tf, decreasing=TRUE)
}
# Exploring the sampled data
tf.1 <- computeTF(sample)
tf.2 <- computeTF(sample, tokenizer=BigramTokenizer)
tf.3 <- computeTF(sample, tokenizer=TrigramTokenizer)

The following plots show the 20 most frequent unigrams, bigrams, and trigrams in the sampled text.

# Print out the top 20 most frequent unigrams
termFrequency <- data.frame(unigram=names(tf.1)[1:20], frequency=as.numeric(tf.1)[1:20])
g <- ggplot(termFrequency, aes(x=reorder(unigram, frequency), y=frequency)) +
    geom_bar(stat = "identity") +  coord_flip() +
    theme(legend.title=element_blank()) +
    xlab("Unigram") + ylab("Frequency") +
    labs(title = "Top Unigrams by Frequency")
print(g)

# Print out the top 20 most frequent bigrams
termFrequency <- data.frame(bigram=names(tf.2)[1:20], frequency=as.numeric(tf.2)[1:20])
g <- ggplot(termFrequency, aes(x=reorder(bigram, frequency), y=frequency)) +
    geom_bar(stat = "identity") +  coord_flip() +
    theme(legend.title=element_blank()) +
    xlab("Bigram") + ylab("Frequency") +
    labs(title = "Top Bigrams by Frequency")
print(g)

# Print out the top 20 most frequent trigrams
termFrequency <- data.frame(trigram=names(tf.3)[1:20], frequency=as.numeric(tf.3)[1:20])
g <- ggplot(termFrequency, aes(x=reorder(trigram, frequency), y=frequency)) +
    geom_bar(stat = "identity") +  coord_flip() +
    theme(legend.title=element_blank()) +
    xlab("Trigram") + ylab("Frequency") +
    labs(title = "Top Trigrams by Frequency")
print(g)

The provided dataset is very large, so it is essential to understand how well a randomly sampled subset represents the complete dataset. In the script above, the unigram, bigram, and trigram frequencies were derived from the sampled text and stored in the tf.1, tf.2, and tf.3 objects. The following table summarizes the coverage statistics of unigrams, bigrams, and trigrams in the sampled text.

N-gram     Distinct n-grams   % needed to cover 50%   Cover 75%   Cover 90%   % of text from n-grams with freq > 1
unigram    22653              3.53                    16.28       50.09       88.19
bigram     117970             14.51                   55.87       82.35       53.13
trigram    177795             44.06                   72.03       88.81       15.66

As shown in the table, the 3.53% most frequent unigrams (799 terms) cover 50% of the sampled text, 3688 terms (16.28%) cover 75%, and 11346 terms (50.09%) cover 90%. Also, only 9298 of the unigrams occur more than once (41.05%), yet these terms account for 88.19% of all unigram occurrences in the sample.
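
The coverage figures above were derived from the sorted term-frequency vectors. A helper along the following lines (a sketch; coverageStats is a hypothetical name, not necessarily the exact code behind the table) illustrates the computation for tf.1.

# Sketch: coverage statistics from a term-frequency vector such as tf.1
coverageStats <- function(tf, targets = c(Cover50 = 0.50, Cover75 = 0.75, Cover90 = 0.90)) {
  tf  <- sort(as.numeric(tf), decreasing = TRUE)   # token counts, most frequent first
  cum <- cumsum(tf) / sum(tf)                      # cumulative share of all occurrences
  needed <- sapply(targets, function(p) which(cum >= p)[1])
  nontrivial <- sum(tf[tf > 1]) / sum(tf)          # share of occurrences from terms seen more than once
  c(TotalNgrams = length(tf),
    round(100 * needed / length(tf), 2),           # % of distinct terms needed for each target
    CoverNonTrivial = round(100 * nontrivial, 2))
}
coverageStats(tf.1)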

For bigrams and trigrams there are many more distinct combinations, and more of them occur with very low frequencies. Hence, more terms are needed to cover the same percentage of the text.

We further investigated the bigrams and trigrams. For a sample of 10,000 lines from the twitter file alone, there are 71590 distinct bigrams and 101005 distinct trigrams. Even at this sample size the tables are sizable, so we expect the n-gram tables for the whole dataset to be very large. In order to implement the word prediction application with limited resources, we will reduce the total number of n-grams using the smoothing methods suggested by Good-Turing frequency estimation and Katz's back-off model, and then try a hashing mechanism to compress the n-gram tables stored in system memory.
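
For reference, Good-Turing estimation replaces a raw count r with the adjusted count r* = (r + 1) * N(r+1) / N(r), where N(r) is the number of n-grams observed exactly r times. The sketch below (an illustrative helper, not the planned implementation) computes these adjusted counts for a term-frequency vector such as tf.2.

# Sketch: Good-Turing adjusted counts from an n-gram frequency vector
goodTuring <- function(tf) {
  Nr <- table(as.numeric(tf))          # N(r): number of n-grams seen exactly r times
  r  <- as.numeric(names(Nr))
  r.star <- sapply(seq_along(r), function(i) {
    j <- match(r[i] + 1, r)            # index of N(r+1), if any n-gram was seen r+1 times
    if (is.na(j)) r[i] else (r[i] + 1) * as.numeric(Nr[j]) / as.numeric(Nr[i])
  })
  data.frame(r = r, Nr = as.numeric(Nr), r.star = r.star)
}
head(goodTuring(tf.2))

Counts adjusted in this way shift probability mass from observed n-grams toward unseen ones, which is what Katz's back-off model relies on when a higher-order n-gram is missing.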