The Data

In this project, the en_US dataset is used as the corpus for the predictive models. The table below gives some basic statistics of the files in the en_US folder:

File Name           File Size (Bytes)   Number of Lines   Max Line Length   Number of Words
en_US.blogs.txt     210160014           899288            40833             37334114
en_US.news.txt      205811889           1010242           11384             34365936
en_US.twitter.txt   167105338           2360148           173               30359852
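
These statistics can be reproduced with a short script along the following lines (a sketch only: it assumes the three files sit in a local corpus folder and uses the stringi package for word counting, so the exact figures may differ slightly depending on encoding and how words are delimited).

# Sketch: basic statistics for each file (assumes files are in "corpus/";
# word counts use stringi and may differ slightly from the table above)
library(stringi)
files <- c("corpus/en_US.blogs.txt", "corpus/en_US.news.txt", "corpus/en_US.twitter.txt")
stats <- t(sapply(files, function(f) {
  lines <- readLines(f, encoding = "UTF-8", skipNul = TRUE)
  c(FileSizeBytes = file.size(f),
    NumberOfLines = length(lines),
    MaxLineLength = max(nchar(lines)),
    NumberOfWords = sum(stri_count_words(lines)))
}))
stats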

Data Exploration

As shown in the table above, each of the files contains more than 30 million words. To speed up the exploration, 10,000 randomly sampled lines drawn from the combined en_US.blogs.txt, en_US.news.txt, and en_US.twitter.txt files were used.

# Load the packages used in this report
library(tm)
library(ggplot2)

# Sample 10000 lines from the corpus with a fixed random seed
set.seed(90910)
blogs   <- readLines("corpus/en_US.blogs.txt")
news    <- readLines("corpus/en_US.news.txt")
twitter <- readLines("corpus/en_US.twitter.txt")
corpus  <- c(blogs, news, twitter)
sample.index <- sample(length(corpus), 10000)
sample <- corpus[sample.index]
# Number of words in the sample (one possible way to count; the original
# counting code was not shown in the report)
sum(sapply(strsplit(sample, "\\s+"), length))
## [1] 229277

The sampled text (229277 words) was preprocessed with tolower, removePunctuation, removeNumbers, and stripWhitespace, and the default English stopwords from the tm package were removed.

# Tokenizers for bigrams and trigrams (RWeka)
BigramTokenizer <- function(x){
  RWeka::NGramTokenizer(x, RWeka::Weka_control(min = 2, max = 2))
}

TrigramTokenizer <- function(x){
  RWeka::NGramTokenizer(x, RWeka::Weka_control(min = 3, max = 3))
}

# Compute a sorted term-frequency vector with the preprocessing described above
computeTF <- function(doc, tokenizer="words"){
  ctrl <- list(tokenize=tokenizer,   # termFreq expects the option name 'tokenize'
               tolower=TRUE,
               removePunctuation=TRUE,
               removeNumbers=TRUE,
               stripWhitespace=TRUE,
               stopwords=stopwords("english")
  )
  tf <- termFreq(PlainTextDocument(doc), control=ctrl)
  sort(tf, decreasing=TRUE)
}
# Exploring the sampled data
tf.1 <- computeTF(sample)
tf.2 <- computeTF(sample, tokenizer=BigramTokenizer)
tf.3 <- computeTF(sample, tokenizer=TrigramTokenizer)

The following plots show the 20 most frequent unigrams, bigrams, and trigrams in the sampled text.

# Print out the top 20 most frequent unigrams
termFrequency <- data.frame(unigram=names(tf.1)[1:20], frequency=as.numeric(tf.1)[1:20])
g <- ggplot(termFrequency, aes(x=reorder(unigram, frequency), y=frequency)) +
    geom_bar(stat = "identity") +  coord_flip() +
    theme(legend.title=element_blank()) +
    xlab("Unigram") + ylab("Frequency") +
    labs(title = "Top Unigrams by Frequency")
print(g)

# Print out the top 20 most frequent bigrams
termFrequency <- data.frame(bigram=names(tf.2)[1:20], frequency=as.numeric(tf.2)[1:20])
g <- ggplot(termFrequency, aes(x=reorder(bigram, frequency), y=frequency)) +
    geom_bar(stat = "identity") +  coord_flip() +
    theme(legend.title=element_blank()) +
    xlab("Bigram") + ylab("Frequency") +
    labs(title = "Top Bigrams by Frequency")
print(g)

# Print out the top 20 most frequent trigrams
termFrequency <- data.frame(trigram=names(tf.3)[1:20], frequency=as.numeric(tf.3)[1:20])
g <- ggplot(termFrequency, aes(x=reorder(trigram, frequency), y=frequency)) +
    geom_bar(stat = "identity") +  coord_flip() +
    theme(legend.title=element_blank()) +
    xlab("Trigram") + ylab("Frequency") +
    labs(title = "Top Trigrams by Frequency")
print(g)

The provided dataset is very large, so it is essential to understand how well a randomly sampled subset represents the complete dataset. In the script above, the unigram, bigram, and trigram frequencies were derived from the sampled text and stored in the tf.1, tf.2, and tf.3 objects. The following table summarizes the coverage statistics of unigrams, bigrams, and trigrams in the sampled text.

N-gram     Distinct n-grams   % needed to cover 50%   Cover 75%   Cover 90%   % of text from n-grams with freq > 1
unigram    22653              3.53                    16.28       50.09       88.19
bigram     117970             14.51                   55.87       82.35       53.13
trigram    177795             44.06                   72.03       88.81       15.66

As shown in the table, the 3.53% most frequent unigrams (799 terms) cover 50% of the sampled text, 3688 terms (16.28%) cover 75%, and 11346 terms (50.09%) cover 90%. Also, only 9298 of the unigrams occur more than once (41.05%), yet these terms account for 88.19% of all unigram occurrences in the sample.
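
The coverage figures above were derived from the sorted term-frequency vectors. A helper along the following lines (a sketch; coverageStats is a hypothetical name, not necessarily the exact code behind the table) illustrates the computation for tf.1.

# Sketch: coverage statistics from a term-frequency vector such as tf.1
coverageStats <- function(tf, targets = c(Cover50 = 0.50, Cover75 = 0.75, Cover90 = 0.90)) {
  tf  <- sort(as.numeric(tf), decreasing = TRUE)   # token counts, most frequent first
  cum <- cumsum(tf) / sum(tf)                      # cumulative share of all occurrences
  needed <- sapply(targets, function(p) which(cum >= p)[1])
  nontrivial <- sum(tf[tf > 1]) / sum(tf)          # share of occurrences from terms seen more than once
  c(TotalNgrams = length(tf),
    round(100 * needed / length(tf), 2),           # % of distinct terms needed for each target
    CoverNonTrivial = round(100 * nontrivial, 2))
}
coverageStats(tf.1)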

For bigrams and trigrams there are many more distinct combinations, and more of them occur with very low frequencies. Hence, more terms are needed to cover the same percentage of the text.

We further investigated the bigrams and trigrams. For a sample of 10,000 lines from the twitter file alone, there are 71590 distinct bigrams and 101005 distinct trigrams. Even at this sample size the tables are sizable, so we expect the n-gram tables for the whole dataset to be very large. In order to implement the word prediction application with limited resources, we will reduce the total number of n-grams using the smoothing methods suggested by Good-Turing frequency estimation and Katz's back-off model, and then try a hashing mechanism to compress the n-gram tables stored in system memory.
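
For reference, Good-Turing estimation replaces a raw count r with the adjusted count r* = (r + 1) * N(r+1) / N(r), where N(r) is the number of n-grams observed exactly r times. The sketch below (an illustrative helper, not the planned implementation) computes these adjusted counts for a term-frequency vector such as tf.2.

# Sketch: Good-Turing adjusted counts from an n-gram frequency vector
goodTuring <- function(tf) {
  Nr <- table(as.numeric(tf))          # N(r): number of n-grams seen exactly r times
  r  <- as.numeric(names(Nr))
  r.star <- sapply(seq_along(r), function(i) {
    j <- match(r[i] + 1, r)            # index of N(r+1), if any n-gram was seen r+1 times
    if (is.na(j)) r[i] else (r[i] + 1) * as.numeric(Nr[j]) / as.numeric(Nr[i])
  })
  data.frame(r = r, Nr = as.numeric(Nr), r.star = r.star)
}
head(goodTuring(tf.2))

Counts adjusted in this way shift probability mass from observed n-grams toward unseen ones, which is what Katz's back-off model relies on when a higher-order n-gram is missing.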