Overview

In this report, we examine the dataset provided by the Corpora collection. The documents comprise blog posts, news articles, and Twitter messages published in various languages around the world. The goal is to present a first exploratory analysis of the dataset and to outline the next step: building a prediction model.

For simplicity, we restrict the analysis to the en_US data.

Dataset

The table below gives a quick summary of each document collection.

The summary shows that these files are large (~200 MB each) and would consume a lot of memory if loaded in full. While the number of lines varies considerably between files (Total Lines), the total word counts are roughly comparable (Total Words). Sampling the same proportion of lines from each file to build a combined corpus therefore ensures that no single source skews the result.
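As a rough sketch, the per-file summary and the proportional sampling could be implemented as below; the file paths and the 1% sampling rate are illustrative assumptions, not values taken from this report.

```r
# Summarise and proportionally sample the three en_US files.
# Paths and the sampling rate are assumptions for illustration.
files <- c(blogs   = "final/en_US/en_US.blogs.txt",
           news    = "final/en_US/en_US.news.txt",
           twitter = "final/en_US/en_US.twitter.txt")

summarise_file <- function(path) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  data.frame(size_mb     = round(file.size(path) / 1024^2, 1),
             total_lines = length(lines),
             total_words = sum(lengths(strsplit(lines, "\\s+"))))
}

# Keep the same proportion of lines from each file so no source dominates.
sample_lines <- function(path, prop = 0.01) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  lines[as.logical(rbinom(length(lines), 1, prop))]
}

set.seed(42)
corpus_sample <- unlist(lapply(files, sample_lines))
```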

Tokenization & Filtering

We tokenize the sample to obtain the list of words it contains. Punctuation characters such as commas and parentheses are removed, under the assumption that they have little impact on word order and n-gram composition. The same goes for numeric characters, which appear in numbers and dates: although combinations of certain numbers and words can have predictive value, their frequencies are likely too low to be useful. Finally, all text is converted to lowercase.
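A minimal sketch of this cleaning step, using base R regular expressions as one possible implementation (the exact patterns are assumptions, not the report's own code):

```r
# Lowercase, strip punctuation and digits, and normalise whitespace.
clean_text <- function(lines) {
  lines <- tolower(lines)              # lowercase everything
  lines <- gsub("[^a-z ]", "", lines)  # drop punctuation and digits
  trimws(gsub("\\s+", " ", lines))     # collapse repeated whitespace
}

corpus_clean <- clean_text(corpus_sample)
tokens <- strsplit(corpus_clean, " ", fixed = TRUE)
```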

A few sample texts from each dataset after preprocessing are shown below.

Blogs

## [1] "as in rapture of spirit"                                                                                                                                                                                                                                                                                              
## [2] "under welllit conditions look for any signs of punctures such as nails or shards of glass which might potentially lead to a loss of pressure or a blowout bulging or cracking might also occur on old tires make sure you roll your bike forward in order to see all surface areas that come in contact with the road"
## [3] "difficulty  out of "

News

## [1] "st louis county is set to offer more than  million in lowinterest loans to help homeowners finance energysaving upgrades and lower their monthly utility bills"                                                                                                                                                                                                                                                                                                                                                                                                                           
## [2] "after the victory over michigan this year and georgetown two years ago perhaps the prospect of facing a legendary program has become less intimidating in the case of north carolina the bobcats also have a little history on their side ohio defeated the tar heels  on feb   on north carolinas home turf"                                                                                                                                                                                                                                                                             
## [3] "instead a brief but heated fracas erupted last week in the wake of some illconsidered comments from lee aronsohn cocreator and executive producer of two and a half men etan vlessing of the hollywood reporter published an interview with aronsohn conducted during a screenwriting conference in toronto aronsohn didnt just discuss two and a half men he suggested that there were too many femalecentric shows on tv vlessing quoted aronsohn saying enough ladies i get it you have periods and that were approaching a saturation point for jokes about women and their ladyparts"

Twitter

## [1] "doing my best to avoid checking to see who gets evicted from bb tonight before the west coast broadcast bb butithinkitwillbeshelly"
## [2] "rt  when in doubt listen to s pop music audiocontentment"                                                                          
## [3] "i call that the s  "

N-grams

An n-gram is a sequence of N words: a 2-gram (or bigram) is a two-word sequence such as “please turn”, “turn your”, or “your homework”, and a 3-gram (or trigram) is a three-word sequence such as “please turn your” or “turn your homework”.

The n-grams generated from the corpus differ considerably depending on whether “stop words” are kept or removed. Keeping stop words is more useful for an application like SwiftKey, but much less so if we have to guess a content word in a given context (as in a quiz). So, depending on the objective, we should decide whether to keep or drop the stop words in the corpus.
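As an illustration, the n-grams can be built from the token lists with a simple sliding window; this sketch reuses the `tokens` object from the cleaning step above and is one possible approach, not necessarily the one behind the term plots below.

```r
# Slide a window of length n over each line's tokens and paste the words.
make_ngrams <- function(words, n) {
  if (length(words) < n) return(character(0))
  vapply(seq_len(length(words) - n + 1),
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

unigram_freq <- sort(table(unlist(tokens)), decreasing = TRUE)
bigram_freq  <- sort(table(unlist(lapply(tokens, make_ngrams, n = 2))),
                     decreasing = TRUE)
trigram_freq <- sort(table(unlist(lapply(tokens, make_ngrams, n = 3))),
                     decreasing = TRUE)
head(bigram_freq)  # most frequent bigrams
```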

Unigram Terms

Bigram Terms

Trigram Terms

What’s Next: The Predictive Text Model

We will use these sets of n-grams to build predictive models. Each n-gram is split into its leading (n-1)-gram and its final word, which serve as the predictor and the outcome respectively. The Markov assumption states that the probability of a word depends only on the previous word, so for the bigram model we have the following approximation.

\[Pr(w_n|w_1^{n-1}) \approx Pr(w_n|w_{n-1})\]

where we represent a sequence of \(n\) words either as \(w_1 \dots w_n\) or \(w_1^n\), and the expression \(w_1^{n-1}\) means the string \(w_1, w_2, \dots, w_{n-1}\).

More generally, the maximum likelihood estimate for the N-gram model is the ratio of observed counts:

\[Pr(w_n|w_{n-N+1}^{n-1}) = \frac{\mathrm{count}(w_{n-N+1}^{n-1}\,w_n)}{\mathrm{count}(w_{n-N+1}^{n-1})}\]
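For instance, with the frequency tables built above, the bigram estimate reduces to a ratio of two table lookups; the helper below is illustrative only.

```r
# Bigram MLE: count("w_prev w") / count("w_prev"); returns NA when the
# bigram was never observed in the sample.
pr_bigram <- function(w_prev, w) {
  as.numeric(bigram_freq[paste(w_prev, w)]) /
    as.numeric(unigram_freq[w_prev])
}

pr_bigram("make", "sure")
```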

The prediction algorithm takes the entered text, cleans it, extracts the preceding 1 to n-1 words, sorts the matching n-grams by descending maximum likelihood, and returns the top items as predictions.
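A minimal sketch of that lookup, assuming the trigram table built above; because the denominator count is the same for every candidate continuation, ranking by raw trigram frequency already orders candidates by maximum likelihood.

```r
# Predict the next word by matching the last two entered words against
# the trigram table; an illustrative sketch, not the final model.
predict_next <- function(text, top_n = 3) {
  words  <- unlist(strsplit(clean_text(text), " ", fixed = TRUE))
  prefix <- paste(tail(words, 2), collapse = " ")
  hits   <- trigram_freq[startsWith(names(trigram_freq),
                                    paste0(prefix, " "))]
  # Return the final word of the top-ranked matching trigrams.
  vapply(strsplit(names(head(hits, top_n)), " "),
         tail, character(1), n = 1)
}

predict_next("make sure you")
```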