Overview

The complete code may be viewed at this GitHub location.

As engagement with mobile devices becomes second nature to us, we’re seeing more effective means of capturing the intended content of communication on these devices, viz. dynamic spell-checkers, swipeable keyboards, and next-word predictors. The aim of this analysis is to lay the foundation for a word prediction app that could be used in a similar vein to SwiftKey’s next-word prediction feature in their keyboard application.

Specifically, this report’s objectives are the following:
* Load and clean the input datasets required to train the prediction model,
* Explore summary statistics and analyze the dataset for other revelations,
* Lay out a blueprint for the prediction algorithm and the prediction application.

Load and Pre-Process Data

This exercise uses the files named LOCALE.blogs.txt, where LOCALE is each of the four locales en_US, de_DE, ru_RU, and fi_FI. The data comes from a corpus called HC Corpora. From the site: “HC corpora is a collection of corpora for various languages freely available to download. The corpora have been collected from numerous different webpages, with the aim of getting a varied and comprehensive corpus of current use of the respective language.”

I first read in the text files and unpack them into what turns out to be a set of very large corpora covering three document sources: blogs, news, and Twitter.
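Below is a minimal sketch of this load step in R. It assumes the standard en_US file names from the HC Corpora download are sitting in the working directory; the paths and object names are illustrative rather than the original code.

    # Minimal load sketch (file names assume the standard en_US HC Corpora files;
    # paths and object names are illustrative)
    blogs   <- readLines("en_US.blogs.txt",   encoding = "UTF-8", skipNul = TRUE)
    news    <- readLines("en_US.news.txt",    encoding = "UTF-8", skipNul = TRUE)
    twitter <- readLines("en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)

    # Collect the three character vectors into the "corp" object
    corp <- list(blogs = blogs, news = news, twitter = twitter)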

Before we can collect summary statistics for each corpus, some preliminary cleanup is needed, described in the sections that follow.

Word Tokenizing and Normalization

The corpora are stored as three character vectors in the “corp” object. In order to get a better understanding of each, it is necessary to first perform the following pre-processing steps (a sketch in R follows the list):

  • Force Lowercase

  • Remove Numbers

  • Remove Punctuation Characters: Attempt to remove special characters, such as those that form emoticons, along with apostrophes, periods, commas, semicolons, and other excessive punctuation. Exception: preserve the octothorpe “#” character in the Twitter corpus so that the significance of hashtag context can be analyzed.

  • Strip Unnecessary Whitespaces
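The sketch below shows one way these normalization steps might look in base R; the helper name and the exact regular expressions are my own illustration, not necessarily the original implementation.

    # Hedged sketch of the normalization steps (helper name and regexes are illustrative)
    clean_text <- function(x, tweets = FALSE) {
      x <- tolower(x)                       # force lowercase
      x <- gsub("[0-9]+", " ", x)           # remove numbers
      x <- gsub("'", "", x)                 # drop apostrophes ("don't" -> "dont")
      drop_pat <- if (tweets) "[^a-z#[:space:]]" else "[^a-z[:space:]]"
      x <- gsub(drop_pat, " ", x)           # remove punctuation; keep "#" for tweets
      gsub("\\s+", " ", trimws(x))          # strip unnecessary whitespace
    }

    corp <- list(
      blogs   = clean_text(corp$blogs),
      news    = clean_text(corp$news),
      twitter = clean_text(corp$twitter, tweets = TRUE)
    )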

Document Summaries

The following table shows the approximate counts for each document source after the steps above:

Source    Size (MB)   Lines       Tokens       Types
Blogs     239         899,288     36,818,677   874,912
News      239         1,010,242   33,476,199   890,333
Twitter   289         2,360,148   29,421,345   1,535,213
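For reference, here is a sketch of how such a summary can be computed. The stringi word extractor stands in for whatever tokenizer the original code used, and treating “size” as the in-memory size of each cleaned character vector is an assumption about what was measured.

    # Summary sketch; stringi word extraction and the in-memory size measure are assumptions
    library(stringi)

    summarize_source <- function(x) {
      words <- unlist(stri_extract_all_words(x))
      c(size_MB = round(as.numeric(object.size(x)) / 2^20),
        lines   = length(x),                 # number of lines/documents
        tokens  = length(words),             # running count of word tokens
        types   = length(unique(words)))     # count of distinct word types
    }

    t(sapply(corp, summarize_source))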

Stop Words

To get better type grouping and more meaningful frequency analysis, we remove common stop words such as were, ours, doesn't, aren't, at, isn't, that, theirs, above, and out.
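A one-line sketch of this step, assuming the tm package’s built-in English stop-word list (the exact list used originally may differ):

    # Stop-word removal sketch using tm's English list
    library(tm)
    corp <- lapply(corp, removeWords, stopwords("en"))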

Stemming

Finally, we strip affixes from words, reducing them to their stems, to further improve type grouping and frequency analysis.
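A sketch of the stemming step, using SnowballC’s Porter stemmer directly (the original report may have used tm’s stemDocument wrapper instead):

    # Stemming sketch: split each line into words and apply the Porter stemmer
    library(SnowballC)
    stem_lines <- function(x) {
      vapply(strsplit(x, "\\s+"),
             function(w) paste(wordStem(w, language = "english"), collapse = " "),
             character(1))
    }
    corp <- lapply(corp, stem_lines)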

Lemmatization

We choose not to perform any lemmatization on account of the rather generic problem domain.

Final Document Summary

After all our pre-processing is complete, we see the following document summary:

Source    Size (MB)   Lines       Tokens       Types
Blogs     177         899,288     19,359,446   617,781
News      189         1,010,242   19,657,775   648,259
Twitter   248         2,360,148   17,464,307   1,250,967

More Summary Statistics

Because the corpora are so large, I’ll explore statistics on a random sample. I’ll create a set of “sub” corpora by sampling 2% of the original corpora.
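A sketch of the sampling step (the seed and the 2% helper are illustrative):

    # 2% random sample of each processed corpus, combined into one set of "sub" lines
    set.seed(2017)                          # illustrative seed
    take_sample <- function(x, p = 0.02) x[sample.int(length(x), floor(p * length(x)))]

    sub_lines <- unlist(lapply(corp, take_sample), use.names = FALSE)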

Document Term Matrix

First, generate the sub-corpora, then the document-term matrix, which gives us term frequencies. The 20 most frequent terms in this sample corpus are listed below (a sketch of the computation follows the list):

will 6574
one 6086
like 6059
just 6037
get 5987
said 5958
time 5234
can 5153
day 4466
year 4382
make 4079
love 4002
new 3882
know 3718
good 3677
now 3633
dont 3548
work 3542
say 3316
want 3203
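The sketch below shows how these frequencies can be obtained with the tm package: build a corpus from the sampled lines, construct the document-term matrix, and sum the term counts column-wise.

    # Document-term matrix and term frequencies for the sample (tm + slam)
    library(tm)
    sub_corp <- VCorpus(VectorSource(sub_lines))
    dtm      <- DocumentTermMatrix(sub_corp)
    freq     <- sort(slam::col_sums(dtm), decreasing = TRUE)  # total count per term
    head(freq, 20)                                            # the 20 terms listed above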

Summary Plots

50 of the most frequently occurring words in the sample:

Finally, here is a word cloud of the words that occur at least 100 times in the sample corpora:
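Sketches of both plots, assuming base R graphics and the wordcloud package, with `freq` taken from the document-term-matrix sketch above:

    # Bar chart of the 50 most frequent terms and a word cloud of terms
    # occurring at least 100 times in the sample
    barplot(head(freq, 50), las = 2, cex.names = 0.7,
            main = "50 most frequent words in the sample")

    library(wordcloud)
    wordcloud(names(freq), freq, min.freq = 100, random.order = FALSE)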

Prediction Application

So far I’ve settled on a multi-gram Bayesian model as the predictor for the word prediction application. I’ll experiment with tri-grams and quad-grams under the Markov assumption. In essence, the model will compute something like P(next word | preceding bi-gram/tri-gram/etc.).
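As a rough illustration of the idea (not the final implementation), the sketch below builds tri-gram and bi-gram counts from the tokenized sample and computes the maximum-likelihood estimate P(w3 | w1 w2) = count(w1 w2 w3) / count(w1 w2) under the Markov assumption; the function and object names are hypothetical.

    # Illustrative n-gram sketch; tokens are derived from the sampled lines above
    tokens    <- unlist(strsplit(sub_lines, "\\s+"))
    n         <- length(tokens)
    trigrams  <- paste(tokens[1:(n - 2)], tokens[2:(n - 1)], tokens[3:n])
    bigrams   <- paste(tokens[1:(n - 1)], tokens[2:n])
    tri_count <- table(trigrams)
    bi_count  <- table(bigrams)

    # P(next word | previous two words) = count(tri-gram) / count(bi-gram prefix)
    predict_next <- function(w1, w2) {
      prefix     <- paste(w1, w2)
      candidates <- tri_count[startsWith(names(tri_count), paste0(prefix, " "))]
      if (length(candidates) == 0) return(NA_character_)
      probs <- candidates / bi_count[[prefix]]
      substring(names(which.max(probs)), nchar(prefix) + 2)  # the predicted word
    }

For example, predict_next("at", "the") would return whichever word most often follows “at the” in the sample. The real application will need smoothing and back-off for unseen n-grams, which this sketch does not attempt.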

End of Report
