The complete code may be viewed at this GitHub location.
As engagement with mobile devices becomes second nature, we are seeing ever more effective means of capturing the intended content of communication on these devices, viz. dynamic spell-checkers, swipeable keyboards, and next-word predictors. The aim of this analysis is to lay the foundations for a word prediction app that could be used in a similar vein to SwiftKey’s next-word prediction feature in their keyboard application.
Specifically, this report’s objectives are the following:
* Load and clean the input datasets required to train the prediction model,
* Explore summary statistics and analyze the dataset for other revelations,
* Lay out a blueprint for the prediction algorithm and the prediction application.
This exercise uses the files named LOCALE.blogs.txt, where LOCALE is each of the four locales en_US, de_DE, ru_RU and fi_FI. The data is from a corpus called HC Corpora. From the site: “HC corpora is a collection of corpora for various languages freely available to download. The corpora have been collected from numerous different webpages, with the aim of getting a varied and comprehensive corpus of current use of the respective language.”
I first read in the text files and unpack them into what turn out to be very large corpora containing the following corpus types (a minimal sketch of this step follows the list):
Blog Text: A collection of blog posts.
News Text: A collection of news articles.
Twitter Tweets: A collection of Tweets.
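The report does not reproduce its code inline, so the following is only a minimal sketch of the loading step in R; the `final/en_US/...` file layout, the encoding and `skipNul` settings, and the function name are assumptions rather than details taken from the report.

```r
# Minimal sketch: read each en_US file into a character vector.
# The path layout, encoding and skipNul settings are assumptions.
read_corpus_file <- function(path) {
  con <- file(path, open = "rb")
  on.exit(close(con))
  readLines(con, encoding = "UTF-8", skipNul = TRUE)
}

blogs   <- read_corpus_file("final/en_US/en_US.blogs.txt")
news    <- read_corpus_file("final/en_US/en_US.news.txt")
twitter <- read_corpus_file("final/en_US/en_US.twitter.txt")

# Keep the three character vectors together, mirroring the "corp" object
# referenced later in the report.
corp <- list(blogs = blogs, news = news, twitter = twitter)
```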
Before we can collect some summary statistics for each, some preliminary cleanup is needed. Specifically:
Each of the corpora is stored as a character vector in the “corp” object. In order to get a better understanding of each, it is necessary to first perform the following pre-processing steps (sketched in code after the list):
Force Lowercase
Remove Numbers
Remove Punctuation Characters: Remove special characters (including those that form emoticons) as well as apostrophes, periods, commas, semicolons, and other excessive punctuation. Exception: preserve the octothorpe “#” character in the Tweets corpus so the significance of hashtag context can be analyzed.
Strip Unnecessary Whitespaces
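A minimal sketch of these pre-processing steps, assuming a tm-based workflow (the report itself does not show its code); `clean_corpus` and `keep_hash` are illustrative names:

```r
library(tm)

# Sketch of the pre-processing steps listed above.
# keep_hash implements the exception for the Tweets corpus, where "#" is preserved.
clean_corpus <- function(text, keep_hash = FALSE) {
  crp <- VCorpus(VectorSource(text))
  crp <- tm_map(crp, content_transformer(tolower))   # force lowercase
  crp <- tm_map(crp, removeNumbers)                  # remove numbers
  if (keep_hash) {
    # Remove punctuation except "#" so hashtags survive.
    crp <- tm_map(crp, content_transformer(function(x) gsub("[^[:alnum:][:space:]#]", " ", x)))
  } else {
    crp <- tm_map(crp, removePunctuation)            # remove punctuation characters
  }
  tm_map(crp, stripWhitespace)                       # strip unnecessary whitespace
}

blogs_clean   <- clean_corpus(corp$blogs)
news_clean    <- clean_corpus(corp$news)
twitter_clean <- clean_corpus(corp$twitter, keep_hash = TRUE)
```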
The following table shows the approximate size, line, token, and type counts for each document source, after the steps above:
| document.source | size_MB | lines | tokens | types |
|---|---|---|---|---|
| Blogs | 239 | 899,288 | 36,818,677 | 874,912 |
| News | 239 | 1,010,242 | 33,476,199 | 890,333 |
| Twitter | 289 | 2,360,148 | 29,421,345 | 1,535,213 |
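Counts like those in the table can be approximated along the following lines; the whitespace tokenization here is a simplification and not necessarily what was used to produce the table:

```r
# Approximate size/line/token/type counts for one source.
# Whitespace tokenization is a simplification; results will differ slightly
# from the table above depending on the tokenizer.
summarise_source <- function(path, lines) {
  tokens <- unlist(strsplit(lines, "\\s+"))
  tokens <- tokens[tokens != ""]
  data.frame(
    size_MB = round(file.size(path) / 1024^2),
    lines   = length(lines),
    tokens  = length(tokens),
    types   = length(unique(tokens))
  )
}

summarise_source("final/en_US/en_US.blogs.txt", corp$blogs)
```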
To get better type grouping, and for more meaningful frequency analysis, we remove common stop words such as “were”, “ours”, “doesn’t”, “aren’t”, “at”, “isn’t”, “that”, “theirs”, “above”, “out”, etc.
Finally, we stem the words, stripping affixes from each word to leave its stem, which further improves type grouping and frequency analysis.
We choose not to perform any lemmatization on account of the rather generic problem domain.
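A sketch of the stop-word removal and stemming steps under the same assumed tm workflow (`SnowballC` supplies the stemmer used by `stemDocument`):

```r
library(tm)
library(SnowballC)  # stemmer backend for stemDocument

# Remove common English stop words, then reduce words to their stems.
post_process <- function(crp) {
  crp <- tm_map(crp, removeWords, stopwords("english"))  # drop stop words
  crp <- tm_map(crp, stemDocument)                       # strip affixes, keep stems
  tm_map(crp, stripWhitespace)                           # tidy gaps left by removal
}

blogs_clean   <- post_process(blogs_clean)
news_clean    <- post_process(news_clean)
twitter_clean <- post_process(twitter_clean)
```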
After all our pre-processing is complete, we see the following document summary:
| document.source | size_MB | lines | tokens | types |
|---|---|---|---|---|
| Blogs | 177 | 899,288 | 19,359,446 | 617,781 |
| News | 189 | 1,010,242 | 19,657,775 | 648,259 |
| Twitter | 248 | 2,360,148 | 17,464,307 | 1,250,967 |
Because the corpora are so large, I will explore statistics on a random sample instead. I create a “sub” corpus by sampling 2% of the lines from the original corpora, as sketched below.
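A sketch of the 2% sampling step; the seed and object names are illustrative, and the same cleaning pipeline sketched earlier is simply re-applied to the much smaller sample:

```r
set.seed(1234)  # illustrative seed, not taken from the report

# Draw a 2% random sample of lines from each source.
sample_lines <- function(lines, fraction = 0.02) {
  lines[sample.int(length(lines), size = ceiling(fraction * length(lines)))]
}

sub_lines <- lapply(corp, sample_lines)

# Re-run the cleaning and post-processing sketched above on the combined sample.
sub_corpus <- post_process(clean_corpus(unlist(sub_lines)))
```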
With the sub-corpora generated, we build the document term matrix, which gives us the frequency of terms (a sketch of this step follows the frequency table below). The 20 most frequent terms in this sample corpus are:
| term | frequency |
|---|---|
| will | 6574 |
| one | 6086 |
| like | 6059 |
| just | 6037 |
| get | 5987 |
| said | 5958 |
| time | 5234 |
| can | 5153 |
| day | 4466 |
| year | 4382 |
| make | 4079 |
| love | 4002 |
| new | 3882 |
| know | 3718 |
| good | 3677 |
| now | 3633 |
| dont | 3548 |
| work | 3542 |
| say | 3316 |
| want | 3203 |
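The frequencies above can be derived from a document term matrix over the sample; a minimal sketch (using `slam::col_sums`, which works on tm’s sparse matrices) is:

```r
# Build the document term matrix for the sample and rank terms by frequency.
dtm  <- DocumentTermMatrix(sub_corpus)
freq <- sort(slam::col_sums(dtm), decreasing = TRUE)  # term totals without densifying
head(freq, 20)                                        # the 20 most frequent terms
```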
Here are 50 of the most frequently occurring words in the sample:
Finally, here’s a wordcloud of the words that occur at least 100 times in the sample corpora:
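A sketch of how such a wordcloud can be produced from the frequency vector above, assuming the `wordcloud` package:

```r
library(wordcloud)

# Wordcloud of terms appearing at least 100 times in the sample corpora.
wordcloud(names(freq), freq, min.freq = 100, random.order = FALSE)
```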
So far I’ve settled on using a multi-gram Bayesian (n-gram) model to act as the predictor for the word prediction application. I’ll experiment with tri-grams and quad-grams with the Markov assumption in place. In essence, the prediction model will compute something like P(next word | preceding bi-gram/tri-gram/etc.).
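As a rough illustration of the planned approach rather than the final implementation, the sketch below builds trigram counts from the sampled lines and predicts the next word from the two preceding words under the Markov assumption; it uses a bare maximum-likelihood estimate and omits the smoothing and back-off a real model would need. All names here are illustrative.

```r
# Trigram predictor sketch: estimate P(w3 | w1, w2) by counting trigrams.
# Line boundaries are ignored here, a simplification acceptable for a sketch.
build_trigrams <- function(lines) {
  words <- unlist(strsplit(tolower(lines), "[^a-z']+"))
  words <- words[words != ""]
  n <- length(words)
  data.frame(
    w1 = words[1:(n - 2)],
    w2 = words[2:(n - 1)],
    w3 = words[3:n],
    stringsAsFactors = FALSE
  )
}

predict_next <- function(trigrams, first, second) {
  matches <- trigrams[trigrams$w1 == first & trigrams$w2 == second, "w3"]
  if (length(matches) == 0) return(NA_character_)    # a real model would back off to bi-grams here
  names(sort(table(matches), decreasing = TRUE))[1]  # most frequent continuation
}

trigrams <- build_trigrams(unlist(sub_lines))
predict_next(trigrams, "one", "of")
```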
End of Report