The complete code may be viewed at this GitHub location.
As engagement with mobile devices becomes second nature, we are seeing ever more effective means of capturing the intended content of communication on these devices, viz. dynamic spell-checkers, swipeable keyboards, and next-word predictors. The aim of this analysis is to lay the foundations for a word prediction app that could be used in a similar vein to SwiftKey’s next-word prediction feature in their keyboard application.
Specifically, this report’s objectives are the following:
* Load and clean the input datasets required to train the prediction model,
* Explore summary statistics and analyze the dataset for other revelations,
* Lay out a blueprint for the prediction algorithm and the prediction application.
This exercise uses the files named LOCALE.blogs.txt, where LOCALE is each of the four locales en_US, de_DE, ru_RU and fi_FI. The data is from a corpus called HC Corpora. From the site: “HC corpora is a collection of corpora for various languages freely available to download. The corpora have been collected from numerous different webpages, with the aim of getting a varied and comprehensive corpus of current use of the respective language.”
I first read in the text files and unpack them into what turn out to be very large corpora containing the following corpus types (a minimal sketch of this step follows the list):
Blog Text: A collection of blog posts.
News Text: A collection of news articles.
Twitter Tweets: A collection of Tweets.
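The report does not reproduce its code inline, so the following is only a minimal sketch of the loading step in R; the `final/en_US/...` file layout, the encoding and `skipNul` settings, and the function name are assumptions rather than details taken from the report.

```r
# Minimal sketch: read each en_US file into a character vector.
# The path layout, encoding and skipNul settings are assumptions.
read_corpus_file <- function(path) {
  con <- file(path, open = "rb")
  on.exit(close(con))
  readLines(con, encoding = "UTF-8", skipNul = TRUE)
}

blogs   <- read_corpus_file("final/en_US/en_US.blogs.txt")
news    <- read_corpus_file("final/en_US/en_US.news.txt")
twitter <- read_corpus_file("final/en_US/en_US.twitter.txt")

# Keep the three character vectors together, mirroring the "corp" object
# referenced later in the report.
corp <- list(blogs = blogs, news = news, twitter = twitter)
```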
Before we can collect some summary statistics for each, some preliminary cleanup is needed. Specifically:
Each of the corpora is stored as a character vector in the “corp” object. In order to get a better understanding of each, it is necessary to first perform the following pre-processing steps (sketched in code after the list):
Force Lowercase
Remove Numbers
Remove Punctuation Characters: Remove special characters (including those that form emoticons) as well as apostrophes, periods, commas, semicolons, and other excessive punctuation. Exception: preserve the octothorpe “#” character in the Tweets corpus so the significance of hashtag context can be analyzed.
Strip Unnecessary Whitespaces
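A minimal sketch of these pre-processing steps, assuming a tm-based workflow (the report itself does not show its code); `clean_corpus` and `keep_hash` are illustrative names:

```r
library(tm)

# Sketch of the pre-processing steps listed above.
# keep_hash implements the exception for the Tweets corpus, where "#" is preserved.
clean_corpus <- function(text, keep_hash = FALSE) {
  crp <- VCorpus(VectorSource(text))
  crp <- tm_map(crp, content_transformer(tolower))   # force lowercase
  crp <- tm_map(crp, removeNumbers)                  # remove numbers
  if (keep_hash) {
    # Remove punctuation except "#" so hashtags survive.
    crp <- tm_map(crp, content_transformer(function(x) gsub("[^[:alnum:][:space:]#]", " ", x)))
  } else {
    crp <- tm_map(crp, removePunctuation)            # remove punctuation characters
  }
  tm_map(crp, stripWhitespace)                       # strip unnecessary whitespace
}

blogs_clean   <- clean_corpus(corp$blogs)
news_clean    <- clean_corpus(corp$news)
twitter_clean <- clean_corpus(corp$twitter, keep_hash = TRUE)
```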
The following table shows the approximate size, line, token, and type counts for each document source, after the steps above:
| document.source | size_MB | lines | tokens | types |
|---|---|---|---|---|
| Blogs | 239 | 899,288 | 36,818,677 | 874,912 |
| News | 239 | 1,010,242 | 33,476,199 | 890,333 |
| Twitter | 289 | 2,360,148 | 29,421,345 | 1,535,213 |
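Counts like those in the table can be approximated along the following lines; the whitespace tokenization here is a simplification and not necessarily what was used to produce the table:

```r
# Approximate size/line/token/type counts for one source.
# Whitespace tokenization is a simplification; results will differ slightly
# from the table above depending on the tokenizer.
summarise_source <- function(path, lines) {
  tokens <- unlist(strsplit(lines, "\\s+"))
  tokens <- tokens[tokens != ""]
  data.frame(
    size_MB = round(file.size(path) / 1024^2),
    lines   = length(lines),
    tokens  = length(tokens),
    types   = length(unique(tokens))
  )
}

summarise_source("final/en_US/en_US.blogs.txt", corp$blogs)
```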
To get better type grouping, and for more meaningful frequency analysis, we remove common stop words such as “were”, “ours”, “doesn’t”, “aren’t”, “at”, “isn’t”, “that”, “theirs”, “above”, “out”, etc.
Finally, we stem the words, stripping affixes from each word to leave its stem, which further improves type grouping and frequency analysis.
We choose not to perform any lemmatization on account of the rather generic problem domain.
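A sketch of the stop-word removal and stemming steps under the same assumed tm workflow (`SnowballC` supplies the stemmer used by `stemDocument`):

```r
library(tm)
library(SnowballC)  # stemmer backend for stemDocument

# Remove common English stop words, then reduce words to their stems.
post_process <- function(crp) {
  crp <- tm_map(crp, removeWords, stopwords("english"))  # drop stop words
  crp <- tm_map(crp, stemDocument)                       # strip affixes, keep stems
  tm_map(crp, stripWhitespace)                           # tidy gaps left by removal
}

blogs_clean   <- post_process(blogs_clean)
news_clean    <- post_process(news_clean)
twitter_clean <- post_process(twitter_clean)
```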
After all our pre-processing is complete, we see the following document summary:
| document.source | size_MB | lines | tokens | types |
|---|---|---|---|---|
| Blogs | 177 | 899,288 | 19,359,446 | 617,781 |
| News | 189 | 1,010,242 | 19,657,775 | 648,259 |
| Twitter | 248 | 2,360,148 | 17,464,307 | 1,250,967 |
Because the corpora are so large, I will explore statistics on a random sample instead. I create a “sub” corpus by sampling 2% of the lines from the original corpora, as sketched below.
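A sketch of the 2% sampling step; the seed and object names are illustrative, and the same cleaning pipeline sketched earlier is simply re-applied to the much smaller sample:

```r
set.seed(1234)  # illustrative seed, not taken from the report

# Draw a 2% random sample of lines from each source.
sample_lines <- function(lines, fraction = 0.02) {
  lines[sample.int(length(lines), size = ceiling(fraction * length(lines)))]
}

sub_lines <- lapply(corp, sample_lines)

# Re-run the cleaning and post-processing sketched above on the combined sample.
sub_corpus <- post_process(clean_corpus(unlist(sub_lines)))
```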
With the sub-corpora generated, we build the document term matrix, which gives us the frequency of terms (a sketch of this step follows the frequency table below). The 20 most frequent terms in this sample corpus are:
| term | frequency |
|---|---|
| will | 6574 |
| one | 6086 |
| like | 6059 |
| just | 6037 |
| get | 5987 |
| said | 5958 |
| time | 5234 |
| can | 5153 |
| day | 4466 |
| year | 4382 |
| make | 4079 |
| love | 4002 |
| new | 3882 |
| know | 3718 |
| good | 3677 |
| now | 3633 |
| dont | 3548 |
| work | 3542 |
| say | 3316 |
| want | 3203 |
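The frequencies above can be derived from a document term matrix over the sample; a minimal sketch (using `slam::col_sums`, which works on tm’s sparse matrices) is:

```r
# Build the document term matrix for the sample and rank terms by frequency.
dtm  <- DocumentTermMatrix(sub_corpus)
freq <- sort(slam::col_sums(dtm), decreasing = TRUE)  # term totals without densifying
head(freq, 20)                                        # the 20 most frequent terms
```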
Here are 50 of the most frequently occurring words in the sample:
Finally, here’s a wordcloud of the words that occur at least 100 times in the sample corpora:
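A sketch of how such a wordcloud can be produced from the frequency vector above, assuming the `wordcloud` package:

```r
library(wordcloud)

# Wordcloud of terms appearing at least 100 times in the sample corpora.
wordcloud(names(freq), freq, min.freq = 100, random.order = FALSE)
```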
So far I’ve settled on using a multi-gram Bayesian (n-gram) model to act as the predictor for the word prediction application. I’ll experiment with tri-grams and quad-grams with the Markov assumption in place. In essence, the prediction model will compute something like P(next word | preceding bi-gram/tri-gram/etc.).
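As a rough illustration of the planned approach rather than the final implementation, the sketch below builds trigram counts from the sampled lines and predicts the next word from the two preceding words under the Markov assumption; it uses a bare maximum-likelihood estimate and omits the smoothing and back-off a real model would need. All names here are illustrative.

```r
# Trigram predictor sketch: estimate P(w3 | w1, w2) by counting trigrams.
# Line boundaries are ignored here, a simplification acceptable for a sketch.
build_trigrams <- function(lines) {
  words <- unlist(strsplit(tolower(lines), "[^a-z']+"))
  words <- words[words != ""]
  n <- length(words)
  data.frame(
    w1 = words[1:(n - 2)],
    w2 = words[2:(n - 1)],
    w3 = words[3:n],
    stringsAsFactors = FALSE
  )
}

predict_next <- function(trigrams, first, second) {
  matches <- trigrams[trigrams$w1 == first & trigrams$w2 == second, "w3"]
  if (length(matches) == 0) return(NA_character_)    # a real model would back off to bi-grams here
  names(sort(table(matches), decreasing = TRUE))[1]  # most frequent continuation
}

trigrams <- build_trigrams(unlist(sub_lines))
predict_next(trigrams, "one", "of")
```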
End of Report