We report the progress of the ongoing project that we are developing to support mobile device users in their typing actions. After a brief synopsis and description of the approach followed, we include the results of the exploratory analysis made on the datasets that will be used to build up our application. As a result, we detail the different actions that we are doing in order to clean and normalize those datasets. These actions have been tested on a sample for evaluating potential challenges raised due to the size of the datasets. Finally, we consider some actions to cope with these challenges, and summarize the next steps.
In this project, we are building a predictive text application for mobile devices that can help users while they are typing. The application will suggest to the user a list with the three more probable words coming after the word sequence typed so far (one-, two- or three-words sequences). That ranked list will be built up based on a probabilistic language model (using the modeling technique called “N-grams”) trained with several collections of real texts extracted from a variety of sources: news, blogs, and twitter messages.
From a corpus of texts, a N-grams language model collects all the different sequences of N words (thus, unigrams or 1-word sequence; bigrams or 2-words; trigrams or 3-words, etc.) existing in the texts, togheter their frequencies and probabilities.
With these N-grams frequency datasets we can easily build a set of tables referencing the three most probable continuations for a concrete sequence of words. Such ranked next-word tables are the only data that our predictive application needs. We have at our disposal several analytics and text mining tools in the R platform to obtain the corresponding tables for unigrams, bigrams, trigrams, 4-grams, etc. Once obtained, we can implement the predictive engine in our application.
As an example about how it will work, suppose that the user has typed I could have, a 3-words sequence. We must look for the next probable word. In the corresponding ranked table (obtained from the 4-grams frequency dataset) we find the following rows:
| Typed word sequence (3 words) | Most probable next words (ordered) |
|---|---|
| … | …. |
| I could have | 1. used, 2. known, 3. done |
| … | …. |
| I returned to | 1. the, 2. work, 3. school |
| … | …. |
So we propose to the user the words used, known, and done, in this order.
But it could also happen that the sequence typed by the user is not in the corresponding table. Suppose now that the user typed They presented in, which is not contained in the 3-words ranked table. In this case we will “drop” our search to the 2-words ranked table, ignoring the first word of our sequence. We will find:
| Typed word sequence (2 words) | Most probable next words (ordered) |
|---|---|
| … | …. |
| presented in | 1. the, 2. a, 3. order |
| … | …. |
The process continues down this path until a match is found. This strategy is called “backoff” and it covers the fact that the word sequence typed by the user not always has to appear in the training data used to build up the language model. In the worst case, when no word has been found (the “Out-of-vocabulary problem”“, OOV), we have algorithms (for instance,”Good Turing“) to assign probabilities, and then ranks, to OOV words.
The language model used for making the prediction will be trained from data coming from three different sources:
NewsBlogsTwitterGiven the different origins, these three sources could be quite variable in some basic characteristics.
Each line in the source file becomes one element in the corresponding dataset. In the case of twitter each element is a message (with the known limit of 140 characters). In blogs, it is a published post or entry with, in some cases, the comments. In news, it is a piece of information about a relevant event.
The following table summarizes basic indicators of each source:
| Source | Elements | Characters | Element avg. size (in chars) |
|---|---|---|---|
| News | 77,259 | 15,639,408 | 202.43 |
| Blogs | 899,288 | 206,824,505 | 229.99 |
| 2,360,148 | 162,096,031 | 68.68 |
As can be seen, blogs and twitter are much bigger than news. This raises the caveat that the more formal style of news texts could be underrepresented in the prediction model. The average size of each element is also different, being twitter texts clearly shorter than the rest, as expected.
The distribution of element’s size could also be much skewed depending of the source, as it can be seen in this boxplot graph:
Blogs element’s sizes are extremely skewed to the right (many outliers with big sizes), while twitter seems more homogeneous (please, take into account that sizes are represented in a logarithmic scale due to their big range).
For the purpose of considering next steps in this project, we have carried out a preliminary inspection of the texts contents of each source. As the datasets are really big (mainly blogs and twitter) the task has been done over a sample of 20% of the elements.
The goal of this exploration is to identify potential problems that could condition the phase where we will extract the words and sequences of words from the text. Apart from the most obvious issues -lower/upper case differences, profanities, numbers, links, etc.-, the following ones should be solved before any process be tackled (please, be aware that, without changing some parameters, non-printable characters in html are represented by their Unicode code in the form <U+XXXX>):
[1] "Hey,please follow me! I love you with all my soul Austin<U+2661><U+2661><U+2661><U+2661><U+2661><U+2661> 102"
[1] "So without further ado<U+0085>Rudyard Kipling<U+0092>s <U+0093>If<U+0094> (with my favorite lines underlined)<U+0085>.."
[1] "If you like reality shows, good-looking teenagers killing each other, and Lenny Kravitz <U+2015>he doesn<U+0092>t sing but exudes sex appeal<U+2015> in a minor role, <U+0093>The Hunger Games<U+0094> is for you. Donald Sutherland does a great job playing the bad guy."
[1] "<U+6D74><U+8863><U+3084><U+6CD5><U+88AB><U+3092><U+304A><U+6301><U+3061><U+306E><U+65B9><U+306F><U+3001><U+796D><U+306B><U+7740><U+3066><U+304D><U+3066><U+4E0B><U+3055><U+3044><U+3002><U+266A><U+266A>"
Thus, the cleaning process, including the standardization of characters that play a similar role in the sentence, is a prerequisite before any attempt of producing the N-gram frequency datasets is undertaken.
Taking into account usual procedures plus the issues commented in the previous section, we have decided to clean and normalize the texts removing and/or changing concrete characters or words. When compared with other kind of text processing applications, we have decided to maintain what are called “stopwords” (common pronouns, prepositions, linking words, etc.) as they are a substantial part of any sentence. Also, the contracted verbal forms (I’m, don’t, won’t, etc.) are considered as single words.
Fortunately we have a very powerful instrument for accomplishing the task, regular expressions, which can detect any character’s pattern in a text. Based on this instrument, in our project we are applying the following cleaning and normalizing procedures:
Removing any element with more than 50% of non-English (Latin) characters as they will be for sure in a foreign language.
Removing references to web pages, email addresses, external links, etc. They are not interpretable in terms of words.
Removing numbers, special characters, control symbols, etc. for the same reason.
Removing swear words and profanities (using the unofficial Google/Android profanities list).
Replacing any uppercase letter to its lowercase form, to count together any form of a word.
Normalizing all the quotation marks options to one single standard character, the apostrophe. Once this done, removing all the apostrophes except those used in contracted verb forms.
Normalizing all the hyphen/dashes to one single standard character: the simple hyphen. Once this done, removing all the hyphens except intra-word hyphens.
Normalizing any punctuation that can signal the end of a sentence (|.|;|:|?|!|) to a single mark (“<s/>”, in our case.). Removing the rest of punctuation signs.
The last step, marking carefully end of sentences, is particularly relevant for avoiding misleading predictions of words that never can be together, although they can appear as that in the training texts if we remove indiscriminately punctuation during the cleaning phase.
As it was explained before, N-grams are the different sequences of N words that appears in the text. After the cleaning and normalization of the sources, and for the purpose of having a glimpse of the effort and necessary resources, we have generated the corresponding N-grams (until four) for the sample datasets. The next table summarizes the results:
| Source | Elements | Characters | Sentences | Words | Unigrams | Bigrams | Trigrams | 4-grams |
|---|---|---|---|---|---|---|---|---|
| news | 15,451 | 3,127,143 | 34,704 | 547,389 | 38,933 | 268,572 | 438,471 | 463,763 |
| blogs | 179,819 | 42,204,640 | 513,414 | 7,878,578 | 157,269 | 1,877,426 | 4,593,103 | 5,981,724 |
| 472,029 | 32,919,387 | 749,608 | 6,583,532 | 168,884 | 1,528,388 | 3,432,432 | 4,220,265 |
We can observe that, as the value of N increases, N-grams sizes grow very quickly. Also that the number of distinct single words (unigrams) detected depends of the size of the source texts (in fact, it grows following the Heap’s law).
This means that when we will move from the samples to the full sources, producing the N-gram frequency datasets can be very costly task in terms of CPU and memory. We will need to consider how to cope with this challenge; in a later section we will introduce some alternatives
From the unigram table we can show the 15 most frequent words, globally and for each source:
As might be expected, the most frequent words are what we mention before: the usual English “stopwords”. This means that the sources are, let say, “average-English” examples well capable of being used as training datasets.
Interestingly, if we analyze frequent words now for each source, we can detect some characteristics patterns that allow us to identify the origin:
We can see the relevance of words like i, my, me or you in twitter texts compared with news (or even blogs), given its “conversational” character. On the contrary, words very usual in reported speech like said, was or he are in the list for news but not in the twitter one. Detecting this kind of patterns is the base of other text predictive application (for instance, classification), out of the scope of the current project.
Once sources are cleaned, the main next steps in our project are basically:
Producing the full N-gram frequency datasets.
From them, building up the ranked next-word tables with the list of the three most probable continuation given a previous sequence.
Developing the predictive application based on those ranked tables.
However, considering the requisite of minimize the size of these tables (we are talking about mobile device with limited memory resources), we should evaluate carefully how to produce the seeds of them: the full N-grams frequency datasets.
First of all, let us state again that what is deployed to the mobile device are the ranked tables, which only include a limited number of rows compared to the frequency datasets. So their size is considerable smaller, although it depends on the number of the N-grams detected.
In any case, the main problem is the cost of producing the different N-grams frequency datasets. We have two options (or, just in case, a combination of both):
Limiting the value of N, that is, working only with uni-, bi- and trigrams, leaving out 4-grams.
Pruning the N-grams by removing all the sequences below a threshold frequency.
The most systematic way of deciding about these questions is evaluating how well each alternative predicts a sample. There is a measure called ‘perplexity’ that allow to compare probabilistic models from that point of view: a lower perplexity indicates that the model is better.
In our case, the plan is to proceed with a “data science” approach:
To split the source texts in train and test subsets.
To train the language models under evaluation with the train subset.
To estimate their perplexity over the test subset.
In this way we can decide if adding a 4-grams model increases in a relevant percentage the predictive power of the application. Or if after pruning an N-gram from low frequency cases, the loss of power is negligible.