In spite of the miniaturization, computerization and technological achievements of the last decades, typing text on a keyboard remains remarkably similar to what it was a century ago. On portable digital devices like smartphones, where the keyboard is extremely small, the task has become uncomfortable and slow. The popularization of these gadgets pressures the industry to find easier ways of entering text. Predicting the next word a user wants to type, with a reasonable success rate, can reduce typing significantly.
In this report I describe a proof of concept of such a technique, to be implemented as a web application: the algorithm that will drive the application, its constraints, and how to measure its effectiveness.
Succinctly, the idea is to take a giant database of text, split it into short phrases, sort them by popularity (frequency of occurrence), and store them in a local database. When a user types a sentence, the application takes the partial phrase and looks for the best match in the database; if it finds one, it suggests the rest of the phrase to the user. This giant database of text is called a corpus.
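As a toy illustration of the idea, here is a minimal sketch of such a lookup; the phrase table and the `suggest` function are purely illustrative, not the actual implementation:

```python
# Toy illustration: a table of short phrases, most frequent first,
# and a lookup that completes the user's partial input.
phrase_table = [
    "thanks for the follow",
    "thanks for the rt",
    "looking forward to seeing you",
]

def suggest(partial):
    """Return the rest of the most frequent phrase that starts with the
    partial input, or None if nothing matches."""
    partial = partial.lower().strip()
    for phrase in phrase_table:
        if phrase.startswith(partial) and phrase != partial:
            return phrase[len(partial):].strip()
    return None

print(suggest("Thanks for the"))  # -> "follow"
```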
For this exercise the corpus consists of three files of English text. The first includes the content of web blogs, the second contains news and the third contains tweets. These files were provided by Coursera as part of the Data Science Capstone project.
| File | Number of lines |
|---|---|
| en_US.blogs.txt | 899,288 |
| en_US.news.txt | 1,010,242 |
| en_US.twitter.txt | 2,360,148 |
The first thing I did was to split these files into training and test sets. The training set is 50% of the corpus. The rest will be used during the validation and test phases.
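One possible way of performing this split is to sample lines at random; the sketch below assumes a 50% line-level split (the actual split method and seed used in the study are not specified):

```python
import random

random.seed(42)  # assumed seed, only for reproducibility of this sketch

def split_corpus(path, train_path, holdout_path, train_frac=0.5):
    """Send roughly `train_frac` of the lines to the training file and the
    rest to a holdout file for the validation and test phases."""
    with open(path, encoding="utf-8") as src, \
         open(train_path, "w", encoding="utf-8") as train, \
         open(holdout_path, "w", encoding="utf-8") as holdout:
        for line in src:
            (train if random.random() < train_frac else holdout).write(line)

for name in ["en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt"]:
    split_corpus(name, name + ".train", name + ".holdout")
```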
The rest of the study was done on the training set, and to allow a fair comparison with other studies the statistics are reported as relative rather than absolute values.
During the pre-processing phase I translated all characters to lowercase and split the text into words using regular expressions.
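A minimal sketch of this step, together with the word counting used below; the `[a-z']+` tokenizer is an assumption, and the exact regular expression used in the study may differ:

```python
import re
from collections import Counter

TOKEN_RE = re.compile(r"[a-z']+")  # assumed tokenizer: letters and apostrophes

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return TOKEN_RE.findall(text.lower())

def word_counts(path):
    """Count word occurrences over a whole file, one line at a time."""
    counts = Counter()
    with open(path, encoding="utf-8", errors="ignore") as f:
        for line in f:
            counts.update(tokenize(line))
    return counts

counts = word_counts("en_US.twitter.txt.train")
print(counts.most_common(10))
```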
Due to the way these files are organized, the number of words per line differs significantly among the three. Nevertheless, all three share roughly the same ratio of words to file size. It can be interpreted as a density: lexical words per byte.
| File | Words per byte |
|---|---|
| en_US.blogs.txt | 0.1868 |
| en_US.news.txt | 0.1745 |
| en_US.twitter.txt | 0.1909 |
The inverse of these values is not the average English word length in characters, as the total byte count also includes punctuation, numbers, symbols, etc.
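The density figures above can be reproduced along these lines; since the token definition is an assumption, the resulting numbers will only be approximate:

```python
import os
import re

TOKEN_RE = re.compile(r"[a-z']+")  # assumed tokenizer, as above

def words_per_byte(path):
    """Ratio of word tokens to file size in bytes."""
    total_words = 0
    with open(path, encoding="utf-8", errors="ignore") as f:
        for line in f:
            total_words += len(TOKEN_RE.findall(line.lower()))
    return total_words / os.path.getsize(path)

for name in ["en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt"]:
    print(name, round(words_per_byte(name), 4))
```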
There are some interesting facts within this corpus.
Counting the occurrences of words and taking the most common ones shows the expected results. The table below lists the 150 most frequent lemmas; rank increases down each column, continuing from left to right.
| 1-10 | 11-20 | 21-30 | 31-40 | 41-50 | 51-60 | 61-70 | 71-80 | 81-90 | 91-100 | 101-110 | 111-120 | 121-130 | 131-140 | 141-150 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| the | it | are | they | out | who | no | were | year | its | him | u | here | rt | these |
| to | you | but | by | one | more | she | day | first | only | because | got | thanks | before | come |
| and | on | as | will | an | do | some | how | than | over | much | even | need | many | home |
| a | with | he | s | about | had | new | know | see | other | us | want | where | those | little |
| of | was | we | or | what | get | been | love | after | going | too | work | very | say | never |
| i | my | not | just | if | time | our | them | into | think | well | did | most | take | best |
| in | at | so | said | like | there | i’m | people | make | last | really | am | years | m | may |
| for | be | from | up | when | her | it’s | back | also | then | way | right | any | should | week |
| is | this | me | your | has | would | good | which | t | great | today | off | life | being | through |
| that | have | all | his | can | their | now | go | don’t | two | could | still | down | made | while |
However, the lemma "t" surprisingly appears among the most popular words. This is because its occurrences add up across many different meanings: "t" as a tablespoon abbreviation in recipes, "t" in contractions with a missing apostrophe, and "t" as an abbreviation of "the" and "to". The set also contains the lemma "u", which of course is the popular abbreviation of the word "you" in informal written language.
Another finding is that the word "obama" occurs about as often as other common words like "welcome". This suggests that the corpus for this kind of application must be a living entity.
Misspellings and foreign lemmas also show some commonality. "adios" and "tomorow" have the same frequency as the beautiful but uncommon word "triangular".
As you can see, foreign words were not filtered out. If they appear in an English corpus, it is presumably for a reason.
While frequency distributions are commonly shown using histograms, in this case I prefer to plot them as cumulative frequencies relative to the total number of words.
The chart is more readable with the x axis on a logarithmic scale.
A layman may be surprised by the linearity of the relationship in this chart. Since the x axis is on a log scale, the straight line means that coverage grows with the logarithm of the number of words; in other words, the number of words needed grows exponentially with the desired coverage. This is in fact a well known behavior described by Zipf’s law, see http://en.wikipedia.org/wiki/Zipf%27s_law
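The chart itself can be reproduced with a few lines; the sketch below uses a tiny toy `Counter` so it runs on its own, but the real input would be the word counts built from the training set:

```python
from collections import Counter
import numpy as np
import matplotlib.pyplot as plt

# A toy Counter stands in for the real word counts built earlier.
counts = Counter({"the": 50, "to": 30, "and": 20, "a": 15, "of": 10, "cat": 2})

freqs = np.array(sorted(counts.values(), reverse=True), dtype=float)
coverage = np.cumsum(freqs) / freqs.sum()   # cumulative relative frequency
ranks = np.arange(1, len(freqs) + 1)

plt.semilogx(ranks, coverage)               # logarithmic x axis
plt.xlabel("number of most frequent words (log scale)")
plt.ylabel("cumulative fraction of word occurrences")
plt.show()
```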
Those 150 lemmas cover 50% of word usage. A dictionary of 1,000 words would cover about 70% of word occurrences, while increasing coverage to 80% would require roughly tripling its size, to approximately 3,000 words.
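These coverage figures can be checked with a short calculation over a frequency `Counter` such as the one built earlier (a sketch, not the exact code used in the study):

```python
def words_for_coverage(counts, target):
    """Smallest dictionary size whose cumulative frequency reaches `target`,
    expressed as a fraction of all word occurrences."""
    total = sum(counts.values())
    covered = 0
    for rank, (word, freq) in enumerate(counts.most_common(), start=1):
        covered += freq
        if covered / total >= target:
            return rank
    return len(counts)

# Per the figures above, words_for_coverage(counts, 0.50) should be on the
# order of 150 words and words_for_coverage(counts, 0.80) roughly 3,000.
```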
This property is key to deciding the dictionary size that best fits the application's restricted environment.
If instead of single words we take unique pairs of words and measure their frequencies, we build what are called bi-grams or 2-grams. Plotting them shows a similar pattern, although many more of them are needed to obtain the same increase in coverage.
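A sketch of how these 2-grams could be counted, again assuming the simple tokenizer used above:

```python
import re
from collections import Counter

TOKEN_RE = re.compile(r"[a-z']+")  # assumed tokenizer, as above

def bigram_counts(path):
    """Count adjacent word pairs (2-grams), line by line."""
    counts = Counter()
    with open(path, encoding="utf-8", errors="ignore") as f:
        for line in f:
            words = TOKEN_RE.findall(line.lower())
            counts.update(zip(words, words[1:]))
    return counts

bigrams = bigram_counts("en_US.twitter.txt.train")
print(bigrams.most_common(10))
```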
The proof of concept will be implemented as a very simple web application consisting of a text box for data entry and two small information panels, one on top and one at the bottom of the text box. The user will type whatever she wants in the text box, and the predicted word will appear in the top panel. If she presses Tab, the word will be copied into the box, avoiding the need to type it.
The bottom panel will display some statistics about the performance of the application, such as the number of keystrokes saved for the user and the successful and failed prediction rates.
Internally the application will hold an n-gram dictionary (the first n-1 words being the predictors, the nth the result) for each n from 2 to 4.
If there is no match in the 4-gram dictionary, the algorithm will fall back to the 3-gram dictionary and try to match there, and so on.
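A minimal sketch of this back-off lookup; it assumes each n-gram table maps a tuple of n-1 predictor words to its single most frequent continuation, which is one possible way of storing the dictionaries described above:

```python
def predict_next(words, ngram_tables):
    """Back off from the longest available context to the shortest.

    `ngram_tables` maps n (2..4) to a dict whose keys are tuples of the
    n-1 predictor words and whose values are the most frequent next word.
    """
    for n in sorted(ngram_tables, reverse=True):      # try 4-grams first
        context = tuple(words[-(n - 1):])              # last n-1 typed words
        if len(context) == n - 1 and context in ngram_tables[n]:
            return ngram_tables[n][context]
    return None                                        # no prediction

# Toy example with made-up tables:
tables = {
    2: {("the",): "best"},
    3: {("of", "the"): "day"},
    4: {("at", "the", "end"): "of"},
}
print(predict_next(["at", "the", "end"], tables))    # -> "of"  (4-gram hit)
print(predict_next(["none", "of", "the"], tables))   # -> "day" (falls back to the 3-gram)
```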
The coverage can be increased while keeping the memory footprint limited by keying the internal structures with hashes instead of literal strings.
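One way to apply this idea is to hash the n-1 predictor words into a fixed-size integer key instead of storing the literal strings; the use of `crc32` below is purely illustrative, and occasional hash collisions are accepted as a small source of wrong suggestions:

```python
import zlib

def context_key(words):
    """Hash the predictor words into a 32-bit integer so the dictionary
    keys do not have to store the literal strings."""
    return zlib.crc32(" ".join(words).encode("utf-8"))

# Store {hash of context: predicted word} instead of {("at", "the", "end"): "of"}.
compact_4grams = {context_key(["at", "the", "end"]): "of"}
print(compact_4grams[context_key(["at", "the", "end"])])  # -> "of"
```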