In spite of the miniaturization, computerization and technological achievements of the last decades, typing text on a keyboard remains remarkably similar to what it was a century ago. On portable digital devices like smartphones, where the keyboard is extremely small, the task has become uncomfortable and slow. The popularization of these gadgets pressures the industry to find easier ways of entering text. Predicting the next word a user wants to type, with a reasonable success rate, can reduce typing significantly.
In this report I describe a proof of concept of such a technique, to be implemented as a web application: the algorithm that will drive the application, its constraints, and how to measure its effectiveness.
Succinctly, the idea is to take a giant database of text, split it into short phrases, sort them by popularity (frequency of occurrence), and store them in a local database. When a user types a sentence, the application takes the partial phrase and looks for the best match in the database; if it finds one, it suggests the rest of the phrase to the user. This giant database of text is called a corpus.
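As a toy illustration of the idea, here is a minimal sketch of such a lookup; the phrase table and the `suggest` function are purely illustrative, not the actual implementation:

```python
# Toy illustration: a table of short phrases, most frequent first,
# and a lookup that completes the user's partial input.
phrase_table = [
    "thanks for the follow",
    "thanks for the rt",
    "looking forward to seeing you",
]

def suggest(partial):
    """Return the rest of the most frequent phrase that starts with the
    partial input, or None if nothing matches."""
    partial = partial.lower().strip()
    for phrase in phrase_table:
        if phrase.startswith(partial) and phrase != partial:
            return phrase[len(partial):].strip()
    return None

print(suggest("Thanks for the"))  # -> "follow"
```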
For this exercise the corpus consists of three files of English text. The first includes the content of web blogs, the second contains news and the third contains tweets. These files were provided by Coursera as part of the Data Science Capstone project.
| File | Number of lines |
|---|---|
| en_US.blogs.txt | 899,288 |
| en_US.news.txt | 1,010,242 |
| en_US.twitter.txt | 2,360,148 |
The first thing I did was to split these files into training and test sets. The training set is 50% of the corpus. The rest will be used during the validation and test phases.
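One possible way of performing this split is to sample lines at random; the sketch below assumes a 50% line-level split (the actual split method and seed used in the study are not specified):

```python
import random

random.seed(42)  # assumed seed, only for reproducibility of this sketch

def split_corpus(path, train_path, holdout_path, train_frac=0.5):
    """Send roughly `train_frac` of the lines to the training file and the
    rest to a holdout file for the validation and test phases."""
    with open(path, encoding="utf-8") as src, \
         open(train_path, "w", encoding="utf-8") as train, \
         open(holdout_path, "w", encoding="utf-8") as holdout:
        for line in src:
            (train if random.random() < train_frac else holdout).write(line)

for name in ["en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt"]:
    split_corpus(name, name + ".train", name + ".holdout")
```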
The rest of the study was done on the training set, and to allow a fair comparison with other studies the statistics are reported as relative rather than absolute values.
During the pre-processing phase I translated all characters to lowercase and split the text into words using regular expressions.
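A minimal sketch of this step, together with the word counting used below; the `[a-z']+` tokenizer is an assumption, and the exact regular expression used in the study may differ:

```python
import re
from collections import Counter

TOKEN_RE = re.compile(r"[a-z']+")  # assumed tokenizer: letters and apostrophes

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return TOKEN_RE.findall(text.lower())

def word_counts(path):
    """Count word occurrences over a whole file, one line at a time."""
    counts = Counter()
    with open(path, encoding="utf-8", errors="ignore") as f:
        for line in f:
            counts.update(tokenize(line))
    return counts

counts = word_counts("en_US.twitter.txt.train")
print(counts.most_common(10))
```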
Due to the way these files are organized, the number of words per line differs significantly among the three. Nevertheless, all three share roughly the same ratio of words to file size. It can be interpreted as a density: lexical words per byte.
| File | Words per byte |
|---|---|
| en_US.blogs.txt | 0.1868 |
| en_US.news.txt | 0.1745 |
| en_US.twitter.txt | 0.1909 |
The inverse of these values is not the average English word length in characters, as the total byte count also includes punctuation, numbers, symbols, etc.
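The density figures above can be reproduced along these lines; since the token definition is an assumption, the resulting numbers will only be approximate:

```python
import os
import re

TOKEN_RE = re.compile(r"[a-z']+")  # assumed tokenizer, as above

def words_per_byte(path):
    """Ratio of word tokens to file size in bytes."""
    total_words = 0
    with open(path, encoding="utf-8", errors="ignore") as f:
        for line in f:
            total_words += len(TOKEN_RE.findall(line.lower()))
    return total_words / os.path.getsize(path)

for name in ["en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt"]:
    print(name, round(words_per_byte(name), 4))
```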
There are some interesting facts within this corpus.
Counting the occurrences of words and taking the most common ones shows the expected results. The table below lists the 150 most frequent lemmas; rank increases down each column, continuing from left to right.
| 1-10 | 11-20 | 21-30 | 31-40 | 41-50 | 51-60 | 61-70 | 71-80 | 81-90 | 91-100 | 101-110 | 111-120 | 121-130 | 131-140 | 141-150 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| the | it | are | they | out | who | no | were | year | its | him | u | here | rt | these |
| to | you | but | by | one | more | she | day | first | only | because | got | thanks | before | come |
| and | on | as | will | an | do | some | how | than | over | much | even | need | many | home |
| a | with | he | s | about | had | new | know | see | other | us | want | where | those | little |
| of | was | we | or | what | get | been | love | after | going | too | work | very | say | never |
| i | my | not | just | if | time | our | them | into | think | well | did | most | take | best |
| in | at | so | said | like | there | i’m | people | make | last | really | am | years | m | may |
| for | be | from | up | when | her | it’s | back | also | then | way | right | any | should | week |
| is | this | me | your | has | would | good | which | t | great | today | off | life | being | through |
| that | have | all | his | can | their | now | go | don’t | two | could | still | down | made | while |
However, the lemma "t" surprisingly appears among the most popular words. This is because its occurrences add up across many different meanings: "t" as a tablespoon abbreviation in recipes, "t" in contractions with a missing apostrophe, and "t" as an abbreviation of "the" and "to". The set also contains the lemma "u", which of course is the popular abbreviation of the word "you" in informal written language.
Another finding is that the word "obama" occurs about as often as other common words like "welcome". This suggests that the corpus for this kind of application must be a living entity.
Misspellings and foreign lemmas also show some commonality. "adios" and "tomorow" have the same frequency as the beautiful but uncommon word "triangular".
As you can see, foreign words were not filtered out. If they appear in an English corpus, it is presumably for a reason.
While frequency distributions are commonly shown using histograms, in this case I prefer to plot them as cumulative frequencies relative to the total number of words.
The chart is more readable with the x axis on a logarithmic scale.
A layman may be surprised by the linearity of the relationship in this chart. Since the x axis is on a log scale, the straight line means that coverage grows with the logarithm of the number of words; in other words, the number of words needed grows exponentially with the desired coverage. This is in fact a well known behavior described by Zipf’s law, see http://en.wikipedia.org/wiki/Zipf%27s_law
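The chart itself can be reproduced with a few lines; the sketch below uses a tiny toy `Counter` so it runs on its own, but the real input would be the word counts built from the training set:

```python
from collections import Counter
import numpy as np
import matplotlib.pyplot as plt

# A toy Counter stands in for the real word counts built earlier.
counts = Counter({"the": 50, "to": 30, "and": 20, "a": 15, "of": 10, "cat": 2})

freqs = np.array(sorted(counts.values(), reverse=True), dtype=float)
coverage = np.cumsum(freqs) / freqs.sum()   # cumulative relative frequency
ranks = np.arange(1, len(freqs) + 1)

plt.semilogx(ranks, coverage)               # logarithmic x axis
plt.xlabel("number of most frequent words (log scale)")
plt.ylabel("cumulative fraction of word occurrences")
plt.show()
```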
Those 150 lemmas cover 50% of word usage. A dictionary of 1,000 words would cover about 70% of word occurrences, while increasing coverage to 80% would require roughly tripling its size, to approximately 3,000 words.
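These coverage figures can be checked with a short calculation over a frequency `Counter` such as the one built earlier (a sketch, not the exact code used in the study):

```python
def words_for_coverage(counts, target):
    """Smallest dictionary size whose cumulative frequency reaches `target`,
    expressed as a fraction of all word occurrences."""
    total = sum(counts.values())
    covered = 0
    for rank, (word, freq) in enumerate(counts.most_common(), start=1):
        covered += freq
        if covered / total >= target:
            return rank
    return len(counts)

# Per the figures above, words_for_coverage(counts, 0.50) should be on the
# order of 150 words and words_for_coverage(counts, 0.80) roughly 3,000.
```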
This property is key to deciding the dictionary size that best fits the application's restricted environment.
If instead of single words we take unique pairs of words and measure their frequencies, we build what are called bi-grams or 2-grams. Plotting them shows a similar pattern, although many more of them are needed to obtain the same increase in coverage.
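A sketch of how these 2-grams could be counted, again assuming the simple tokenizer used above:

```python
import re
from collections import Counter

TOKEN_RE = re.compile(r"[a-z']+")  # assumed tokenizer, as above

def bigram_counts(path):
    """Count adjacent word pairs (2-grams), line by line."""
    counts = Counter()
    with open(path, encoding="utf-8", errors="ignore") as f:
        for line in f:
            words = TOKEN_RE.findall(line.lower())
            counts.update(zip(words, words[1:]))
    return counts

bigrams = bigram_counts("en_US.twitter.txt.train")
print(bigrams.most_common(10))
```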
The proof of concept will be implemented as a very simple web application consisting of a text box for data entry and two small information panels, one on top and one at the bottom of the text box. The user will type whatever she wants in the text box, and the predicted word will appear in the top panel. If she presses Tab, the word will be copied into the box, avoiding the need to type it.
The bottom panel will display some statistics about the performance of the application, such as the number of keystrokes saved for the user and the successful and failed prediction rates.
Internally the application will hold an n-gram dictionary (the first n-1 words being the predictors, the nth the result) for each n from 2 to 4.
If there is no match in the 4-gram dictionary, the algorithm will fall back to the 3-gram dictionary and try to match there, and so on.
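A minimal sketch of this back-off lookup; it assumes each n-gram table maps a tuple of n-1 predictor words to its single most frequent continuation, which is one possible way of storing the dictionaries described above:

```python
def predict_next(words, ngram_tables):
    """Back off from the longest available context to the shortest.

    `ngram_tables` maps n (2..4) to a dict whose keys are tuples of the
    n-1 predictor words and whose values are the most frequent next word.
    """
    for n in sorted(ngram_tables, reverse=True):      # try 4-grams first
        context = tuple(words[-(n - 1):])              # last n-1 typed words
        if len(context) == n - 1 and context in ngram_tables[n]:
            return ngram_tables[n][context]
    return None                                        # no prediction

# Toy example with made-up tables:
tables = {
    2: {("the",): "best"},
    3: {("of", "the"): "day"},
    4: {("at", "the", "end"): "of"},
}
print(predict_next(["at", "the", "end"], tables))    # -> "of"  (4-gram hit)
print(predict_next(["none", "of", "the"], tables))   # -> "day" (falls back to the 3-gram)
```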
The coverage can be increased while keeping the memory footprint limited by keying the internal structures with hashes instead of literal strings.
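One way to apply this idea is to hash the n-1 predictor words into a fixed-size integer key instead of storing the literal strings; the use of `crc32` below is purely illustrative, and occasional hash collisions are accepted as a small source of wrong suggestions:

```python
import zlib

def context_key(words):
    """Hash the predictor words into a 32-bit integer so the dictionary
    keys do not have to store the literal strings."""
    return zlib.crc32(" ".join(words).encode("utf-8"))

# Store {hash of context: predicted word} instead of {("at", "the", "end"): "of"}.
compact_4grams = {context_key(["at", "the", "end"]): "of"}
print(compact_4grams[context_key(["at", "the", "end"])])  # -> "of"
```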