Building the Next Word Prediction Engine from scratch

D.Dakhno
06. Oct. 2016

JHU Data Science Specialization Capstone Project (in cooperation with SwitfKey)

Challenges of real world English processing or "Garbage In, Garbage Out"

Have fun! 頑張ってね~~

Hit Chart Elegy (音楽番付哀歌)
Just Like Great Britain’s–Unfortunately Bibi on Fox News Две ракеты были выпущены по южной территории #Израиль из #СекторГаза
★ 12月30日 Awoi Instore at Zeal Link Shibuya
…fermented red beancurd or hong fu ru 红腐乳…
…to order their debut EP, 플레이걸의 24時.
Xi Lu Han is the full name of Lu Han (루한 / 鹿 晗)
..LOL we may be getting an adlut
♥♥ I hope !
He loves YOU!(((: #His<3isbettter

Regular English by itself offers fast endless plenty of grammatical, syntactical and lexical forms and combinations with extremely rich semantic background. In its function of lingua franca in the times of globalization and Internet, it multiplies the variability of the lexical forms an analyst has to deal with.

The most challenging step in the course of analysis of a given text corpus HC corpus for the purposes of prediction was therefore preprocessing of the text, trying to remove the “garbage”, reduce the variability but avoiding extreme bias as a result. The corpus includes patterns from different segments of Internet (news, blogs, Twitter).

The main steps of text preprocessing in this project were:

  • removing numbers
  • trimming excessive white spaces
  • removing lines with words from languages like Chinese, Japanese, Russian, Hebrew and so on
  • converting the punctuation marks to standard where possible
  • removing “words” consisting of non-letter characters (though, hash-tags and simple smileys accepted)
  • reducing variability through leaving out articles (the, a , an)

Predictive model and algorithm

N-Gramms model has been chosen as a basis for prediction engine. Such models are widely used in statistical natural language processing. They are typically collected from a text or speech corpus. So did we as well (HC Corpus, available over Coursera/SwiftKey).

The following principles were implemented for building n-gram databases:

  • Building data bases of mono, bi-, tri-, four- and pentagrams.
  • End of the sentence or of the line was used as a natural break point building the n-grams.
  • Data bases were packed by the length in the R data tables, splited into predictor (all words of n-gram but last), outcome (the last word of n-gram) plus detected frequency. The format is directly usable by prediction engine.
  • Because of high memory consumption only 15% of the text corpus could be processed at a time. With random sampling and 21 processing iterations the lines coverage in corpus of 97% has been reached (mean redundancy 3.15) . The n-gram databases were then then merged into one per n-gram length.
  • Working data bases were created, trimming the over-represented predictors (like “to be”) to upper 150 outcomes by frequency and removing rare combinations.

Using the data bases created, following predictive model has been implemented:

  • Maximal length of an expression took as a predictor is four words.
  • As the first step, the decision is made towards either next-word or in-word prediction (trailing blanks in the input string?); a flag for possible article as a last word will be set.
  • For next-word search, the line trimmed the same way as the text corpus is used as the subject for exact search in the indexed data tables, from longer to shorter predictors downwards. Each of the following steps is to be performed, when preceding one still has brought no suggestions.
  • Exact search after trimming by endings like “ 'll”, “ 're” and so on.
  • False input of the last word (typo?) assumed; the search is to be repeated iteratively, shortening the last word by one letter.
  • The original predictor expression is shortened by the first word; the rest becomes the subject of exact search and other steps, listed above.
  • The last word standing goes into exact search in the used words database, then into in-word prediction.
  • The in-word prediction make the search in the data base of words used in text corpus, followed by the reference word list if needed. Exact search and search for the words, beginning with the given pattern.
  • In in-word prediction, the relative frequency of the findings in the data bases as well as eventually preceding n-gram (for in-word prediction) count by sorting of the output.

Quick start

Benchmarking prediction engine

Benchmarking of statistical and technical performance of the Next Word Prediction Engine has been conducted using the n-grams derived from the Corpus of Contemporary American English (one million most frequent pentagrams, case insensitive).

  • A set of one thousand randomly selected patterns was applied to prediction engine in five different setups, with respect to the size of n-gram data bases and having possible savings for modestly equipped gadgets in mind.
  • For each test pentagram the sequence of first four words served as expression to be put into prediction. For this four-gram all available completions (last word) were investigated in the test set and sorted descendant by frequency.
  • As statistical accuracy was taken the grade of coverage of thus expected outcomes by the suggestions from prediction engine, as well as the tie of ranks in both groups.
Mbyte nr.mono nr.bi nr.tri nr.four nr.penta accuracy by_ranking mean_time median_time sd_time
1 932.87 292005 1576301 4535312 3652975 5320886 0.5981 0.5548 0.0087 0.005 0.0317
2 4141.19 422469 3379408 16324126 21729539 20934044 0.6009 0.5319 0.0082 0.005 0.0414
3 9078.07 861463 5360592 32419383 48965451 49787933 0.6546 0.5281 0.0087 0.004 0.0799
4 8262.96 292005 1576301 16324126 48965451 49787933 0.6523 0.5247 0.0071 0.004 0.0321
5 1603.60 861463 5360592 16324126 3652975 5320886 0.6058 0.5572 0.0102 0.005 0.0789

Larger data bases show about 10% higher accuracy for the price of the ten-fold higher memory consumption (lines 1. to 3. in the table). The comparison of the the right-loaded setup (larger four- and pentagramm databases, smaller for shorter predictors - line 4.) versus the left-loaded reveals the prevalence of the first one (one more time, for the price of RAM consumption).

Some surprisingly, the average execution time of predictions were not affected by the size of the data bases (some higher times in the last setup are apparently due to more drilling down through the data bases needed for prediction). The relative short response times are due to the choice of indexed data tables as a storage mechanism.

For the real world scenarios like typing in a phrase letter by letter caching mechanisms have been implemented, reducing the load on the data bases and response times of fuzzy search (queries with %like%), especially of in-word prediction.

The results are noway the final truth, but pioneer the way to the better and parsimonious word prediction models and implementations.

Congratulations to all graduates!

We have overcome!

We are the champions, my friends!

1K THNX to Brian, Roger and Jeff♥♥♥

'm assembling a new monster PC &…

…GTSY by Kaggle ;)

BR

DD