Sentence Builder Proposal

Francois Schonken
2015-04

A simple word prediction algorithm optimised for memory-constrained environments (e.g. mobile phones and tablets).

Data Preparation

The training data used is the HC Corpora data set. For the purposes of this Proof of Concept only the United States English data was used.

The blog, news and Twitter data were combined. The data was then split into sentences (to avoid creating n-grams that cross sentence boundaries) and lower-cased; numbers, punctuation and surplus white space were stripped.
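The cleaning steps can be sketched roughly as follows. This is a minimal illustration, not the code used for the Proof of Concept: the function names are hypothetical, the sentence splitter is deliberately naive, and keeping apostrophes inside words is an assumption.

```python
import re

def split_sentences(text):
    # Naive split after ., ! or ? followed by white space (assumption:
    # the real pipeline may use a more careful sentence tokenizer).
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def clean_sentence(sentence):
    s = sentence.lower()               # lower-case
    s = re.sub(r"\d+", " ", s)         # trim numbers
    s = re.sub(r"[^a-z'\s]", " ", s)   # strip punctuation, keep apostrophes
    return s.split()                   # split on (and collapse) white space

def prepare(text):
    # One token list per sentence; empty sentences are dropped.
    cleaned = (clean_sentence(s) for s in split_sentences(text))
    return [tokens for tokens in cleaned if tokens]

sentences = prepare("I have 2 cats. They're great!")
```

Keeping each sentence as its own token list is what prevents n-grams from spanning a sentence boundary later on.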

Next the 2-gram, 3-gram and 4-gram tables were compiled, followed by a list of sentence-starting words so that we can reasonably predict the first word in a sentence.
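A small sketch of how such tables might be compiled from already-tokenised sentences. The function names and the toy input are hypothetical; only the idea (count n-grams within each sentence, and count first words separately) comes from the description above.

```python
from collections import Counter

def count_ngrams(sentences, n):
    # Count every run of n consecutive words, one sentence at a time,
    # so that no n-gram crosses a sentence boundary.
    counts = Counter()
    for tokens in sentences:
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

def count_start_words(sentences):
    # Frequency of each word that opens a sentence.
    return Counter(tokens[0] for tokens in sentences if tokens)

sentences = [["i", "like", "tea"], ["i", "like", "coffee"]]
tables = {n: count_ngrams(sentences, n) for n in (2, 3, 4)}
starts = count_start_words(sentences)
```

With an empty input field, the most frequent entries in the start-word table can be offered as the first prediction.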

Algorithm

The prediction algorithm implements stupid backoff: it starts with the 4-gram data, backs off to the 3-gram data and finally backs off to the 2-gram data.
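To make the backoff order concrete, here is a minimal sketch of stupid backoff over 4-, 3- and 2-gram tables. It is an illustration under assumptions rather than the Proof of Concept code: the names are hypothetical, and the back-off factor of 0.4 is the commonly cited default, not a value stated in this proposal.

```python
from collections import Counter, defaultdict

ALPHA = 0.4  # assumption: commonly cited back-off factor

def build_tables(sentences, orders=(2, 3, 4)):
    # tables[n][context] is a Counter of the words seen after that context.
    tables = {n: defaultdict(Counter) for n in orders}
    for tokens in sentences:
        for n in orders:
            for i in range(len(tokens) - n + 1):
                context = tuple(tokens[i:i + n - 1])
                tables[n][context][tokens[i + n - 1]] += 1
    return tables

def predict(tables, words, k=3):
    scores = {}
    penalty = 1.0
    for n in sorted(tables, reverse=True):       # 4-gram, 3-gram, 2-gram
        if len(words) < n - 1:
            continue                             # not enough context yet
        followers = tables[n].get(tuple(words[-(n - 1):]))
        if followers:
            total = sum(followers.values())
            for word, count in followers.items():
                # Keep the score from the highest order that knows the word.
                scores.setdefault(word, penalty * count / total)
        penalty *= ALPHA                         # each back-off step is discounted
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:k]]

sentences = [["i", "like", "green", "tea"], ["i", "like", "green", "apples"],
             ["we", "like", "green", "tea"], ["they", "want", "green", "beans"]]
tables = build_tables(sentences)
top3 = predict(tables, ["i", "like", "green"])
```

Unlike smoothed models, the scores are not normalised probabilities; they are only used to rank candidate words, which keeps the lookup cheap.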

Lazy (late) profanity filtering is implemented over a predefined list of words, found here. The filter is applied at prediction time, which gives us the option of enabling or disabling it at run time.
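One way to picture the late filter is as a final pass over the ranked candidates. The sketch below is an assumption about the shape of that step; the function name, the placeholder word list and the choice to filter before truncating to the top k are all illustrative.

```python
def filter_predictions(candidates, banned, enabled=True, k=3):
    # candidates: words already ranked best-first by the predictor.
    # banned: set of words from the predefined list (placeholders here).
    if not enabled:
        return candidates[:k]
    # Filter first, then truncate, so removed words are replaced by
    # the next-best candidates instead of leaving gaps.
    return [word for word in candidates if word not in banned][:k]

banned = {"darn", "heck"}                      # hypothetical placeholder list
ranked = ["darn", "good", "heck", "great", "fine"]
clean = filter_predictions(ranked, banned, enabled=True)
raw = filter_predictions(ranked, banned, enabled=False)
```

Because the n-gram tables themselves are left untouched, switching the filter on or off needs no retraining.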

Compressing and Pruning

The biggest challenge in implementing this Proof of Concept was working with a constrained memory footprint. In an effort to conserve memory an integer number is assigned to each word.

The n-gram tables are constructed using these numbers, making them much smaller than a string-based representation.
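The word-to-integer mapping might look like the following sketch. The class and function names are hypothetical and the actual storage layout of the tables is not described here; the point is only that each distinct word is stored once and every n-gram refers to it by number.

```python
class Vocabulary:
    """Maps each distinct word to a small integer and back."""

    def __init__(self):
        self.word_to_id = {}
        self.id_to_word = []

    def encode(self, word):
        # Assign the next free integer the first time a word is seen.
        if word not in self.word_to_id:
            self.word_to_id[word] = len(self.id_to_word)
            self.id_to_word.append(word)
        return self.word_to_id[word]

    def decode(self, word_id):
        return self.id_to_word[word_id]

def encode_ngram(vocab, tokens):
    # An n-gram becomes a tuple of integers instead of a tuple of strings.
    return tuple(vocab.encode(token) for token in tokens)

vocab = Vocabulary()
first = encode_ngram(vocab, ["i", "like", "tea"])
second = encode_ngram(vocab, ["i", "like", "coffee"])
```

Predictions are looked up entirely in integer form and only the final candidates are decoded back to strings for display.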

Lastly, the 3-grams and 4-grams with fewer than 2 occurrences (i.e. those seen only once) were pruned.
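As a rough sketch, the pruning step amounts to a threshold on the counts of the higher-order tables only. The function names and the toy counts below are hypothetical.

```python
def prune(counts, min_count=2):
    # Keep only n-grams seen at least min_count times.
    return {ngram: c for ngram, c in counts.items() if c >= min_count}

def prune_tables(tables, orders=(3, 4), min_count=2):
    # Only the 3-gram and 4-gram tables are pruned; the 2-grams are kept
    # in full so the last back-off level retains its coverage.
    return {n: prune(t, min_count) if n in orders else dict(t)
            for n, t in tables.items()}

tables = {
    2: {(0, 1): 1, (1, 2): 3},
    3: {(0, 1, 2): 1, (1, 2, 3): 2},
    4: {(0, 1, 2, 3): 1},
}
pruned = prune_tables(tables)
```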

Memory Footprint

ngram.1 =  25.88 MB,   396,465 rows
ngram.2 =   6.05 MB,   793,052 rows
ngram.3 =  60.75 MB, 5,308,970 rows
ngram.4 =  45.61 MB, 2,989,154 rows
Total   = 138.30 MB

Benchmark Performance

Overall top-3 score:     16.44 %
Overall top-1 precision: 11.78 %
Overall top-3 precision: 20.79 %
Average runtime:         606.76 msec
Total memory used:       139.57 MB

Dataset details
 Dataset "blogs" (599 lines, 14587 words)
  Score: 13.39 %, Top-1 precision: 9.13 %, Top-3 precision: 17.28 %
 Dataset "quizzes" (20 lines, 323 words)
  Score: 22.11 %, Top-1 precision: 16.50 %, Top-3 precision: 27.39 %
 Dataset "tweets" (793 lines, 14011 words)
  Score: 13.81 %, Top-1 precision: 9.71 %, Top-3 precision: 17.69 %


Proof of Concept
Select your profanity preference, type your partial sentence in the input field, press Enter and let Sentence Builder help you finish your sentence. The predicted words are listed in green to the right of the input field.

The Proof of Concept is hosted on shinyapps.io. Give it a try.