A simple word prediction algorithm optimised for environments with a limited memory footprint (e.g. mobile phones and tablets).
The training data used is the HC Corpora data set. For the purposes of this Proof of Concept only the United States English data was used.
The blog, news and Twitter data were combined. The text was then split into sentences (to avoid creating n-grams that cross sentence boundaries), lower-cased, and stripped of numbers, punctuation and surplus white space.
Next, the 2-gram, 3-gram and 4-gram tables were compiled, followed by a list of sentence-opening words so that the first word of a sentence can be reasonably predicted.
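The cleaning steps above can be sketched roughly as follows. This is a minimal illustration, not the code used for the Proof of Concept: the function name `clean_corpus`, the naive sentence splitter and the decision to keep apostrophes are all assumptions.

```python
import re

def clean_corpus(text):
    """Split raw text into sentences and normalise each one (hypothetical helper)."""
    # Assumption: a sentence ends at '.', '!' or '?' followed by white space.
    sentences = re.split(r"(?<=[.!?])\s+", text)
    cleaned = []
    for sentence in sentences:
        s = sentence.lower()                # lower-case
        s = re.sub(r"\d+", " ", s)          # remove numbers
        s = re.sub(r"[^a-z' ]+", " ", s)    # remove punctuation, keep apostrophes
        tokens = s.split()                  # also collapses repeated white space
        if tokens:
            cleaned.append(tokens)
    return cleaned

print(clean_corpus("I have 2 cats. They're GREAT!  Really?"))
```

Each sentence becomes its own token list, so later n-gram extraction never spans two sentences.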
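As an illustration of this step, the sketch below counts n-grams within each sentence and tallies sentence-opening words. The function names and the use of `collections.Counter` are assumptions made for the example; the input is a list of token lists, one per sentence.

```python
from collections import Counter

def count_ngrams(sentences, n):
    """Count n-grams sentence by sentence, so none crosses a boundary."""
    counts = Counter()
    for tokens in sentences:
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

def count_sentence_starts(sentences):
    """Count how often each word opens a sentence."""
    return Counter(tokens[0] for tokens in sentences if tokens)

sentences = [["i", "like", "tea"], ["i", "like", "coffee"]]
bigrams = count_ngrams(sentences, 2)
starts = count_sentence_starts(sentences)
```

The most frequent entries in `starts` can then serve as suggestions before the user has typed anything.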
The prediction algorithm implements stupid backoff: it starts with the 4-gram data, backs off to the 3-gram data and finally to the 2-gram data.
Lazy (late) profanity filtering is implemented over a predefined list of words, found here. Because the filter is applied at prediction time, it can be enabled or disabled at run time.
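A minimal sketch of the back-off idea is shown below. It is not the Proof of Concept's implementation: the table layout, the function names and the back-off factor of 0.4 (the value commonly used with stupid backoff) are assumptions, and the context count is approximated by the sum of its follower counts.

```python
from collections import Counter, defaultdict

ALPHA = 0.4  # assumed back-off factor

def build_tables(sentences, orders=(2, 3, 4)):
    """Map order -> context tuple -> Counter of following words."""
    tables = {n: defaultdict(Counter) for n in orders}
    for tokens in sentences:
        for n in orders:
            for i in range(len(tokens) - n + 1):
                context = tuple(tokens[i:i + n - 1])
                tables[n][context][tokens[i + n - 1]] += 1
    return tables

def predict(tables, words, k=3):
    """Return up to k next-word candidates, trying the highest order first."""
    scores = {}
    penalty = 1.0
    for n in sorted(tables, reverse=True):
        if len(words) < n - 1:
            continue  # not enough history for this order
        followers = tables[n].get(tuple(words[-(n - 1):]))
        if followers:
            total = sum(followers.values())
            for word, count in followers.items():
                # Keep the score from the highest order in which the word was seen.
                scores.setdefault(word, penalty * count / total)
        penalty *= ALPHA  # each back-off step discounts the score
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:k]]

sentences = [
    ["i", "like", "green", "tea"],
    ["i", "like", "green", "apples"],
    ["we", "like", "green", "tea"],
    ["they", "like", "black", "tea"],
]
tables = build_tables(sentences)
print(predict(tables, ["you", "like", "green"]))
```

In the example call the 4-gram context is unseen, so the candidates come from the 3-gram table with a discounted score.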
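One way to picture late filtering: the n-gram data is left untouched and the ranked candidates are filtered only when a prediction is returned. The sketch below assumes a hypothetical `filter_predictions` helper and uses mild placeholder words in place of the real list.

```python
def filter_predictions(candidates, banned, enabled=True, k=3):
    """Return the top k candidates, skipping banned words when the filter is on."""
    if not enabled:
        return candidates[:k]
    return [word for word in candidates if word not in banned][:k]

banned = {"darn", "heck"}  # placeholder entries, not the actual word list
ranked = ["darn", "tea", "heck", "coffee", "milk"]

print(filter_predictions(ranked, banned))                 # filter on
print(filter_predictions(ranked, banned, enabled=False))  # filter off
```

Because the filter runs after ranking, the predictor should produce more than k candidates so that k remain once banned words are dropped.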
The biggest challenge in implementing this Proof of Concept was working with a constrained memory footprint. To conserve memory, an integer is assigned to each word.
The n-gram tables are built from these integers, making them much smaller than a string-based representation.
Lastly, the 3-grams and 4-grams with fewer than 2 occurrences were pruned.
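The two memory-saving measures, integer word IDs and pruning of rare n-grams, might look like the sketch below. The function names are hypothetical and plain Python dictionaries stand in for the actual tables, so this shows the idea only and not the real storage format.

```python
from collections import Counter

def build_vocabulary(sentences):
    """Assign each distinct word an integer ID, starting at 1."""
    word_to_id = {}
    for tokens in sentences:
        for word in tokens:
            word_to_id.setdefault(word, len(word_to_id) + 1)
    return word_to_id

def encode_ngrams(sentences, word_to_id, n):
    """Count n-grams as tuples of integer IDs instead of strings."""
    counts = Counter()
    for tokens in sentences:
        ids = [word_to_id[word] for word in tokens]
        for i in range(len(ids) - n + 1):
            counts[tuple(ids[i:i + n])] += 1
    return counts

def prune(counts, min_count=2):
    """Drop n-grams that occur fewer than min_count times."""
    return Counter({gram: c for gram, c in counts.items() if c >= min_count})

sentences = [["i", "like", "tea"], ["i", "like", "tea"], ["i", "like", "coffee"]]
vocab = build_vocabulary(sentences)
trigrams = encode_ngrams(sentences, vocab, 3)
kept = prune(trigrams)
```

Each word string is stored once in the vocabulary; every n-gram row then holds only small integers and a count.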
ngram.1 = 25.88 MB, 396,465 rows
ngram.2 = 6.05 MB, 793,052 rows
ngram.3 = 60.75 MB, 5,308,970 rows
ngram.4 = 45.61 MB, 2,989,154 rows
Total = 138.30 MB
Overall top-3 score: 16.44 %
Overall top-1 precision: 11.78 %
Overall top-3 precision: 20.79 %
Average runtime: 606.76 msec
Total memory used: 139.57 MB

Dataset details
Dataset "blogs" (599 lines, 14587 words): Score: 13.39 %, Top-1 precision: 9.13 %, Top-3 precision: 17.28 %
Dataset "quizzes" (20 lines, 323 words): Score: 22.11 %, Top-1 precision: 16.50 %, Top-3 precision: 27.39 %
Dataset "tweets" (793 lines, 14011 words): Score: 13.81 %, Top-1 precision: 9.71 %, Top-3 precision: 17.69 %
Proof of Concept
Select your profanity preference, type your partial sentence in the input field, press Enter and let Sentence Builder help you finish your sentence. The predicted words are listed in green to the right of the input field.
The Proof of Concept is hosted on shinyapps.io. Give it a try.