1. Word Predictor

The word predictor app is a prediction model based on 854,264 samples of text from such sources as

  • Blog posts
  • News articles
  • Twitter messages

Based on these samples it can predict the next word in a given sentence.
There are several hurdles that needed to be taken for this project, memory limits were a large concern and circumvented with a disk-only solution.

2. Functionality

  • Maintaining a KISS (Keep It Simple Stupid) principle, the app has one input and one output field.
  • The user pastes the example into the text field and the original text, including the predicted word will appear below.
  • The backend library can be reused with any front-end. Just a single function called “SearchString” is used.
  • Response speed is extremely fast due to the implemented storage mechanism.

3. Data selection

  • Data is cleaned and processed with the tm package
  • Foreign words are filtered out with the help of an English dictionary file
  • N-grams are created for two, three and four letter combinations
  • Per combination total occurances are counted and stored
  • The last word is kept separate and used as the predicted word

4. Probability determination

  • Frequences of occurances are used for each n-gram selection to determine probability
  • These probabilities are smoothed with the Turing-Good estimation
  • Probabilities for smaller n-grams are included in higher n-grams through linear interpolation
    • Example “I love to eat” and “I love to contemplate” both occur once.
      • “love to eat” occurs 4 times, but “love to contemplate” only once.
      • “I love to eat” get's a higher probability because the probability of “love to eat” is included

5. Data storage and performance

  • Data is stored in an indexed SQLite database
  • By normalizing data, we save all potential words in a single “words” table. N-gram tables only contain integer references to the words table.
  • This saves in space and dramatically increases speed. Allowing for an efficient database of 287 MB consisting of:
    • 2,454,543 4-grams
    • 1,034,878 3-grams
    • 50,757 2-grams
  • Memory requirements are less than 20 mb (excluding base R functions)
  • Performance of 500 searches averages at around 60 milliseconds per search