1. Word Predictor

The word predictor app is a prediction model based on 854,264 samples of text from such sources as

Blog posts
News articles
Twitter messages

Based on these samples it can predict the next word in a given sentence.
There are several hurdles that needed to be taken for this project, memory limits were a large concern and circumvented with a disk-only solution.

2. Functionality

Maintaining a KISS (Keep It Simple Stupid) principle, the app has one input and one output field.
The user pastes the example into the text field and the original text, including the predicted word will appear below.
The backend library can be reused with any front-end. Just a single function called “SearchString” is used.
Response speed is extremely fast due to the implemented storage mechanism.

3. Data selection

Data is cleaned and processed with the tm package
Foreign words are filtered out with the help of an English dictionary file
N-grams are created for two, three and four letter combinations
Per combination total occurances are counted and stored
The last word is kept separate and used as the predicted word

4. Probability determination

Frequences of occurances are used for each n-gram selection to determine probability
These probabilities are smoothed with the Turing-Good estimation
Probabilities for smaller n-grams are included in higher n-grams through linear interpolation
- Example “I love to eat” and “I love to contemplate” both occur once.
  - “love to eat” occurs 4 times, but “love to contemplate” only once.
  - “I love to eat” get's a higher probability because the probability of “love to eat” is included

5. Data storage and performance

Data is stored in an indexed SQLite database
By normalizing data, we save all potential words in a single “words” table. N-gram tables only contain integer references to the words table.
This saves in space and dramatically increases speed. Allowing for an efficient database of 287 MB consisting of:
- 2,454,543 4-grams
- 1,034,878 3-grams
- 50,757 2-grams
Memory requirements are less than 20 mb (excluding base R functions)
Performance of 500 searches averages at around 60 milliseconds per search