Juan Carlos Mayo

text prediction with R

the model

The model is based on a 4-gram table that collects the most frequent word combinations. The prediction function takes the input text, splits it into words and looks for matches in the table

It takes the last three words available - trigrams - and looks for them, returning the fourth one if there is a match

If no match is found, it repeats the process using the last two words - bigrams. If there is still no match, it falls back to the last word alone - unigrams

If there is no matching unigram either, it returns three of the most common words, chosen at random

the model

Matches are sorted by their frequency and the three most frequent results are returned
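
The back-off and ranking steps can be sketched roughly as follows. This is an illustration only, not the original implementation: the table contents, the column names (`w1` to `w4`, `freq`), the `common_words` pool and the function name `predict_next` are all assumptions, and a plain data frame is used here to keep the sketch free of dependencies. It also assumes that shorter contexts are matched against the rightmost context columns of the same 4-gram table.

```r
# Toy 4-gram table (hypothetical data): three context words, the word that
# followed them, and how often that combination was seen.
ngrams <- data.frame(
  w1   = c("thanks", "thanks", "one",  "one"),
  w2   = c("for",    "for",    "of",   "of"),
  w3   = c("the",    "the",    "the",  "the"),
  w4   = c("follow", "help",   "best", "most"),
  freq = c(12,       7,        30,     21),
  stringsAsFactors = FALSE
)
common_words <- c("the", "to", "and", "a", "of")  # assumed fallback pool

predict_next <- function(text, table = ngrams, n = 3) {
  # Split the input into lowercase words, dropping empty strings.
  words <- tolower(unlist(strsplit(trimws(text), "\\s+")))
  words <- words[nzchar(words)]
  # Try the last 3 words, then the last 2, then the last 1.
  for (k in 3:1) {
    if (length(words) < k) next
    context <- tail(words, k)
    cols <- c("w1", "w2", "w3")[(4 - k):3]  # rightmost k context columns
    hit <- rep(TRUE, nrow(table))
    for (i in seq_len(k)) hit <- hit & table[[cols[i]]] == context[i]
    if (any(hit)) {
      # Sum frequencies per candidate word, then keep the n most frequent.
      totals <- vapply(split(table$freq[hit], table$w4[hit]), sum, numeric(1))
      return(names(head(sort(totals, decreasing = TRUE), n)))
    }
  }
  # No match at any level: pick n of the common words at random.
  sample(common_words, n)
}
```

With this toy table, `predict_next("thanks for the")` returns `"follow"` and `"help"`, while `predict_next("in the")` finds no two-word match and falls back to the last word, returning `"best"`, `"most"` and `"follow"`.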

[Figure: tetragram_table]

performance

The model relies on the data.table format, which makes it really fast to query and return results
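
As a minimal sketch of why data.table lookups are fast, the example below keys a small table on its three context columns so that queries use binary search instead of scanning every row. The table contents and column names are hypothetical and may differ from the actual 4-gram table.

```r
library(data.table)

# Hypothetical 4-gram table; column names are assumptions for illustration.
tetragrams <- data.table(
  w1   = c("thanks", "thanks", "one",  "one"),
  w2   = c("for",    "for",    "of",   "of"),
  w3   = c("the",    "the",    "the",  "the"),
  w4   = c("follow", "help",   "best", "most"),
  freq = c(12L,      7L,       30L,    21L)
)

# Keying sorts the table once so later lookups can use binary search.
setkey(tetragrams, w1, w2, w3)

# Keyed join on the three context words; nomatch = NULL drops non-matches.
hits <- tetragrams[.("thanks", "for", "the"), nomatch = NULL]

# Order by descending frequency and keep the three most frequent words.
top <- head(hits[order(-freq)], 3)$w4
```

Here `top` holds `"follow"` and `"help"`. A context that is not in the table yields zero rows rather than an error.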

The accuracy of the prediction function is limited: on a test corpus of more than 40K sentences, the last word of a sentence was predicted correctly 1.11% of the time

This suggests that the prediction function could be improved with more advanced methods, which in turn require far more computing power

This simple model is, nevertheless, fast and lightweight and provides reasonable accuracy for common text input
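
A last-word accuracy figure like the one above could be computed along these lines. This is a sketch under assumptions: the function names are hypothetical, `toy_predictor` is a stand-in for the real prediction function, and a prediction is counted as correct when the true last word appears among the returned suggestions, which may not match how the original figure was scored.

```r
# Share of sentences whose final word appears among the suggestions
# returned for the preceding words.
last_word_accuracy <- function(sentences, predictor) {
  hits <- vapply(sentences, function(s) {
    words <- unlist(strsplit(tolower(trimws(s)), "\\s+"))
    if (length(words) < 2) return(NA)  # no context to predict from
    context <- paste(head(words, -1), collapse = " ")
    tail(words, 1) %in% predictor(context)
  }, logical(1))
  # Sentences that were skipped (NA) do not count towards the average.
  mean(hits, na.rm = TRUE)
}

# Stand-in predictor (hypothetical): always suggests the same three words.
toy_predictor <- function(text) c("the", "to", "and")

test_sentences <- c("give it to the", "see you soon", "me and", "hello")
acc <- last_word_accuracy(test_sentences, toy_predictor)
```

In this toy run two of the three usable sentences end in a suggested word, so `acc` is 2/3. The one-word sentence is skipped because it offers no context.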

how it works

[Figure: textbox]

[Figure: result]

As the user types in the text box, three suggestions are provided above

Clicking on a suggestion adds it to the text

If no suggestion can be found, an empty button is shown

[Figure: empty]

Test it yourself at ShinyApps!