Next word prediction

Coursera JHU Data Science specialisation

Capstone Project: a Shiny application that predicts the next word from the previous words entered by the user. The idea is based on the SwiftKey keyboard for tablets and smartphones.

Visit my app here
For questions email me at mkrous@gmail.com

The application

The user enters some words in a text field and hits submit. The application returns the predicted next word.

screenshot

Building the model

Cleaned and concatenated three corpus files (Twitter, blogs, news; total size: 700 MB)
Generated data tables for 1-grams to 5-grams and their frequencies
Created an extra column for each n-gram: the frequency of the (n-1)-gram that results from removing the first word

Example 3-gram: “synthesis”, “of”, “names”, 2, 215
where 2 is the frequency of “synthesis”, “of”, “names”
and 215 is the sub-frequency, i.e. the frequency of the bigram “of”, “names”
Kept only n-grams with frequency >= 2
Among n-grams sharing the same first (n-1) words, kept only those with the maximum frequency (there can be more than one in case of a tie)
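
The table-building steps above can be sketched roughly as follows. This is a minimal Python illustration, not the R code used in the app: the function names `ngrams` and `build_table`, the dictionary layout, and the toy sentences are all assumptions made for the example. It covers n >= 2 only, since the pruning step applies to n-grams that share a context.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def build_table(sentences, n, min_freq=2):
    """Build a pruned n-gram table for n >= 2 (hypothetical layout).

    Result maps each (n-1)-word context to a list of
    (last_word, frequency, sub_frequency) rows.
    """
    counts = Counter(g for s in sentences for g in ngrams(s, n))
    # Frequencies of the (n-1)-grams, needed for the sub-frequency column.
    lower = Counter(g for s in sentences for g in ngrams(s, n - 1))

    rows_by_context = {}
    for gram, freq in counts.items():
        if freq < min_freq:
            continue  # keep only n-grams with frequency >= min_freq
        context, last = gram[:-1], gram[-1]
        sub_freq = lower[gram[1:]]  # n-gram with its first word removed
        rows_by_context.setdefault(context, []).append((last, freq, sub_freq))

    # For each context keep only the row(s) with the maximum frequency;
    # several rows survive when there is a tie.
    pruned = {}
    for context, rows in rows_by_context.items():
        best = max(freq for _, freq, _ in rows)
        pruned[context] = [r for r in rows if r[1] == best]
    return pruned

# Toy corpus (invented for illustration only).
sentences = [
    ["synthesis", "of", "names"],
    ["synthesis", "of", "names"],
    ["list", "of", "names"],
    ["synthesis", "of", "data"],
]
trigram_table = build_table(sentences, 3)
```

In this toy corpus only the trigram “synthesis”, “of”, “names” survives pruning, with frequency 2 and a sub-frequency of 3 for the bigram “of”, “names”.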

Algorithm

Use at most the last 4 words entered
For an input of k words, use the (k+1)-gram data table to look for matches
If no matches are found, remove the first word and use the data table one order lower. Repeat until a data table with one or more matches is found.
If there is a single match, return the last word of the matched row
If there is a tie (e.g. three 4-grams with the same frequency of 15), check the sub-frequency column, i.e. the frequency of the trigram embedded in each 4-gram
If that breaks the tie, return the last word of the n-gram with the highest sub-frequency
If the sub-frequencies are also tied, look up the last word of each matched row in the unigram data table and return the word with the highest frequency
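
The lookup described above can be sketched as below. This is a hedged Python illustration rather than the app's actual R code: `predict`, the `tables` dictionary (n mapped to a context table of `(last_word, frequency, sub_frequency)` rows), and `unigram_freq` are hypothetical names, and the sample data is invented. Falling back to the most frequent unigram when no context matches is an assumption about how the final back-off step behaves.

```python
def predict(words, tables, unigram_freq, max_context=4):
    """Back off from the longest usable context to shorter ones.

    tables[n] maps an (n-1)-word context tuple to rows of
    (last_word, frequency, sub_frequency). Rows for one context are
    assumed to be already pruned to the maximum frequency.
    """
    context = tuple(words[-max_context:])  # use at most the last 4 words
    while context:
        rows = tables.get(len(context) + 1, {}).get(context, [])
        if rows:
            # First tie-break: highest sub-frequency.
            best_sub = max(sub for _, _, sub in rows)
            tied = [word for word, _, sub in rows if sub == best_sub]
            # Second tie-break: highest unigram frequency of the last word.
            return max(tied, key=lambda w: unigram_freq.get(w, 0))
        context = context[1:]  # no match: drop the first word and retry
    # Assumption: with no context left, return the most frequent unigram.
    if unigram_freq:
        return max(unigram_freq, key=unigram_freq.get)
    return None

# Invented sample tables for illustration only.
tables = {
    3: {("a", "b"): [("x", 5, 10), ("y", 5, 10), ("z", 5, 7)]},
    2: {("b",): [("q", 4, 1)]},
}
unigram_freq = {"x": 3, "y": 8, "z": 100, "q": 1, "the": 500}

first = predict(["a", "b"], tables, unigram_freq)   # tie resolved via unigrams
second = predict(["c", "b"], tables, unigram_freq)  # backs off to the bigram table
```

Because a single matched row trivially wins both tie-breaks, the single-match and tie cases share one code path here.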

5 Extensions

Use a more advanced model, such as Kneser-Ney smoothing, Good-Turing smoothing, or linear interpolation
Use part of sentence information
Let the user type the first letters of the next word
Optimise loading speed
Better handling of profanity (currently I just return “bleeeep”)