Next word prediction

Coursera JHU Data Science specialisation

Capstone Project: a Shiny application that predicts the next word based on the previous words entered by the user. The idea is based on the SwiftKey keyboard for tablets and smartphones.

Visit my app here
For questions email me at mkrous@gmail.com

The application

The user enters some words in a text field and hits submit. The application returns the predicted next word.

screenshot

Building the model

  • Cleaned and concatenated three corpus files (Twitter, blogs, news; total size: 700 MB)
  • Generated data tables for 1-grams to 5-grams and their frequencies
  • Created an extra column for each n-gram: the frequency of the (n-1)-gram that remains when the first word is removed (the "sub-frequency")

    Example 3-gram: “synthesis”, “of”, “names”, 2, 215
    where 2 is the frequency of “synthesis”, “of”, “names”
    and 215 is the sub-frequency, i.e. the frequency of the bigram “of”, “names”

  • Kept only n-grams with frequency >= 2

  • Among n-grams that share the same first (n-1) words, kept only those with the maximum frequency (there can be more than one in case of a tie)
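
The steps above can be sketched in a few lines. This is only an illustration in Python, not the code behind the app (which is a Shiny application); the function names `ngram_counts` and `build_table`, the dictionary layout, and the toy token list are all assumptions made for the example, and cleaning/tokenisation is assumed to have happened already.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def build_table(tokens, n, min_freq=2):
    """Build a pruned lookup table for n-grams (n >= 2).

    Returns {prefix: [(last_word, freq, sub_freq), ...]} where prefix is the
    first (n-1) words and sub_freq is the frequency of the (n-1)-gram left
    after dropping the first word. Layout is hypothetical, for illustration.
    """
    counts = ngram_counts(tokens, n)
    lower = ngram_counts(tokens, n - 1)
    table = {}
    for gram, freq in counts.items():
        if freq < min_freq:          # keep only n-grams with frequency >= 2
            continue
        prefix, last = gram[:-1], gram[-1]
        row = (last, freq, lower[gram[1:]])
        rows = table.get(prefix)
        if rows is None or freq > rows[0][1]:
            table[prefix] = [row]    # new maximum frequency for this prefix
        elif freq == rows[0][1]:
            rows.append(row)         # tie: keep every row with the maximum
    return table

# Toy corpus (made up); a real run would use the cleaned 700 MB corpus.
tokens = "a b c a b c a b d".split()
trigrams = build_table(tokens, 3)
# trigrams[("c", "a")] == [("b", 2, 3)]: "c a b" occurs twice, "a b" three times
```

Unigram frequencies can be taken directly from `ngram_counts(tokens, 1)`, since unigrams have no prefix to group on.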

Algorithm

  • Use at most the last 4 words entered
  • For an input of k words, use the (k+1)-gram data table to look for matches
  • If no matches are found, remove the first word and use the data table one order lower. Repeat until a data table with one or more matches is found.
  • If there is a single match, return the last word of the matched row
  • If there is a tie (e.g. three 4-grams with the same frequency of 15), I check the sub-frequency column of the tied rows (e.g. the frequency of the trigram embedded in each 4-gram)
  • If this breaks the tie, return the last word of the n-gram with the highest sub-frequency
  • If the sub-frequency column also gives a tie, I look up the last word of each matched row in the unigram data table and return the word with the highest frequency
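
As a rough illustration of the back-off and tie-breaking steps above, here is a minimal Python sketch. It is not the app's actual code: the function name `predict_next`, the table layout (`{n: {prefix: [(last_word, freq, sub_freq), ...]}}`), and all the numbers in the toy data are assumptions. It also assumes each table has already been pruned so that all rows for a prefix share the same maximum frequency, and it returns `None` when nothing matches, a case the description above does not cover.

```python
def predict_next(words, tables, unigram_freq, max_context=4):
    """Return the predicted next word, or None if no table has a match."""
    context = tuple(words[-max_context:])        # use at most the last 4 words
    while context:
        # k context words are looked up in the (k+1)-gram table
        rows = tables.get(len(context) + 1, {}).get(context)
        if rows:
            # Rows already tie on frequency, so compare sub-frequency first
            best_sub = max(sub for _, _, sub in rows)
            tied = [word for word, _, sub in rows if sub == best_sub]
            # Still tied: fall back to the unigram frequency of the last word
            return max(tied, key=lambda w: unigram_freq.get(w, 0))
        context = context[1:]                    # drop the first word, back off
    return None

# Toy tables with made-up counts, for illustration only.
tables = {
    3: {("of", "the"): [("world", 15, 40), ("day", 15, 40), ("year", 15, 30)]},
    2: {("the",): [("same", 50, 80)]},
}
unigram_freq = {"world": 100, "day": 200, "year": 500, "same": 80}

print(predict_next(["end", "of", "the"], tables, unigram_freq))  # day
print(predict_next(["in", "the"], tables, unigram_freq))         # same
```

In the first call no 4-gram matches, so the lookup backs off to the trigram table; "world" and "day" tie on sub-frequency (40), and "day" wins on unigram frequency. A single matching row falls through the same code path and is returned directly.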

5 Extensions

  • Use a more advanced model such as Kneser-Ney smoothing, Good-Turing smoothing, or linear interpolation
  • Use part of sentence information
  • Let the user type the first letters of the next word
  • Optimise loading speed
  • Better handling of profanity (for now I just return “bleeeep”)