Coursera Capstone

Raghu
1/24/2015

Next Word Prediction

  • This Capstone project was carried out in collaboration with SwiftKey.
  • An algorithm that predicts the next likely word while text is typed into an input field.

Thank you for visiting

Description of the Algorithm

  • Three large US English text corpora (blogs, news, and Twitter) are used for the exploratory analysis.
  • A subset of 5,000 lines from each corpus is analyzed and cleaned up (punctuation, stop words, etc. removed).
Description                        Blogs       News        Twitter
Total number of lines              899288      1010242     2360148
Total words                        37334131    34372530    30373543
File size (MB)                     200.42      196.28      159.36
Sample number of lines             5000        5000        5000
Sample word count                  205555      63747       170940
Sample word count (after cleanup)  104347      35947       96239
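
The sampling and cleanup behind the sample word counts can be sketched roughly as follows. This is a minimal Python illustration, not the code used for the project: the function names, the random seed, and the tiny stop-word list are all assumptions, and the slides do not say how the 5,000 lines were chosen.

```python
import random
import re

# Tiny illustrative stop-word list (an assumption; the original list is not given).
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "it"}

def sample_lines(lines, k, seed=0):
    """Return k lines chosen at random without replacement (all lines if k or fewer)."""
    rng = random.Random(seed)
    return list(lines) if len(lines) <= k else rng.sample(lines, k)

def clean_line(line):
    """Lowercase, replace punctuation and digits with spaces, then drop stop words."""
    text = re.sub(r"[^a-z\s]", " ", line.lower())
    return [w for w in text.split() if w not in STOP_WORDS]

def word_counts(lines):
    """Return (raw word count, word count after cleanup) for a list of lines."""
    raw = sum(len(line.split()) for line in lines)
    cleaned = sum(len(clean_line(line)) for line in lines)
    return raw, cleaned
```

Calling word_counts(sample_lines(corpus_lines, 5000)) on each corpus would give one pair of before and after counts per source, in the spirit of the last two table rows; corpus_lines here is a hypothetical list of lines read from one file.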

Click here for the analysis details.

Application details

  1. Clean the given phrase (remove stop words, punctuation, numbers, etc.).
  2. From quad-gram down to uni-gram, the following steps are performed:
    • A regex is built from the last words of the phrase and used to search the table of n-grams.
    • The matching n-gram with the highest occurrence count is returned.
    • If no match is found, the next lower-order n-gram is considered.
  3. If the phrase is not found in the quad-, tri-, or bi-gram tables, the highest-frequency word in the uni-gram table is returned.
  4. The n-gram options let you view the frequency plots.
  5. Last but not least, a Name input box is provided to convey thanks properly.
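
The back-off lookup in steps 2 and 3 can be sketched as below. This is a simplified Python illustration under stated assumptions rather than the application's actual code: build_ngram_tables and predict_next are hypothetical names, the n-gram tables are plain counters keyed by space-joined words, and the uni-gram table is assumed to be non-empty.

```python
import re
from collections import Counter

def build_ngram_tables(tokens, max_n=4):
    """Count n-grams of order 1..max_n; keys are space-joined word strings."""
    tables = {}
    for n in range(1, max_n + 1):
        tables[n] = Counter(
            " ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
        )
    return tables

def predict_next(phrase_tokens, tables, max_n=4):
    """Back off from the highest-order table down to the uni-gram table."""
    for n in range(max_n, 1, -1):
        context = phrase_tokens[-(n - 1):]
        if len(context) < n - 1:
            continue  # phrase is too short for this n-gram order
        # Regex: the context words, one space, then exactly one more word.
        pattern = re.compile("^" + re.escape(" ".join(context)) + r" (\S+)$")
        best_word, best_count = None, 0
        for ngram, count in tables[n].items():
            match = pattern.match(ngram)
            if match and count > best_count:
                best_word, best_count = match.group(1), count
        if best_word is not None:
            return best_word
    # Nothing matched in the quad-, tri-, or bi-gram tables:
    # fall back to the most frequent single word.
    return tables[1].most_common(1)[0][0]
```

The input to predict_next is assumed to be the phrase after the cleanup in step 1, split into words. Scanning every n-gram with a regex keeps the sketch close to the description above; a dictionary keyed by the context words would be a faster alternative.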

Application usage

  1. Click here for the application
  2. Enter the phrase you want to complete
  3. Click the [Go] button.
  4. The predicted word shows up in the main panel (right side).
  5. Choose an n-gram radio option to see the frequency plot.

Application Preview

Next Word Prediction