Michael J. Pfammatter
2015-08-23
Johns Hopkins University
Data Science Coursera Specialization Track
Next word text prediction is important because it can:
This is becoming more and more important as mobile devices get smaller and rely solely on on-screen keyboards, which can make entering text slow and error-prone.
This proof-of-concept app uses the Kneser-Ney smoothing algorithm to predict the three words the user is most likely to need next and provides an easy way to add the correct word to their text.
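Kneser-Ney smoothing is a method of text prediction that uses absolute discounting to predict the next most likely word: a fixed discount \(\delta\) is subtracted from each observed n-gram count, and the probability mass this frees up is given to a lower-order estimate.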
\[ P_{abs}(w_{i} \mid w_{i-1}) = \frac{\max(c(w_{i-1}w_{i}) - \delta, 0)}{\sum_{w'} c(w_{i-1}w')} + \alpha P_{abs}(w_{i}) \]
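As a rough illustration of the formula above, the following Python sketch computes absolute-discounted bigram probabilities on a toy sentence. Python is used for illustration only. The function name `absolute_discount_prob`, the discount of 0.75, the choice of \(\alpha\) as the mass removed by discounting, and the use of a plain unigram probability for the lower-order term are all assumptions made for this example, not details taken from the app.

```python
from collections import Counter


def absolute_discount_prob(word, prev, bigrams, unigrams, delta=0.75):
    """Estimate P_abs(word | prev) with a fixed discount `delta` (assumed value)."""
    # Denominator: sum over w' of c(prev w')
    context_total = sum(c for (p, _), c in bigrams.items() if p == prev)
    # Lower-order term: a plain unigram probability (an assumption for this sketch)
    unigram_prob = unigrams.get(word, 0) / sum(unigrams.values())
    if context_total == 0:
        # Unseen context: nothing to discount, so use the unigram estimate alone
        return unigram_prob
    discounted = max(bigrams.get((prev, word), 0) - delta, 0) / context_total
    # alpha: the probability mass removed by discounting, which is
    # delta * (number of distinct words seen after prev) / context_total
    followers = sum(1 for (p, _) in bigrams if p == prev)
    alpha = delta * followers / context_total
    return discounted + alpha * unigram_prob


# Toy corpus, invented for the example
tokens = "the cat sat on the mat the cat ran".split()
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))

for candidate in ("cat", "mat", "ran"):
    print(candidate, absolute_discount_prob(candidate, "the", bigrams, unigrams))
```

With this choice of \(\alpha\), the probabilities of all vocabulary words after a given context sum to one, so the discount only moves mass from seen bigrams to the lower-order estimate.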
The common example of this is “San Francisco”. If we were predicting by word count alone, “Francisco” might have a high probability because it appears often in writing. Kneser-Ney smoothing recognizes that it is frequent only when following the word “San” and therefore lowers its probability after other words.
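The “San Francisco” effect can be sketched by counting how many distinct words precede a candidate rather than how often the candidate occurs. The Python snippet below is a minimal illustration on an invented toy text; the helper name `continuation_prob` and the comparison word “glasses” are assumptions chosen for the example.

```python
from collections import Counter


def continuation_prob(word, bigrams):
    """Share of distinct bigram types that end in `word`."""
    distinct_preceders = sum(1 for (_, w) in bigrams if w == word)
    return distinct_preceders / len(bigrams)


# Toy text, invented for the example: both words occur three times,
# but "francisco" only ever follows "san".
tokens = (
    "san francisco is foggy . i love san francisco . "
    "san francisco has hills . the glasses are new . "
    "my glasses broke . reading glasses help ."
).split()

unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))

for candidate in ("francisco", "glasses"):
    print(candidate, unigrams[candidate], continuation_prob(candidate, bigrams))
```

Both words have the same raw count here, yet “glasses” follows three different words while “francisco” follows only one, so a continuation-based score ranks “glasses” higher as a guess in an unfamiliar context.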
Visit the What's up next? app located at https://gnomesoup.shinyapps.io/shiny
The app searches a SQLite database of words and their frequency counts from a corpus of texts collected from blogs, Twitter, and news articles. These frequency counts are used to calculate the Kneser-Ney smoothing prediction. The words that score highest on a set of tetra-grams (sets of four words) are then returned. If the tetra-gram cannot be found, we “back off” to a tri-gram (set of three words) search. Finally, if no tri-gram exists, we search for the best match among bi-grams (sets of two words).
As you type in the open text box, the app will provide suggestions for the next word. If a presented word is the one you are looking for, click it and watch the word get added to your text.
Once you are finished typing, click “Select all text” and press Ctrl-C (Windows) or Cmd-C (Mac) to copy the text to the clipboard. If you prefer, click “Save as text file” to download a “.txt” version of your words.
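The back-off search described above can be sketched as follows. This is a minimal Python illustration using the standard `sqlite3` module and an in-memory database; the table name `ngrams`, its columns, the sample rows, and the scores are hypothetical and are not taken from the app's actual database.

```python
import sqlite3


def build_demo_db():
    """Create a tiny in-memory database with a hypothetical n-gram table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE ngrams (n INTEGER, context TEXT, word TEXT, score REAL)"
    )
    rows = [  # invented rows and scores, for illustration only
        (4, "thanks for the", "help", 0.6),
        (4, "thanks for the", "follow", 0.3),
        (3, "for the", "first", 0.4),
        (3, "for the", "help", 0.2),
        (2, "the", "same", 0.10),
        (2, "the", "first", 0.09),
        (2, "the", "best", 0.08),
    ]
    conn.executemany("INSERT INTO ngrams VALUES (?, ?, ?, ?)", rows)
    return conn


def predict(conn, text, k=3):
    """Return up to k suggestions, backing off from 4-grams to 2-grams."""
    words = text.lower().split()
    for n in (4, 3, 2):
        if len(words) < n - 1:
            continue  # not enough typed words to form this context
        context = " ".join(words[-(n - 1):])
        found = conn.execute(
            "SELECT word FROM ngrams WHERE n = ? AND context = ? "
            "ORDER BY score DESC LIMIT ?",
            (n, context, k),
        ).fetchall()
        if found:
            return [w for (w,) in found]
    return []  # no match at any order


conn = build_demo_db()
print(predict(conn, "Thanks for the"))
print(predict(conn, "wait for the"))
print(predict(conn, "see the"))
```

The first call is answered from the four-word table, the second falls back to the three-word context “for the”, and the third falls back to the two-word level because neither longer context is stored.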
Go ahead! Start using the app
Here is a list of resources used to create this app.
Corpus data was collected from the HC Corpora Collection
Natural language processing information was gathered from
The app was written in the R language using the RStudio environment and the dplyr, knitr, readr, shiny, shinyjs and stringr packages.