Lori Ziegelmeier
April 26, 2015
A project for the Coursera Data Science Specialization
Goal of This Project: Develop a predictive text application that predicts the next word in a phrase.
Why?: Millions of smartphone and tablet users around the world input text on small devices. With 'fat fingers', inputting text can be cumbersome, so methods that speed up typing, such as predictive text applications, are warranted.
In Fact: Entire companies, such as our corporate partner SwiftKey, have been formed with this purpose in mind.
The foundation for this application is a corpus of English text drawn from 3 sources: news feeds, blog posts, and Twitter messages.
A random 15% sample of the entries from each source forms a subcorpus of over 640,000 lines of text, which is then cleaned and tokenized.
A table of \( n \)-grams (phrases consisting of \( n \) consecutive words appearing in our corpus) is constructed for \( n=1,\ldots,5 \).
Probabilities based on frequency counts are recorded with each \( n \)-gram, and each table is sorted by decreasing probability.
Together, the \( n \)-gram tables form a database which is loaded into the app. The app itself only performs look-ups, which keeps prediction fast.
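As a rough illustration of the preprocessing described above, here is a minimal Python sketch. It is not the project's actual code: the function names (`sample_lines`, `tokenize`, `build_tables`), the letters-and-apostrophes cleaning rule, and the choice of conditional relative frequency as the "probability" are all assumptions made for the example.

```python
import random
import re
from collections import Counter, defaultdict


def sample_lines(lines, fraction=0.15, seed=0):
    """Draw a random sample holding `fraction` of the lines (hypothetical helper)."""
    lines = list(lines)
    rng = random.Random(seed)
    return rng.sample(lines, round(len(lines) * fraction))


def tokenize(line):
    """Lowercase a line and keep runs of letters and apostrophes (assumed cleaning rule)."""
    return re.findall(r"[a-z']+", line.lower())


def build_tables(lines, max_n=5):
    """Return {n: {context: [(next_word, probability), ...]}} for n = 1..max_n.

    `context` is a tuple of the n-1 words preceding `next_word` (empty for n = 1).
    Probability is assumed to be count(context + word) / count(context + any word).
    Each candidate list is sorted by decreasing probability.
    """
    counts = {n: defaultdict(Counter) for n in range(1, max_n + 1)}
    for line in lines:
        tokens = tokenize(line)
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                context = tuple(tokens[i:i + n - 1])
                counts[n][context][tokens[i + n - 1]] += 1

    tables = {}
    for n, by_context in counts.items():
        tables[n] = {}
        for context, counter in by_context.items():
            total = sum(counter.values())
            # most_common() yields words in order of decreasing count
            tables[n][context] = [(w, c / total) for w, c in counter.most_common()]
    return tables


if __name__ == "__main__":
    corpus = ["I love data", "I love cake", "I love data science"]
    tables = build_tables(corpus, max_n=3)
    print(tables[3][("i", "love")])
```

Because every table is computed once, ahead of time, the app would only need to load the finished dictionary and index into it.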
The user inputs a phrase. The app cleans and tokenizes the input phrase.
Only the last four words in the phrase are used to predict the next word, since the longest entries in the database are 5-grams.
These words are matched against existing phrases in the appropriate \( n \)-gram table. If at least one match exists, the predicted word is the one that follows the matched phrase with the highest probability.
If no match exists, a backoff model is employed, searching through the \( (n-1) \)-gram table, the \( (n-2) \)-gram table, and so on, until a match is found. If no match is found at all, the application predicts the most common word in the English language, “the”.
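The look-up and backoff steps above can be sketched in a few lines of Python. This is an illustrative sketch under assumptions, not the app's implementation: the `predict` function, the table layout (`{n: {context_tuple: [(word, probability), ...]}}` with candidates already sorted by decreasing probability), and the toy probabilities are all hypothetical.

```python
import re


def predict(phrase, tables, max_n=5, top_k=3):
    """Return up to `top_k` predicted next words using simple backoff.

    `tables` is assumed to map n -> {context tuple of n-1 words -> sorted candidates}.
    """
    tokens = re.findall(r"[a-z']+", phrase.lower())  # assumed cleaning rule
    # Keep at most the last max_n - 1 words (four words for 5-grams).
    context = tokens[-(max_n - 1):] if max_n > 1 else []

    while context:
        # A context of k words is looked up in the (k+1)-gram table.
        candidates = tables.get(len(context) + 1, {}).get(tuple(context))
        if candidates:
            return [word for word, _ in candidates[:top_k]]
        context = context[1:]  # back off: drop the earliest word

    return ["the"]  # no match in any table


if __name__ == "__main__":
    # Toy tables with made-up probabilities, for illustration only.
    toy_tables = {
        3: {("i", "love"): [("you", 0.5), ("it", 0.3), ("the", 0.2)]},
        2: {("love",): [("you", 0.6), ("is", 0.4)]},
    }
    print(predict("So I love", toy_tables))      # matched in the 3-gram table
    print(predict("I really love", toy_tables))  # backs off to the 2-gram table
    print(predict("xyzzy", toy_tables))          # falls through to "the"
```

Dropping the earliest word at each step means the most recent words, which usually carry the most information about what comes next, are tried in every table before the default is returned.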
Consider the two examples at right:
In each case, a reasonable prediction was output, and in fact, the top three predicted words were also displayed.
Now, you can try it!
Just go here to enjoy predicting the next word.