Jason V
April 23, 2016
A capstone project for the Coursera/Johns Hopkins data science specialization
Text prediction is an important subject within data science with many uses. Most people encounter it daily in user interfaces such as phones and search engines.
The purpose of this project is to develop a text prediction algorithm and deploy it as an interactive application. Users enter a phrase and the application predicts the next word.
A key challenge was striking the right balance between prediction accuracy and application responsiveness while factoring in resource constraints.
The training source was the HC Corpora dataset, which contains millions of lines of text from Twitter feeds, blogs, and news. To prepare the data, several steps were taken:
The prediction algorithm relies on 'stupid backoff', chosen because it is computationally efficient while retaining reasonable accuracy. Steps taken:
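As a rough illustration of the general technique (not the project's actual code), stupid backoff can be sketched in Python as follows. A candidate word is scored by its relative frequency after the longest matching context; when a word was never seen after that context, the score from the next shorter context is used instead, multiplied by a fixed penalty. The function names `build_counts` and `predict_next`, the trigram limit, and the penalty of 0.4 are assumptions made for this example.

```python
from collections import Counter

ALPHA = 0.4  # assumed backoff penalty; 0.4 is the commonly cited value


def build_counts(sentences, max_n=3):
    """Count every n-gram of length 1..max_n, keyed by a tuple of words."""
    counts = Counter()
    for sentence in sentences:
        words = sentence.lower().split()
        for n in range(1, max_n + 1):
            for i in range(len(words) - n + 1):
                counts[tuple(words[i:i + n])] += 1
    return counts


def predict_next(phrase, counts, max_n=3):
    """Return the highest-scoring next word for the phrase, or None."""
    words = phrase.lower().split()
    total_unigrams = sum(c for gram, c in counts.items() if len(gram) == 1)
    scores = {}
    penalty = 1.0
    # Start with the longest usable context, then back off to shorter ones.
    for ctx_len in range(min(max_n - 1, len(words)), -1, -1):
        context = tuple(words[len(words) - ctx_len:])
        denom = counts[context] if ctx_len else total_unigrams
        # A linear scan keeps the sketch short; a real application would
        # index n-grams by context to stay responsive.
        for gram, c in counts.items():
            if len(gram) == ctx_len + 1 and gram[:ctx_len] == context:
                # Keep the score from the longest context that saw the word.
                scores.setdefault(gram[-1], penalty * c / denom)
        penalty *= ALPHA
    return max(scores, key=scores.get) if scores else None


corpus = [
    "i love new york",
    "i love new york",
    "i love new jersey",
    "new york is big",
]
counts = build_counts(corpus)
print(predict_next("i love new", counts))  # york
```

Because no smoothing or normalization is performed, the scores are not true probabilities, which is part of what keeps the method cheap to compute.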
The result, playfully called the 'Confabulator', can be found here:
One design choice made early on was that the application should be as responsive as possible. The goal appears to have been achieved: the application is fast enough that at times it returns a prediction while the user is still typing their word or phrase.